Structured outputs with OpenAI and Pydantic
Marcin Tabisz
Large language models are remarkably capable at reading and understanding documents. Take a photo of a receipt or a scanned invoice, and it will tell you everything on it - the vendor name, the line items, the total, the tax. The problem is that it will tell you in prose. And prose, however accurate, is not something you can easily load into a database, compare against a ground truth label, or use to compute an F1 score.
This was exactly the challenge we ran into in the FewTuRe research project, which dealt with few-shot fine-tuning for information extraction from receipts and invoices. We needed the model to extract structured information from documents, but we also needed that information to reliably arrive in a machine-readable format that matched our ground truth data schema exactly. A response that put the total under "total_amount" one time and "total" the next time was just as useless to our evaluation pipeline as a wrong answer. Consistency wasn't a nice-to-have. It was a hard requirement. That experience made one thing very clear to us: getting the right answer is only half the problem. Getting it in the right shape is the other half.
This post is about solving that second half. We'll walk through how OpenAI's structured outputs feature, combined with Pydantic and the OpenAI Python SDK, gives you a reliable and elegant way to enforce exactly the output format your application needs.
To understand why structured outputs matter, it helps to think about what LLMs actually produce by default: a stream of tokens that form natural language text. Even when you prompt a model to "return a JSON object", you are essentially asking it nicely. There is no guarantee.
In practice, this means you get responses like:
- Valid JSON - great, but don't count on it every time
- JSON wrapped in a markdown code block that you have to strip out first
- JSON with hallucinated or inconsistently named keys
- A helpful sentence before the JSON that breaks your parser
- Subtly wrong types, such as a number returned as a string or a list returned as a comma-separated string
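In practice, these failure modes get papered over with defensive glue code. A minimal sketch of the kind of cleanup you end up writing (purely illustrative, not a recommended pattern):

```python
import json
import re

def parse_llm_json(raw: str) -> dict:
    """Best-effort cleanup of an LLM reply that should contain JSON."""
    # Strip a markdown code fence if the model added one
    match = re.search(r"```(?:json)?\s*(.*?)```", raw, re.DOTALL)
    if match:
        raw = match.group(1)
    # Drop any chatty preamble before the first brace
    start = raw.find("{")
    if start == -1:
        raise ValueError("no JSON object found in reply")
    return json.loads(raw[start:])

reply = 'Sure! Here is the JSON:\n```json\n{"total": 12.5}\n```'
print(parse_llm_json(reply))  # {'total': 12.5}
```

Code like this handles the cases you have seen so far, and silently breaks on the ones you have not.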
For a prototype or a demo, this is manageable. You write a bit of post-processing logic, catch the edge cases, and move on. But for a production pipeline, especially one where the output feeds directly into a training loop or an evaluation framework, this kind of fragility is a serious problem.
In our information extraction project, our ground truth data had a well-defined schema. Every receipt and invoice was labeled with the same set of fields, the same types, the same structure. For the model's output to be comparable to that ground truth, it had to match that schema precisely. We couldn't afford a parser that worked 95% of the time. We needed something we could actually depend on. That is where structured outputs come in.
What are structured outputs?
Structured outputs refer to the ability to constrain an LLM's response to conform to a predefined schema, not by asking it to, but by enforcing it at the generation level. Rather than hoping the model returns valid JSON, the API guarantees it, by using the schema you provide to guide the token generation process itself. This is a meaningful shift. It moves schema compliance from being a prompt engineering problem to being an infrastructure guarantee.
The schema you define describes exactly what the output should look like: which fields are present, what types they are, whether they are required or optional, and how nested objects should be structured. The model fills in that schema with the information it extracts and what comes back is valid, consistent, and ready to be parsed into a typed Python object.
This naturally leads us to the tools that make this possible in Python: Pydantic for defining and validating the schema, and the OpenAI Python SDK for enforcing it at inference time. Let's look at both.
Pydantic
Pydantic is a Python library for data validation and settings management using type annotations. At its core, it lets you describe the shape and types of your data as a Python class, and then validates that any data you pass in actually conforms to that description. If it does not, Pydantic raises a clear, descriptive error rather than letting bad data silently propagate through your system.
It has become something of a standard building block in modern Python applications. You will find it powering FastAPI's request and response validation, configuration management in ML frameworks, and increasingly, schema definitions for LLM outputs. The reason it is so widely adopted comes down to a few things:
- You define your data models as classes using standard Python type hints, which makes them accessible and easy to use.
- It is strict where it needs to be: types are enforced, coerced where possible, and violations surface immediately with useful error messages.
- It is composable: models can be nested inside other models, making it straightforward to represent complex, hierarchical data structures.
- It generates JSON Schema automatically.

The last point, as we will see, is exactly what makes Pydantic so powerful in combination with the OpenAI SDK.
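That last point deserves a concrete look. With any model class (the two-field Item below is just an illustration), a full JSON Schema is one method call away:

```python
from pydantic import BaseModel

class Item(BaseModel):
    name: str
    price: float

schema = Item.model_json_schema()
print(schema)
# {'properties': {'name': {'title': 'Name', 'type': 'string'},
#                 'price': {'title': 'Price', 'type': 'number'}},
#  'required': ['name', 'price'], 'title': 'Item', 'type': 'object'}
```

The class definition is the single source of truth: the schema, the validation logic, and the typed Python object all derive from it.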
Defining a model
The foundation of Pydantic is the BaseModel class. You define your data structure by subclassing it and declaring fields as class attributes with type annotations:
from pydantic import BaseModel
from typing import Optional

class Vendor(BaseModel):
    name: str
    address: Optional[str] = None

class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: float
    total: float

class Receipt(BaseModel):
    vendor: Vendor
    date: Optional[str] = None
    line_items: list[LineItem] = []
    subtotal: Optional[float] = None
    tax: Optional[float] = None
    total: float
Notice how we can simply nest one model inside another to create a hierarchical record for the receipt. At first glance, this might look similar to dataclasses, but Pydantic goes a step further: every field is validated against its declared type at instantiation. If you try to create a Receipt object whose total is a non-numeric string such as "twelve euros", you get an error like this:
ValidationError: 1 validation error for Receipt
total
Input should be a valid number, unable to parse string as a number
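A minimal sketch reproducing that behaviour with a pared-down LineItem (same fields as in the model above):

```python
from pydantic import BaseModel, ValidationError

class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: float
    total: float

# In its default (lax) mode, Pydantic coerces compatible values,
# e.g. numeric strings become real numbers:
item = LineItem(description="Coffee", quantity="2", unit_price="3.50", total=7.0)
assert item.quantity == 2 and isinstance(item.quantity, int)

# A value that cannot be coerced fails loudly instead of propagating:
try:
    LineItem(description="Coffee", quantity=2, unit_price=3.5, total="seven euros")
    failed = False
except ValidationError as err:
    failed = True
    print(err.errors()[0]["msg"])  # "Input should be a valid number, ..."

assert failed
```

This is exactly the behaviour you want at the boundary between an LLM and the rest of your pipeline: bad data stops at the door.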
In the definition of Receipt, you might also have noticed the use of Optional and default values. These are particularly useful for fields that do not always appear in an extraction. If the model doesn't find a date or any tax information on the receipt, the field is simply left as None. Likewise, if no line items are detected, the line_items field is just an empty list. All of this keeps the extraction clean and manageable for evaluation.
OpenAI's structured outputs
OpenAI introduced Structured Outputs in August 2024, available on gpt-4o-mini and gpt-4o starting with the 2024-08-06 snapshot. The feature has since been extended to newer models in the GPT-4o and o-series family.
When you provide a JSON schema to the API alongside your request, OpenAI uses a technique called constrained decoding. At each step of token generation, the model's output distribution is filtered so that only tokens that are valid continuations of the schema are considered. The model still decides what to say, but the shape of what it says is enforced by the schema at every single token. The key word here is enforced - this is not prompt engineering, it is a constraint applied at the generation level.
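To build an intuition for constrained decoding, here is a deliberately tiny toy sketch. It is nothing like OpenAI's actual implementation; the "schema" is just a two-string language, but it shows the core move of masking tokens that would leave the grammar:

```python
# Valid outputs under our toy "schema"
VALID = ['{"ok": true}', '{"ok": false}']
# The model's (tiny, hypothetical) token vocabulary
VOCAB = ['{"ok": ', 'true', 'false', 'maybe', '}', 'Sure, here is']

def allowed(prefix: str) -> list[str]:
    """Tokens whose addition keeps the output a prefix of some valid string."""
    return [t for t in VOCAB if any(v.startswith(prefix + t) for v in VALID)]

assert allowed("") == ['{"ok": ']               # chatty preambles are masked out
assert allowed('{"ok": ') == ["true", "false"]  # only schema-valid values survive
assert allowed('{"ok": true') == ["}"]          # the object must be closed
```

At every step the model can only pick among the surviving tokens, so the finished output is valid by construction.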
The OpenAI Python SDK gives you two ways to work with structured outputs, and it is worth understanding both:
1. response_format with type: json_schema
This is the lower-level approach. You pass a JSON schema directly as part of the request, and the API returns a JSON string that you then parse yourself:
from openai import OpenAI
import json

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "user", "content": "Extract the receipt information from the following text: ..."}
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "Receipt",
            "schema": Receipt.model_json_schema(),
            "strict": True,
        },
    },
)

data = json.loads(response.choices[0].message.content)
receipt = Receipt(**data)
As you can see, even with Pydantic generating the schema via model_json_schema(), this approach leaves you to wire up the pieces yourself: attach the schema to the request, parse the returned JSON string, and re-validate it into a Receipt. That's exactly where the second approach comes in.
2. client.beta.chat.completions.parse()
This is the higher-level, recommended approach when working with Pydantic. You pass your Pydantic model class directly to the SDK, and it handles schema generation, JSON parsing, and model instantiation for you. What you get back is a fully validated Pydantic object, ready to use:
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

response = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "user", "content": "Extract the receipt information from the following text: ..."}
    ],
    response_format=Receipt,
)

receipt = response.choices[0].message.parsed
Notice that response.choices[0].message.parsed gives you a Receipt instance directly. No JSON parsing, no manual instantiation, no validation step — the SDK takes care of all of it.
Conclusion
Building a pipeline that extracts structured information from real-world documents is not just a modeling problem - it is a data engineering problem. Getting an LLM to correctly identify a vendor name or a line item total is one challenge. Getting it to return that information in a consistent, machine-readable format that your evaluation framework, training loop, and database can all reliably consume is another challenge entirely, and one that is easy to underestimate until it starts causing problems in production.
That was a significant lesson from our Few-Shot Fine-Tuning project. The model could read the documents. What we needed was a way to ensure that what it returned always looked the same: the same fields, the same types, the same structure, regardless of how varied or messy the input was. Structured outputs, combined with Pydantic and the OpenAI Python SDK, gave us exactly that.
In this post we covered how this works in practice: Pydantic as the schema definition layer, where you define your data structure once as a Python class, get validation for free, and automatically generate the JSON Schema the API needs; and OpenAI's structured outputs feature, where constrained decoding enforces that schema during generation, a fundamentally more reliable approach than prompt-based formatting.
The combination of these tools is one of the cleanest patterns available right now for building reliable LLM-powered data pipelines. It gives you a single source of truth for your data structure, eliminates an entire class of parsing failures, and makes evaluation and downstream processing dramatically simpler.