Structured outputs with OpenAI and Pydantic
Marcin Tabisz
Large language models are remarkably good at reading and understanding documents. Give one a photo of a receipt or a scanned invoice, and it will tell you everything on it - the vendor name, the line items, the total, the tax. The problem is that it will tell you in prose. And prose, however accurate, is not something you can easily load into a database, compare against a ground truth label, or use to compute an F1 score.

This was exactly the challenge we ran into in the FewTuRe research project, which dealt with few-shot fine-tuning for information extraction from receipts and invoices. We needed the model to extract structured information from documents, but we also needed that information to reliably arrive in a machine-readable format that matched our ground truth data schema exactly. A response that put the total under "total_amount" one time and "total" the next was just as useless to our evaluation pipeline as a wrong answer. Consistency wasn't a nice-to-have. It was a hard requirement.

That experience made one thing very clear to us: getting the right answer is only half the problem. Getting it in the right shape is the other half. This post is about solving that second half. We'll walk through how OpenAI's structured outputs feature, combined with Pydantic and the OpenAI Python SDK, gives you a reliable and elegant way to enforce exactly the output format your application needs.

To understand why structured outputs matter, it helps to think about what LLMs actually produce by default: a stream of tokens that form natural language text. Even when you prompt a model to "return a JSON object", you are essentially asking it nicely. There is no guarantee.
In practice, this means you get responses like:

- Valid JSON - great, but don't count on it every time
- JSON wrapped in a markdown code block that you have to strip out first
- JSON with hallucinated or inconsistently named keys
- A helpful sentence before the JSON that breaks your parser
- Subtly wrong types, such as a number returned as a string or a list returned as a comma-separated string

For a prototype or a demo, this is manageable. You write a bit of post-processing logic, catch the edge cases, and move on. But for a production pipeline, especially one where the output feeds directly into a training loop or an evaluation framework, this kind of fragility is a serious problem.

In our information extraction project, our ground truth data had a well-defined schema. Every receipt and invoice was labeled with the same set of fields, the same types, the same structure. For the model's output to be comparable to that ground truth, it had to match that schema precisely. We couldn't afford a parser that worked 95% of the time. We needed something we could actually depend on. That is where structured outputs come in.
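Before getting to the API itself, the schema-mismatch problem is easy to demonstrate with Pydantic alone. Here is a minimal sketch (assuming Pydantic v2; the field names are illustrative, not our actual label schema): a well-formed response parses straight into typed Python objects, while a response with a hallucinated key fails validation loudly instead of slipping silently into an evaluation pipeline.

```python
from pydantic import BaseModel, ValidationError

# Illustrative receipt schema - not our actual ground truth format.
class LineItem(BaseModel):
    description: str
    amount: float

class Receipt(BaseModel):
    vendor: str
    line_items: list[LineItem]
    total: float

# A well-formed response parses into typed Python objects.
good = Receipt.model_validate_json(
    '{"vendor": "ACME", '
    '"line_items": [{"description": "Paper", "amount": 4.99}], '
    '"total": 4.99}'
)

# A response that hallucinated "total_amount" instead of "total"
# raises a ValidationError - a loud failure we can catch and handle.
try:
    Receipt.model_validate_json(
        '{"vendor": "ACME", "line_items": [], "total_amount": 4.99}'
    )
    schema_mismatch = None
except ValidationError as err:
    schema_mismatch = err
```

That loud failure is exactly the behavior a pipeline needs: a malformed response becomes an exception at the boundary, not a silently incomparable record further downstream.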