Introduction – LLMs and Structured Outputs
Large language models (LLMs) such as OpenAI's GPT series are increasingly being used to build tools that understand and generate human-like text.
However, one major challenge is that these models often return unstructured text, which can be unpredictable and difficult to interpret. If you’re expecting clean, structured data, such as a JSON object with keys like ‘first_name’, ‘last_name’, ‘experience’, and ‘primary_skill’, you may find the model returning values in an unstructured form.
Here’s a basic example using OpenAI’s API without any validation to demonstrate this:
import os
from openai import OpenAI
client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),
)
response = client.responses.create(
    model="gpt-4o-mini-2024-07-18",
    instructions="Extract first name, last name, years of experience, and primary skill from the job applicant description.",
    input="Khuyen Tran is a data scientist with 5 years of experience, skilled in Python and machine learning.",
)
print(response.output_text)
This might output:
- **First Name:** Khuyen
- **Last Name:** Tran
- **Years of Experience:** 5 years
- **Primary Skill:** Python and machine learning
While this is readable to a human, it lacks a structured format like JSON, which makes it difficult to reliably extract the fields and use them in downstream applications.
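To see why free-form text like this is hard to work with, consider what a naive post-processing step might look like. The snippet below is purely illustrative: it assumes the model keeps exactly the bullet format shown above, which is not guaranteed from one call to the next:
# Illustrative only: brittle string parsing of the free-form output above.
# Any change in wording, ordering, or formatting breaks this logic.
raw_output = """- **First Name:** Khuyen
- **Last Name:** Tran
- **Years of Experience:** 5 years
- **Primary Skill:** Python and machine learning"""

parsed = {}
for line in raw_output.splitlines():
    # Assumes every line follows the "- **Key:** value" pattern exactly
    key, _, value = line.partition(":**")
    parsed[key.strip("- *").lower().replace(" ", "_")] = value.strip()

print(parsed)
# {'first_name': 'Khuyen', 'last_name': 'Tran', 'years_of_experience': '5 years', 'primary_skill': 'Python and machine learning'}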
Pydantic AI and Structured LLM Outputs
PydanticAI helps solve this problem. It combines the power of language models with Pydantic, a Python library for data validation. By doing so, it allows you to define exactly what kind of output you expect and ensures the model sticks to that format.
In this guide, you’ll learn how to:
- Understand what outputs to expect from language models
- Use Pydantic to define a “schema” for expected outputs
- Validate and structure LLM responses automatically
- Safely build reliable AI agents for real-world data science workflows
The source code for this article can be found here:
Prerequisites
Make sure you have the following packages installed:
pip install pydantic openai pydantic-ai
You also need access to the OpenAI API with a valid key:
export OPENAI_API_KEY="your-api-key"
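If you work in a notebook rather than a shell, you can set the same variable from Python before creating any clients or agents. This is just a convenience sketch, not a PydanticAI requirement:
import os

# Equivalent to the shell export above
os.environ["OPENAI_API_KEY"] = "your-api-key"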
Core Workflow: Building a Type-Safe Agent
First, define a Pydantic model that describes the expected structure of your agent's output:
from pydantic import BaseModel
from typing import List
class ApplicantProfile(BaseModel):
    first_name: str
    last_name: str
    experience_years: int
    primary_skill: List[str]
This model acts as a contract, ensuring that the language model returns a structured object with the correct fields and types.
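Because ApplicantProfile is an ordinary Pydantic model, you can see this contract in action on plain Python data before involving an LLM at all. The snippet below is a quick illustration of that validation behavior:
from pydantic import ValidationError

# Well-formed data passes validation; compatible values are coerced to the declared types
profile = ApplicantProfile(
    first_name="Khuyen",
    last_name="Tran",
    experience_years="5",            # the string "5" is coerced to the int 5
    primary_skill=["Python"],
)
print(profile.experience_years)      # 5

# Data that violates the schema raises a ValidationError
try:
    ApplicantProfile(
        first_name="Khuyen",
        last_name="Tran",
        experience_years="five",     # not a valid integer
        primary_skill=["Python"],
    )
except ValidationError as error:
    print(error)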
Now, use the output_type parameter to connect this model to your agent:
from pydantic_ai import Agent
agent = Agent(
    'gpt-4o-mini-2024-07-18',
    system_prompt='Extract name, years of experience, and primary skill from the job applicant description.',
    output_type=ApplicantProfile,
)
result = agent.run_sync('Khuyen Tran is a data scientist with 5 years of experience, skilled in Python and machine learning.')
print(result.output)
Output:
first_name='Khuyen' last_name='Tran' experience_years=5 primary_skill=['Python', 'machine learning']
This structured output is safe to pass directly into downstream applications without modification.
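Because the fields are typed attributes on a Pydantic object, you can also read them directly. A quick illustration:
print(result.output.first_name)        # 'Khuyen'
print(result.output.experience_years)  # 5, already an int rather than the string "5 years"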
result.output returns a Pydantic object. To convert it into a standard Python dictionary for further use, call:
result.output.model_dump()
Output:
{
  "first_name": "Khuyen",
  "last_name": "Tran",
  "experience_years": 5,
  "primary_skill": [
    "Python",
    "machine learning"
  ]
}
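If you need a JSON string rather than a dictionary, for instance to write the result to a file or return it from an API, Pydantic's model_dump_json method handles the serialization. A small sketch (the file name is arbitrary):
# Serialize the validated output straight to a JSON string
json_str = result.output.model_dump_json(indent=2)

with open("applicant_profile.json", "w") as f:
    f.write(json_str)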
You can now easily integrate this into other data workflows. For example, to convert the output into a pandas DataFrame:
import pandas as pd
# The list in primary_skill expands into one row per skill
df = pd.DataFrame(result.output.model_dump())
df
Output:
  first_name last_name  experience_years     primary_skill
0     Khuyen      Tran                 5            Python
1     Khuyen      Tran                 5  machine learning
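The same pattern scales to multiple applicants. The loop below is a sketch in which the second description is made up for illustration; it runs the agent once per description and stacks the validated results into a single DataFrame:
descriptions = [
    "Khuyen Tran is a data scientist with 5 years of experience, skilled in Python and machine learning.",
    "Alex Kim is a data engineer with 3 years of experience, skilled in SQL and Spark.",  # hypothetical example
]

rows = []
for description in descriptions:
    profile = agent.run_sync(description).output  # validated ApplicantProfile
    rows.append(profile.model_dump())

applicants_df = pd.DataFrame(rows)
print(applicants_df)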
Using the DuckDuckGo Search Tool
Have you ever tried to make your AI app respond to current events or user queries with real-world data without managing a custom search backend?
PydanticAI supports integrating tools like DuckDuckGo search to enhance your AI agents with live web results.
from pydantic import BaseModel
from pydantic_ai import Agent
from pydantic_ai.common_tools.duckduckgo import duckduckgo_search_tool
from typing import List
class UnemploymentDataSource(BaseModel):
    title: List[str]
    description: List[str]
    url: List[str]
# Define the agent with DuckDuckGo search tool
search_agent = Agent(
    'gpt-4o-mini-2024-07-18',
    tools=[duckduckgo_search_tool()],
    system_prompt='Search DuckDuckGo and return links or resources that match the query.',
    output_type=UnemploymentDataSource,
)
# Run a search for unemployment rate dataset
unemployment_result = search_agent.run_sync(
    'Monthly unemployment rate dataset for US from 2018 to 2024'
)
print(unemployment_result.output)
Example output:
title=[
    'Civilian unemployment rate - U.S. Bureau of Labor Statistics',
    'Databases, Tables & Calculators by Subject - U.S. Bureau of Labor Statistics',
    'Unemployment Rate (UNRATE) | FRED | St. Louis Fed',
    'US Unemployment Rate Monthly Analysis: Employment Situation - YCharts',
    'U.S. Unemployment Rate 1991-2025 - Macrotrends'
]
description=[
    'The U.S. Bureau of Labor Statistics provides information on the civilian unemployment rate.',
    'Access various data tables and calculators related to employment situations in the U.S.',
    "Access historical unemployment rates and data through the St. Louis Fed's FRED database.",
    'In-depth view into historical data of the U.S. unemployment rate including projections.',
    'Details on U.S. unemployment rate trends and statistics from 1991 to 2025.'
]
url=[
    'https://www.bls.gov/charts/employment-situation/civilian-unemployment-rate.htm',
    'https://www.bls.gov/data/',
    'https://fred.stlouisfed.org/series/UNRATE/',
    'https://ycharts.com/indicators/us_unemployment_rate',
    'https://www.macrotrends.net/global-metrics/countries/USA/united-states/unemployment-rate'
]
This output is fully structured and aligns with the UnemploymentDataSource schema. It makes the data easy to load into tables or use in downstream analytics workflows without additional transformation.
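Because the result is a validated UnemploymentDataSource, turning it into a table is a one-liner. As a quick sketch, the parallel title, description, and url lists map directly onto DataFrame columns:
import pandas as pd

# Each field is a list of the same length, so the dict becomes columns of equal height
sources_df = pd.DataFrame(unemployment_result.output.model_dump())
print(sources_df[["title", "url"]])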
Comparison with LangChain Structured Output
How PydanticAI Handles Structured Output
PydanticAI returns Pydantic objects directly, so you can immediately access structured fields like cook_time without extra parsing.
from typing import Optional, List
from pydantic import BaseModel
from pydantic_ai import Agent
class RecipeExtractor(BaseModel):
    ingredients: List[str]
    instructions: str
    cook_time: Optional[str]
recipe_agent = Agent(
    "gpt-4o-mini-2024-07-18",
    system_prompt="Pull ingredients, instructions, and cook time.",
    output_type=RecipeExtractor,
)
recipe_result = recipe_agent.run_sync(
    "Sugar, flour, cocoa, eggs, and milk. Mix, bake at 350F for 30 min."
)
print(recipe_result.output.cook_time)
# 30 minutes
PydanticAI excels at standalone LLM tasks, where you prompt a model once and immediately use the structured output without multiple steps, chaining, or external orchestration.
How LangChain Handles Structured Output
LangChain can bind a Pydantic model to the chat model as a tool, but you must manually extract values from tool_calls, which adds an extra step.
from typing import Optional, List
from pydantic import BaseModel
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage
# Initialize the chat model
model = ChatOpenAI(model="gpt-4o-mini-2024-07-18", temperature=0)
# Bind the response formatter schema
model_with_tools = model.bind_tools([RecipeExtractor])
# Create a list of messages to send to the model
messages = [
    SystemMessage("Pull ingredients, instructions, and cook time."),
    HumanMessage("Sugar, flour, cocoa, eggs, and milk. Mix, bake at 350F for 30 min."),
]
# Invoke the model with the prepared messages
ai_msg = model_with_tools.invoke(messages)
# Access the tool calls made during the model invocation
print(ai_msg.tool_calls[0]['args']['cook_time'])
# 30 minutes
LangChain is better suited for multi-step workflows, such as combining several tools, using routing logic, or building custom chains.
Final Thoughts
I find PydanticAI to be an easy-to-use tool for structuring LLM outputs effectively. It keeps workflows organized and predictable, which cuts down on the ad hoc parsing and error handling you would otherwise have to write.
With just a few lines of code, you get robust schema validation that integrates naturally into Python-based pipelines. That makes it a practical choice for data scientists aiming to move beyond simple prototypes.