Stop Hand-Tuning Prompts: Auto-Optimize an LLM Classifier with DSPy

Blog, LLM / Khuyen Tran

Table of Contents

The Problem with Hand-Written Prompts
What Is DSPy?
Setup: Banking Query Classification
Define the Task with a Signature
Run the Task with DSPy Modules
Evaluate the Baseline
Optimize the Classifier with Examples
Compare Before vs. After
Save and Reuse the Optimized Program
Final Thoughts

The Problem with Hand-Written Prompts
Model choice matters, but prompt quality matters too. If the prompt is vague or hard to maintain, the classifier can still produce wrong labels.
A typical example is a prompt written as one string:
prompt = """
Classify this banking query as:
- card_arrival
- card_delivery_estimate
- card_not_working
- card_swallowed

Return only the label.

Query: My new card still has not arrived after two weeks.
Intent:
"""

This works for a simple demo, but real queries quickly reveal cases the prompt does not handle well.
For example:
My new card arrived, but it does not work at the ATM.

Because the prompt does not clarify this edge case, the model may focus on “new card” and return:

Output
card_arrival

But the correct intent is:

Output
card_not_working

You can patch the prompt with another rule, but that creates a new problem: every change needs to be retested, because a fix for one visible mistake can introduce new failures elsewhere.
Without a dataset and metric, you cannot tell whether the classifier improved overall.
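
To make this concrete, here is a minimal sketch with no LLM involved. The two keyword-rule functions are hypothetical stand-ins for a prompt before and after a patch, and the four labeled queries are invented for illustration (they are not taken from any dataset). The only point is to show how a metric over labeled examples exposes a trade-off that spot-checking one query would miss.

```python
# Hypothetical labeled queries, invented for illustration only.
LABELED = [
    ("My new card still has not arrived after two weeks.", "card_arrival"),
    ("My new card arrived, but it does not work at the ATM.", "card_not_working"),
    ("I need my new card for work travel next week. Where is it?", "card_arrival"),
    ("My card stopped working in shops.", "card_not_working"),
]


def classify_v1(query: str) -> str:
    """Stand-in for the original prompt: 'new card' wins over everything."""
    q = query.lower()
    if "new card" in q:
        return "card_arrival"
    if "work" in q:
        return "card_not_working"
    return "card_arrival"


def classify_v2(query: str) -> str:
    """Stand-in for the patched prompt: any mention of 'work' now wins."""
    q = query.lower()
    if "work" in q:
        return "card_not_working"
    return "card_arrival"


def accuracy(classifier, labeled) -> float:
    """Fraction of labeled queries the classifier gets right."""
    correct = sum(classifier(query) == label for query, label in labeled)
    return correct / len(labeled)


for name, classifier in [("v1", classify_v1), ("v2", classify_v2)]:
    print(name, accuracy(classifier, LABELED))
```

Both versions score 0.75 on this toy set: the patch fixes the ATM query but breaks the work-travel query, so the overall quality has not moved even though the visible bug is gone.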
DSPy replaces manual prompt tweaking with four repeatable steps:

Define the task as a program.
Evaluate the program with examples and a metric.
Let an optimizer improve the program.
Compare the score before and after.

This article walks through that loop by building a small banking intent classifier.

💻 Get the Code: Open the notebook in Google Colab to run it in your browser, or grab the source from GitHub.

What Is DSPy?
DSPy is a Python framework for programming LLM workflows instead of hand-writing prompts.
It breaks an LLM workflow into explicit parts:

Signatures define the inputs and outputs.
Modules run the task with strategies such as Predict or ChainOfThought.
Metrics score the outputs.
Optimizers improve the program using examples and metrics.

This structure makes prompt engineering measurable. You can compare versions, optimize against a metric, and reuse the improved program.
Manual prompt          DSPy program
-----------------      ------------
Task description   --> Signature
Prompting style    --> Module
Manual inspection  --> Metric
Prompt tweaking    --> Optimizer

Setup: Banking Query Classification
Install the libraries used in this tutorial:
pip install -U dspy pandas python-dotenv

This article uses dspy v3.2.1, pandas v2.3.1, and python-dotenv v1.1.1.
This tutorial uses OpenAI’s gpt-4o-mini through DSPy’s language model interface. Store your API key in a .env file:
OPENAI_API_KEY=your-openai-api-key

Then load the environment variables and configure DSPy:
from typing import Literal

import dspy
import pandas as pd
from dotenv import load_dotenv

load_dotenv()

lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=lm)

We will use BANKING77, a dataset of banking support questions labeled with customer intents. To keep loading simple, this tutorial reads the raw CSV files from the original PolyAI repository.
TRAIN_URL = "https://raw.githubusercontent.com/PolyAI-LDN/task-specific-datasets/master/banking_data/train.csv"
TEST_URL = "https://raw.githubusercontent.com/PolyAI-LDN/task-specific-datasets/master/banking_data/test.csv"

train_df = pd.read_csv(TRAIN_URL)
test_df = pd.read_csv(TEST_URL)

print(train_df.head())

Output
text category
0 I am still waiting on my card? card_arrival
1 What can I do if my card still hasn't arrived … card_arrival
2 I have been waiting over a week. Is the card s… card_arrival
3 Can I track my card while it is in the process… card_arrival
4 How do I know if I will get my card, or if it … card_arrival

To keep the example small, we will use four card-support intents instead of all 77 labels. The subset is still useful because card_arrival and card_delivery_estimate are similar enough to create meaningful mistakes.
INTENTS = [
    "card_arrival",
    "card_delivery_estimate",
    "card_not_working",
    "card_swallowed",
]

def sample_intents(data: pd.DataFrame, examples_per_intent: int) -> pd.DataFrame:
    return (
        data[data["category"].isin(INTENTS)]
        .groupby("category", group_keys=False)
        .sample(n=examples_per_intent, random_state=42)
        .reset_index(drop=True)
    )

train_sample = sample_intents(train_df, examples_per_intent=8)
dev_sample = sample_intents(test_df, examples_per_intent=10)

print(train_sample["category"].value_counts())

Output
category
card_arrival 8
card_delivery_estimate 8
card_not_working 8
card_swallowed 8
Name: count, dtype: int64
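
The same per-label sampling idea can be sketched with only the standard library, which may help if the pandas chain above feels opaque. The rows below are hypothetical placeholders, and `sample_per_label` is an illustrative helper name, not part of pandas or DSPy. It will not pick the same rows as the pandas version, since the two use different random number generators.

```python
import random
from collections import Counter, defaultdict


def sample_per_label(rows, labels, n, seed=42):
    """Pick n rows for each label in `labels`, reproducibly."""
    by_label = defaultdict(list)
    for text, label in rows:
        if label in labels:  # drop rows whose label we do not care about
            by_label[label].append((text, label))

    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    sampled = []
    for label in labels:
        sampled.extend(rng.sample(by_label[label], n))
    return sampled


# Hypothetical rows: 5 placeholder queries per label, including one unused label.
rows = [
    (f"query {i} about {label}", label)
    for label in ["card_arrival", "card_swallowed", "top_up_failed"]
    for i in range(5)
]

subset = sample_per_label(rows, ["card_arrival", "card_swallowed"], n=3)
print(Counter(label for _, label in subset))
```

Sampling the same number of rows per label keeps the small subset balanced, so a classifier cannot score well by favoring one frequent intent.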

Before evaluation, prepare the data for DSPy:

Store each query-label pair as a dspy.Example.
Mark query as the input field with .with_inputs("query").
Keep intent as the target label DSPy will compare against the prediction.

def to_dspy_examples(data: pd.DataFrame) -> list[dspy.Example]:
    return [
        dspy.Example(query=row.text, intent=row.category).with_inputs("query")
        for row in data.itertuples(index=False)
    ]

trainset = to_dspy_examples(train_sample)
devset = to_dspy_examples(dev_sample)

Let’s inspect one row to confirm that only query is marked as model input:
example = trainset[0]

print("Full example:")
print(example)

print("\nWhat the model receives:")
print(example.inputs())

print("\nExpected answer kept for scoring:")
print(example.intent)

Output
Full example:
Example({'query': 'If I ordered my new card last week, how much longer should I wait to receive it?', 'intent': 'card_arrival'}) (input_keys={'query'})

What the model receives:
Example({'query': 'If I ordered my new card last week, how much longer should I wait to receive it?'}) (input_keys={'query'})

Expected answer kept for scoring:
card_arrival

Notice that the full example contains both query and intent, but example.inputs() contains only query. This prevents the model from seeing the expected answer during prediction.
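
The input/label split can be pictured with a small toy class. This is a sketch of the idea only: `LabeledExample` is a hypothetical name and does not reflect how `dspy.Example` is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class LabeledExample:
    """Toy stand-in for a labeled example; not DSPy's implementation."""

    fields: dict
    input_keys: set = field(default_factory=set)

    def inputs(self) -> dict:
        # Only the fields marked as inputs are passed to the model.
        return {k: v for k, v in self.fields.items() if k in self.input_keys}

    def labels(self) -> dict:
        # Everything else is held back and used only for scoring.
        return {k: v for k, v in self.fields.items() if k not in self.input_keys}


example = LabeledExample(
    fields={"query": "My card has not arrived yet.", "intent": "card_arrival"},
    input_keys={"query"},
)

print(example.inputs())
print(example.labels())
```

Keeping the expected label in the same object as the query is convenient for scoring, as long as only the input side ever reaches the model.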
Define the Task with a Signature
A DSPy signature makes the task explicit. Instead of hiding the task inside a prompt string, you define the input fields, output fields, and output constraints in code.
The signature below defines the task schema:

Input field: query
Output field: intent
Allowed outputs: card_arrival, card_delivery_estimate, card_not_working, card_swallowed
Field descriptions: short hints DSPy can use when prompting the model

class ClassifyBankingIntent(dspy.Signature):
    """Classify a banking support query into one of the allowed intents."""

    query: str = dspy.InputField(desc="Customer support query")
    intent: Literal[
        "card_arrival",
        "card_delivery_estimate",
        "card_not_working",
        "card_swallowed",
    ] = dspy.OutputField(desc="Predicted banking intent")

The typed intent field is how DSPy keeps outputs within the allowed labels. For a dedicated way to enforce and validate typed LLM outputs with Python types, see Enforce Structured Outputs from LLMs with PydanticAI.
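
To see why a `Literal` type is useful here, the sketch below reads the allowed labels off the type with the standard `typing.get_args` and uses them to check a raw string. It illustrates the kind of check a typed field makes possible; it is not how DSPy parses outputs internally, and `parse_intent` is a hypothetical helper.

```python
from typing import Literal, get_args

Intent = Literal[
    "card_arrival",
    "card_delivery_estimate",
    "card_not_working",
    "card_swallowed",
]

# The allowed labels come straight from the type, so there is one source of truth.
ALLOWED = set(get_args(Intent))


def parse_intent(raw: str) -> str:
    """Normalize a raw model string and reject labels outside the allowed set."""
    label = raw.strip().lower()
    if label not in ALLOWED:
        raise ValueError(f"Unexpected intent: {raw!r}")
    return label


print(parse_intent(" card_swallowed\n"))
```

Failing loudly on an unknown label is usually preferable to silently passing it downstream, where it would be counted as just another wrong prediction.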
Run the Task with DSPy Modules
A DSPy module turns the signature into callable code.
Different modules run the same task in different ways:

Predict returns the output directly.
ChainOfThought adds a reasoning step before the output.
ReAct can call tools before answering.

Because they can share the same signature, you can switch strategies without redefining the task.
Predict: Direct Prediction
Predict is the simplest module. It asks the model to return the output directly.
predict_classifier = dspy.Predict(ClassifyBankingIntent)

prediction = predict_classifier(
    query="The ATM kept my card and did not return it. How do I get it back?"
)

print(f'Intent: {prediction.intent}')

Output
Intent: card_swallowed

This matches the query and stays within the allowed intent labels.
ChainOfThought: Reason Before Predicting
ChainOfThought keeps the same input and output fields, but adds a reasoning step before the prediction:
cot_classifier = dspy.ChainOfThought(ClassifyBankingIntent)

prediction = cot_classifier(
    query="The ATM kept my card and did not return it. How do I get it back?"
)

print(f'Reasoning: {prediction.reasoning}')
print(f'Intent: {prediction.intent}')

Output
Reasoning: The customer's query indicates that their card was not returned by an ATM, which suggests that the card was likely swallowed by the machine. The customer is seeking information on how to retrieve their card, which aligns with the intent of a card being swallowed by the ATM.
Intent: card_swallowed

Unlike Predict, ChainOfThought exposes the reasoning before the final label. The predicted intent is still card_swallowed.
ReAct: Use Tools Before Answering
ReAct is useful when the model needs to use tools before answering. In the example below:

lookup_transfer_status is a Python tool that retrieves transfer details.
dspy.ReAct decides when to call that tool and uses the result to answer.

def lookup_transfer_status(reference_id: str) -> str:
    """Return transfer status for a reference ID."""
    transfers = {
        "TRX-1042": "Completed on March 12. Recipient bank confirmed receipt.",
        "TRX-2048": "Pending review. Expected completion within 1 business day.",
    }
    return transfers.get(reference_id, "Transfer reference not found.")

react_agent = dspy.ReAct(
    signature="query -> answer",     # receive query, return answer
    tools=[lookup_transfer_status],  # allow transfer lookup
    max_iters=3,                     # stop after 3 iterations
)

response = react_agent(
    query="Did transfer TRX-1042 reach the recipient?"
)

print(response.answer)

Output
Yes, transfer TRX-1042 has reached the recipient.

Notice that the model answers using the lookup result instead of guessing from the prompt alone.
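
The act-then-observe loop behind this behavior can be sketched without an LLM. In the toy version below, `scripted_policy` is a hypothetical, hand-written stand-in for the model's decision step, and `run_agent` is an illustrative loop, not DSPy's implementation. A real agent would let the model choose the tool and its arguments.

```python
import re


def lookup_transfer_status(reference_id: str) -> str:
    """Return transfer status for a reference ID (same toy data as above)."""
    transfers = {
        "TRX-1042": "Completed on March 12. Recipient bank confirmed receipt.",
        "TRX-2048": "Pending review. Expected completion within 1 business day.",
    }
    return transfers.get(reference_id, "Transfer reference not found.")


def scripted_policy(query: str, observations: list[str]) -> tuple[str, str]:
    """Hypothetical stand-in for the LLM: choose the next action.

    Returns ("call_tool", reference_id) or ("finish", answer).
    """
    if not observations:
        match = re.search(r"TRX-\d+", query)
        if match:
            return "call_tool", match.group()
        return "finish", "No transfer reference found in the query."
    return "finish", f"Lookup result: {observations[-1]}"


def run_agent(query: str, max_iters: int = 3) -> str:
    """Alternate between deciding and acting until an answer or the limit."""
    observations: list[str] = []
    for _ in range(max_iters):
        action, payload = scripted_policy(query, observations)
        if action == "finish":
            return payload
        observations.append(lookup_transfer_status(payload))  # act, then observe
    return "Stopped: iteration limit reached."


print(run_agent("Did transfer TRX-1042 reach the recipient?"))
```

The iteration cap plays the same role as `max_iters` above: it bounds cost when the policy never decides to finish.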
Now that the modules are defined, the next step is to score the classifier versions and optimize one of them.
Evaluate the Baseline
Before optimizing the classifier, we need to measure the baseline. Here, the metric is simple: a prediction is correct when the predicted intent matches the expected label.
def intent_exact_match(example: dspy.Example, prediction: dspy.Prediction, trace=None) -> bool:
    return example.intent == prediction.intent

Now create a DSPy evaluator:
evaluate = dspy.Evaluate(
    devset=devset,              # examples to score
    metric=intent_exact_match,  # scoring function
    num_threads=4,              # parallel model calls
    display_progress=True,      # show progress bar
    display_table=5,            # show sample predictions
)
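
Conceptually, an evaluator of this kind runs the program on every example, applies the metric, and reports the share that passed. The sketch below shows that idea with a thread pool. `evaluate_program`, `toy_program`, and the three toy examples are hypothetical; this is not DSPy's implementation, and details such as rounding are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_program(program, devset, metric, num_threads=4) -> float:
    """Score `program` on `devset`; return the percentage of passing examples."""

    def score_one(example) -> bool:
        prediction = program(example["query"])
        return bool(metric(example, prediction))

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        results = list(pool.map(score_one, devset))  # order matches devset
    return round(100 * sum(results) / len(results), 2)


# Hypothetical stand-in for an LLM call: a simple keyword rule.
def toy_program(query: str) -> dict:
    intent = "card_swallowed" if "ATM" in query else "card_arrival"
    return {"intent": intent}


def exact_match(example, prediction) -> bool:
    return example["intent"] == prediction["intent"]


toy_devset = [
    {"query": "The ATM kept my card.", "intent": "card_swallowed"},
    {"query": "My card has not arrived yet.", "intent": "card_arrival"},
    {"query": "My card is declined in shops.", "intent": "card_not_working"},
]

print(evaluate_program(toy_program, toy_devset, exact_match))
```

Threads help here because each real prediction spends most of its time waiting on a network call, so several requests can be in flight at once.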

Use the same evaluator to compare Predict and ChainOfThought on the dev set:
predict_score = evaluate(predict_classifier)
print(f"Predict score: {predict_score.score}")

Because display_table=5 is set, the evaluator prints a sample of predictions before the score:

query | example_intent | pred_intent | intent_exact_match
My card still hasn’t arrived after 2 weeks. Is it lost? | card_arrival | card_arrival | ✅ True
I’ve been waiting longer than expected for my card. | card_arrival | card_delivery_estimate | ❌ False
I have been waiting longer than expected for my bank card, could you provide information on when it will arrive? | card_arrival | card_delivery_estimate | ❌ False
I think something went wrong with my card delivery as I haven’t received it yet. | card_arrival | card_delivery_estimate | ❌ False
My card has not arrived yet. | card_arrival | card_arrival | ✅ True

… 35 more rows not displayed …

Output
Predict score: 77.5

Run the same evaluator on ChainOfThought:
cot_score = evaluate(cot_classifier)
print(f"ChainOfThought score: {cot_score.score}")

query | example_intent | pred_intent | intent_exact_match
My card still hasn’t arrived after 2 weeks. Is it lost? | card_arrival | card_arrival | ✅ True
I’ve been waiting longer than expected for my card. | card_arrival | card_arrival | ✅ True
I have been waiting longer than expected for my bank card, could you provide information on when it will arrive? | card_arrival | card_delivery_estimate | ❌ False
I think something went wrong with my card delivery as I haven’t received it yet. | card_arrival | card_delivery_estimate | ❌ False
My card has not arrived yet. | card_arrival | card_arrival | ✅ True

… 35 more rows not displayed …

Output
ChainOfThought score: 80.0

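The displayed rows make the comparison easier to inspect: you can see which examples matched the expected intent and which ones failed. In this run, ChainOfThought scores slightly higher, and the sample rows suggest the reasoning step helps with some ambiguous delivery queries. The gap is small, though, so it is worth confirming on a larger dev set before drawing conclusions.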
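
Because the dev set has 40 examples (4 intents with 10 examples each), each percentage maps back to a whole number of correct predictions. The short calculation below assumes the score is the percentage of exact matches, as the metric above implies; `correct_count` is a hypothetical helper.

```python
DEV_SIZE = 40  # 4 intents x 10 dev examples each


def correct_count(score: float, total: int = DEV_SIZE) -> int:
    """Convert a percentage score back into a number of correct predictions."""
    return round(score / 100 * total)


for name, score in {"Predict": 77.5, "ChainOfThought": 80.0}.items():
    correct = correct_count(score)
    print(f"{name}: {correct}/{DEV_SIZE} correct, {DEV_SIZE - correct} missed")
```

Seen as counts, the two baselines differ by a single example (31 versus 32 correct), which is a useful reminder of how coarse a 40-example dev set is.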
Optimize the Classifier with Examples
Once the metric shows where the baseline fails, DSPy can use training examples to search for a better version of the program.
DSPy provides several optimizer options depending on how much search you want:

BootstrapFewShot improves the prompt by adding better examples.
MIPROv2 improves the prompt by tuning both instructions and examples.
GEPA improves the prompt by using feedback from previous attempts.

This article uses BootstrapFewShot because it is the simplest optimizer for this setup. It uses the training set and metric to choose examples that make the prompt stronger.
Few-shot examples are useful when the label name alone is not enough.
For example, card_arrival could sound like a successful delivery, but this example shows what it means in the dataset:
Query: My card has not arrived yet.
Intent: card_arrival

The label refers to questions or problems about card delivery. BootstrapFewShot helps find examples like this and add them to the prompt:

from dspy.teleprompt import BootstrapFewShot

optimizer = BootstrapFewShot(
    metric=intent_exact_match,  # score each candidate
    max_bootstrapped_demos=4,   # generated examples to keep
    max_labeled_demos=8,        # labeled examples to include
    max_rounds=1,               # bootstrap attempts per example
)

optimized_classifier = optimizer.compile(
    student=cot_classifier,
    trainset=trainset,
)
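
The core idea of bootstrapping can be sketched in a few lines: run the program on training examples and keep the outputs that pass the metric as demonstrations. The version below is heavily simplified and is not DSPy's implementation; it ignores teacher programs, multiple rounds, and the sampling of labeled demos. `bootstrap_demos`, `toy_program`, and the toy training set are hypothetical.

```python
def bootstrap_demos(program, trainset, metric, max_bootstrapped_demos=4):
    """Keep training examples whose own prediction passes the metric."""
    demos = []
    for example in trainset:
        if len(demos) >= max_bootstrapped_demos:
            break
        prediction = program(example["query"])
        if metric(example, prediction):
            # Store the full output (reasoning and intent) alongside the query.
            demos.append({"query": example["query"], **prediction})
    return demos


# Hypothetical stand-in for an LLM call: a keyword rule that also emits reasoning.
def toy_program(query: str) -> dict:
    if "how long" in query.lower():
        return {"reasoning": "Asks about delivery time.", "intent": "card_delivery_estimate"}
    return {"reasoning": "Asks about a card in transit.", "intent": "card_arrival"}


def exact_match(example, prediction) -> bool:
    return example["intent"] == prediction["intent"]


toy_trainset = [
    {"query": "My card has not arrived yet.", "intent": "card_arrival"},
    {"query": "How long does delivery take?", "intent": "card_delivery_estimate"},
    {"query": "When will my card get here?", "intent": "card_delivery_estimate"},
]

demos = bootstrap_demos(toy_program, toy_trainset, exact_match)
for demo in demos:
    print(demo["query"], "->", demo["intent"])
```

The third toy example is dropped because the stand-in program gets it wrong, which is the filtering effect that matters: only outputs that agree with the label are kept as demonstrations.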

Inspect the examples added to the optimized prompt:
predictor = optimized_classifier.predictors()[0]

bootstrapped_demos = [
    demo for demo in predictor.demos
    if getattr(demo, "augmented", False)
]

for i, demo in enumerate(bootstrapped_demos, start=1):
    print(f"Bootstrapped demo {i}")
    print("Query:", demo.query)
    print("Reasoning:", demo.reasoning)
    print("Intent:", demo.intent)
    print()

Output
Bootstrapped demo 1
Query: If I ordered my new card last week, how much longer should I wait to receive it?
Reasoning: The query asks about the expected waiting time for a newly ordered card, which suggests that the customer is inquiring about when it will arrive.
Intent: card_arrival

Bootstrapped demo 2
Query: Is there a reason my new card hasn't arrived?
Reasoning: The query is asking about the status of a new card that has not been received yet, indicating concern over the arrival of the card.
Intent: card_arrival

Bootstrapped demo 3
Query: I still haven't gotten my new card. When will it get here?
Reasoning: The query expresses concern about not receiving a new card yet and asks for information on its arrival. This indicates a focus on the status of the card's delivery.
Intent: card_arrival

Bootstrapped demo 4
Query: My card hasn't arrived in the mail yet. I ordered it two weeks ago. What can I do?
Reasoning: The customer is inquiring about the status of their card, which they have not received yet after ordering it two weeks ago. This indicates they are concerned about the arrival of their card.
Intent: card_arrival

The demos teach a consistent pattern: when the customer reports that a new card has not arrived, asks how much longer to wait, or asks what to do next, the expected intent is card_arrival.
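
To picture how demonstrations end up influencing the model, the sketch below renders a few of them into a plain-text few-shot prompt, using the same Query/Intent layout as the hand-written prompt at the start of the article. This layout is an assumption for illustration; DSPy builds its own prompt format, and `render_few_shot_prompt` is a hypothetical helper.

```python
def render_few_shot_prompt(instruction: str, demos: list[dict], query: str) -> str:
    """Build a prompt: instruction, worked examples, then the new query."""
    parts = [instruction, ""]
    for demo in demos:
        parts += [f"Query: {demo['query']}", f"Intent: {demo['intent']}", ""]
    parts += [f"Query: {query}", "Intent:"]
    return "\n".join(parts)


demos = [
    {"query": "Is there a reason my new card hasn't arrived?", "intent": "card_arrival"},
    {"query": "My card has not arrived yet.", "intent": "card_arrival"},
]

prompt = render_few_shot_prompt(
    "Classify a banking support query into one of the allowed intents.",
    demos,
    "I am waiting for my card to arrive.",
)
print(prompt)
```

The worked examples sit between the instruction and the new query, so the model sees how the label is used in practice before it has to answer.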
Compare Before vs. After
Evaluate the optimized classifier on the same dev set:
optimized_score = evaluate(optimized_classifier)

scores = pd.DataFrame(
    [
        {"program": "Predict", "score": predict_score.score},
        {"program": "ChainOfThought", "score": cot_score.score},
        {"program": "BootstrapFewShot + ChainOfThought", "score": optimized_score.score},
    ]
)

print(scores)

Output
program score
0 Predict 77.5
1 ChainOfThought 80.0
2 BootstrapFewShot + ChainOfThought 87.5

Nice! The optimized classifier performs best in this run, improving from 80.0 with ChainOfThought to 87.5 after adding optimized few-shot examples.
You can also inspect individual misses to understand what still fails:
for example in devset:
    prediction = optimized_classifier(query=example.query)

    if prediction.intent != example.intent:
        print("Query:", example.query)
        print("Expected:", example.intent)
        print("Predicted:", prediction.intent)
        print()

Output
Query: Is there tracking info available?
Expected: card_arrival
Predicted: card_delivery_estimate

Query: Where is the tracking number for the card you sent me?
Expected: card_arrival
Predicted: card_delivery_estimate

Query: Do you know if there is a tracking number for the new card you sent me?
Expected: card_arrival
Predicted: card_delivery_estimate

Query: I'm just wondering when my card will get here.
Expected: card_delivery_estimate
Predicted: card_arrival

Query: I am waiting for my card to arrive.
Expected: card_delivery_estimate
Predicted: card_arrival

All five misses in this run are confusions between card_arrival and card_delivery_estimate. That makes sense: queries for both intents mention waiting for a card, tracking, or delivery timing.
To improve this, we could add more labeled examples that separate “my card has not arrived” from “how long does delivery take?”
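
Counting the misses by (expected, predicted) pair makes the direction of the confusion explicit. The pairs below are copied from the five misses printed above; the tallying itself uses only the standard library.

```python
from collections import Counter

# (expected, predicted) pairs copied from the five misses printed above.
misses = [
    ("card_arrival", "card_delivery_estimate"),
    ("card_arrival", "card_delivery_estimate"),
    ("card_arrival", "card_delivery_estimate"),
    ("card_delivery_estimate", "card_arrival"),
    ("card_delivery_estimate", "card_arrival"),
]

confusions = Counter(misses)
for (expected, predicted), count in confusions.most_common():
    print(f"{expected} -> {predicted}: {count}")
```

Three misses go from card_arrival to card_delivery_estimate and two go the other way, so the confusion runs in both directions and new training examples should cover both labels.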
Save and Reuse the Optimized Program
Optimization can take time and spend LLM tokens, so you do not want to run it every time you classify a query. Instead, save the optimized classifier once so it can be loaded later:
save_path = "optimized_banking_classifier.json"
optimized_classifier.save(save_path)

When you need the classifier again, rebuild the same DSPy module and load the saved file:
loaded_classifier = dspy.ChainOfThought(ClassifyBankingIntent)
loaded_classifier.load(path=save_path)

prediction = loaded_classifier(
    query="I have been waiting two weeks and my new card still has not arrived."
)

print(prediction.intent)

Output
card_arrival

Loading skips the optimization step and reuses the saved prompt, so you avoid paying for optimization again and every run uses the same few-shot examples.
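
A common way to wire this into an application is an "optimize once, then load" helper. The sketch below shows the pattern with plain JSON and a fake optimizer; `load_or_optimize` and `fake_optimize` are hypothetical, and the file contents do not reflect the layout DSPy writes. In a real project the two branches would call the `load` and `compile`/`save` steps shown above.

```python
import json
import tempfile
from pathlib import Path


def load_or_optimize(path: Path, optimize) -> dict:
    """Reuse saved state when it exists; otherwise run `optimize` once and save."""
    if path.exists():
        return json.loads(path.read_text())
    state = optimize()
    path.write_text(json.dumps(state, indent=2))
    return state


calls = []  # records how many times the expensive step actually ran


def fake_optimize() -> dict:
    """Hypothetical stand-in for an expensive optimizer run."""
    calls.append(1)
    return {"demos": [{"query": "My card has not arrived yet.", "intent": "card_arrival"}]}


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_state.json"
    first = load_or_optimize(path, fake_optimize)   # runs the optimizer
    second = load_or_optimize(path, fake_optimize)  # loads from disk instead

print(first == second, len(calls))
```

The optimizer runs only on the first call; the second call returns the same state from disk, which is the behavior you want from a service that restarts often.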
Final Thoughts
DSPy is worth using when an LLM workflow will run repeatedly and quality matters. It is especially useful when you have:

Labeled examples
A metric
Several prompt or module versions to compare
A task that will evolve over time

It is probably too much for one-off prompts, quick brainstorming, or tasks where you do not have examples to evaluate against.
In this article, we followed the core DSPy workflow:

Define the task
Run the task with different strategies
Evaluate each version
Optimize the workflow with examples
Save the optimized result for reuse

Once this workflow is familiar, you can extend it with larger dev sets, more intent labels, and richer metrics. For more advanced optimization, explore DSPy’s MIPROv2 and GEPA docs.
Related Tutorials

Structured Output Tools for LLMs: Instructor, PydanticAI, LangChain, Outlines, and Guidance Compared: Compares libraries that force LLMs to return valid, typed outputs, the same problem DSPy signatures solve.
Build Production-Ready RAG Systems with MLflow Quality Metrics: Measures LLM output quality with metrics, complementing DSPy’s evaluate-and-optimize loop.