Stop Hand-Tuning Prompts: Auto-Optimize an LLM Classifier with DSPy

Table of Contents

The Problem with Hand-Written Prompts
What Is DSPy?
Setup: Banking Query Classification
Define the Task with a Signature
Run the Task with DSPy Modules
Evaluate the Baseline
Optimize the Classifier with Examples
Compare Before vs. After
Save and Reuse the Optimized Program
Final Thoughts

The Problem with Hand-Written Prompts
Model choice matters, but prompt quality matters too. If the prompt is vague or hard to maintain, the classifier can still produce wrong labels.
A typical example is a prompt written as one string:
prompt = """
Classify this banking query as:
- card_arrival
- card_delivery_estimate
- card_not_working
- card_swallowed

Return only the label.

Query: My new card still has not arrived after two weeks.
Intent:
"""

This works for a simple demo, but real queries quickly reveal cases the prompt does not handle well.
For example:
My new card arrived, but it does not work at the ATM.

Because the prompt does not clarify this edge case, the model may focus on “new card” and return:

Output
card_arrival

But the correct intent is:

Output
card_not_working

You can patch the prompt with another rule, but that creates a new problem: every change needs to be retested. A fix for one visible mistake can hide new failures elsewhere.
Without a dataset and metric, you cannot tell whether the classifier improved overall.
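To make that concrete, here is a minimal sketch of scoring two prompt versions against a small labeled set. The two `classify_*` functions are hypothetical keyword-rule stand-ins for an LLM call, and the four labeled queries are illustrative, not taken from BANKING77.

```python
# Hypothetical stand-ins for two prompt versions; a real version would call an LLM.
def classify_v1(query: str) -> str:
    q = query.lower()
    if "new card" in q:
        return "card_arrival"
    if "atm" in q:
        return "card_swallowed"
    return "card_not_working"


def classify_v2(query: str) -> str:
    # "Patch": treat any mention of "work" as a broken card, checked first.
    if "work" in query.lower():
        return "card_not_working"
    return classify_v1(query)


# Illustrative labeled examples (assumed for this sketch).
LABELED = [
    ("My new card still has not arrived after two weeks.", "card_arrival"),
    ("My new card arrived, but it does not work at the ATM.", "card_not_working"),
    ("The ATM kept my card and did not return it.", "card_swallowed"),
    ("My new card has not arrived after five working days.", "card_arrival"),
]


def accuracy(classify, labeled) -> float:
    """Fraction of queries whose predicted label equals the expected label."""
    hits = sum(classify(query) == label for query, label in labeled)
    return hits / len(labeled)


def misses(classify, labeled) -> list[str]:
    """Queries the classifier gets wrong."""
    return [query for query, label in labeled if classify(query) != label]


for name, fn in [("v1", classify_v1), ("v2", classify_v2)]:
    print(name, accuracy(fn, LABELED), misses(fn, LABELED))
```

The patch fixes the ATM query but breaks a delivery query that mentions "working days", so both versions score 0.75. Only the metric over the whole set shows that the change was not a net improvement.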
DSPy replaces manual prompt tweaking with four repeatable steps:

Define the task as a program.
Evaluate the program with examples and a metric.
Let an optimizer improve the program.
Compare the score before and after.

This article walks through that loop by building a small banking intent classifier.

💻 Get the Code: Open the notebook in Google Colab to run it in your browser, or grab the source from GitHub.

What Is DSPy?
DSPy is a Python framework for programming LLM workflows instead of hand-writing prompts.
It breaks an LLM workflow into explicit parts:

Signatures define the inputs and outputs.
Modules run the task with strategies such as Predict or ChainOfThought.
Metrics score the outputs.
Optimizers improve the program using examples and metrics.

This structure makes prompt engineering measurable. You can compare versions, optimize against a metric, and reuse the improved program.
Manual prompt        DSPy program
-----------------    ------------
Task description  -> Signature
Prompting style   -> Module
Manual inspection -> Metric
Prompt tweaking   -> Optimizer

Setup: Banking Query Classification
Install the libraries used in this tutorial:
pip install -U dspy pandas python-dotenv

This article uses dspy v3.2.1, pandas v2.3.1, and python-dotenv v1.1.1.
This tutorial uses OpenAI’s gpt-4o-mini through DSPy’s language model interface. Store your API key in a .env file:
OPENAI_API_KEY=your-openai-api-key

Then load the environment variables and configure DSPy:
from typing import Literal

import dspy
import pandas as pd
from dotenv import load_dotenv

load_dotenv()

lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=lm)

We will use BANKING77, a dataset of banking support questions labeled with customer intents. To keep loading simple, this tutorial reads the raw CSV files from the original PolyAI repository.
TRAIN_URL = "https://raw.githubusercontent.com/PolyAI-LDN/task-specific-datasets/master/banking_data/train.csv"
TEST_URL = "https://raw.githubusercontent.com/PolyAI-LDN/task-specific-datasets/master/banking_data/test.csv"

train_df = pd.read_csv(TRAIN_URL)
test_df = pd.read_csv(TEST_URL)

print(train_df.head())

Output
text category
0 I am still waiting on my card? card_arrival
1 What can I do if my card still hasn't arrived … card_arrival
2 I have been waiting over a week. Is the card s… card_arrival
3 Can I track my card while it is in the process… card_arrival
4 How do I know if I will get my card, or if it … card_arrival

To keep the example small, we will use four card-support intents instead of all 77 labels. The subset is still useful because card_arrival and card_delivery_estimate are similar enough to create meaningful mistakes.
INTENTS = [
    "card_arrival",
    "card_delivery_estimate",
    "card_not_working",
    "card_swallowed",
]

def sample_intents(data: pd.DataFrame, examples_per_intent: int) -> pd.DataFrame:
    return (
        data[data["category"].isin(INTENTS)]
        .groupby("category", group_keys=False)
        .sample(n=examples_per_intent, random_state=42)
        .reset_index(drop=True)
    )

train_sample = sample_intents(train_df, examples_per_intent=8)
dev_sample = sample_intents(test_df, examples_per_intent=10)

print(train_sample["category"].value_counts())

Output
category
card_arrival 8
card_delivery_estimate 8
card_not_working 8
card_swallowed 8
Name: count, dtype: int64

Before evaluation, prepare the data for DSPy:

Store each query-label pair as a dspy.Example.
Mark query as the input field with .with_inputs("query").
Keep intent as the target label DSPy will compare against the prediction.

def to_dspy_examples(data: pd.DataFrame) -> list[dspy.Example]:
    return [
        dspy.Example(query=row.text, intent=row.category).with_inputs("query")
        for row in data.itertuples(index=False)
    ]

trainset = to_dspy_examples(train_sample)
devset = to_dspy_examples(dev_sample)

Let’s inspect one row to confirm that only query is marked as model input:
example = trainset[0]

print("Full example:")
print(example)

print("\nWhat the model receives:")
print(example.inputs())

print("\nExpected answer kept for scoring:")
print(example.intent)

Output
Full example:
Example({'query': 'If I ordered my new card last week, how much longer should I wait to receive it?', 'intent': 'card_arrival'}) (input_keys={'query'})

What the model receives:
Example({'query': 'If I ordered my new card last week, how much longer should I wait to receive it?'}) (input_keys={'query'})

Expected answer kept for scoring:
card_arrival

Notice that the full example contains both query and intent, but example.inputs() contains only query. This prevents the model from seeing the expected answer during prediction.
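The split between inputs and labels can be pictured with a plain dictionary. This is only a sketch of the idea, not how `dspy.Example` is implemented; `split_example` is a hypothetical helper.

```python
def split_example(example: dict, input_keys: set[str]) -> tuple[dict, dict]:
    """Separate the fields shown to the model from the fields kept for scoring."""
    inputs = {k: v for k, v in example.items() if k in input_keys}
    labels = {k: v for k, v in example.items() if k not in input_keys}
    return inputs, labels


example = {
    "query": "If I ordered my new card last week, how much longer should I wait to receive it?",
    "intent": "card_arrival",
}

inputs, labels = split_example(example, input_keys={"query"})
print(inputs)  # only the query
print(labels)  # only the expected intent
```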
Define the Task with a Signature
A DSPy signature makes the task explicit. Instead of hiding the task inside a prompt string, you define the input fields, output fields, and output constraints in code.
The signature below defines the task schema:

Input field: query
Output field: intent
Allowed outputs: card_arrival, card_delivery_estimate, card_not_working, card_swallowed
Field descriptions: short hints DSPy can use when prompting the model

class ClassifyBankingIntent(dspy.Signature):
    """Classify a banking support query into one of the allowed intents."""

    query: str = dspy.InputField(desc="Customer support query")
    intent: Literal[
        "card_arrival",
        "card_delivery_estimate",
        "card_not_working",
        "card_swallowed",
    ] = dspy.OutputField(desc="Predicted banking intent")

The typed intent field is how DSPy keeps outputs within the allowed labels. For a dedicated way to enforce and validate typed LLM outputs with Python types, see Enforce Structured Outputs from LLMs with PydanticAI.
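A `Literal` type is also useful outside DSPy. As a small sketch using only the standard library, `typing.get_args` returns the allowed values, which lets you check dataset labels or logged predictions against the same list. The names `Intent`, `ALLOWED_INTENTS`, and `is_allowed_intent` are assumptions for this example.

```python
from typing import Literal, get_args

Intent = Literal[
    "card_arrival",
    "card_delivery_estimate",
    "card_not_working",
    "card_swallowed",
]

ALLOWED_INTENTS = get_args(Intent)  # tuple of the four label strings


def is_allowed_intent(label: str) -> bool:
    """Return True when the label is one of the Literal's values."""
    return label in ALLOWED_INTENTS


print(is_allowed_intent("card_swallowed"))
print(is_allowed_intent("card_lost"))
```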
Run the Task with DSPy Modules
A DSPy module turns the signature into callable code.
Different modules run the same task in different ways:

Predict returns the output directly.
ChainOfThought adds a reasoning step before the output.
ReAct can call tools before answering.

Because they can share the same signature, you can switch strategies without redefining the task.
Predict: Direct Prediction
Predict is the simplest module. It asks the model to return the output directly.
predict_classifier = dspy.Predict(ClassifyBankingIntent)

prediction = predict_classifier(
    query="The ATM kept my card and did not return it. How do I get it back?"
)

print(f'Intent: {prediction.intent}')

Output
Intent: card_swallowed

This matches the query and stays within the allowed intent labels.
ChainOfThought: Reason Before Predicting
ChainOfThought keeps the same input and output fields, but adds a reasoning step before the prediction:
cot_classifier = dspy.ChainOfThought(ClassifyBankingIntent)

prediction = cot_classifier(
    query="The ATM kept my card and did not return it. How do I get it back?"
)

print(f'Reasoning: {prediction.reasoning}')
print(f'Intent: {prediction.intent}')

Output
Reasoning: The customer's query indicates that their card was not returned by an ATM, which suggests that the card was likely swallowed by the machine. The customer is seeking information on how to retrieve their card, which aligns with the intent of a card being swallowed by the ATM.
Intent: card_swallowed

Unlike Predict, ChainOfThought exposes the reasoning before the final label. The predicted intent is still card_swallowed.
ReAct: Use Tools Before Answering
ReAct is useful when the model needs to use tools before answering. In the example below:

lookup_transfer_status is a Python tool that retrieves transfer details.
dspy.ReAct decides when to call that tool and uses the result to answer.

def lookup_transfer_status(reference_id: str) -> str:
    """Return transfer status for a reference ID."""
    transfers = {
        "TRX-1042": "Completed on March 12. Recipient bank confirmed receipt.",
        "TRX-2048": "Pending review. Expected completion within 1 business day.",
    }
    return transfers.get(reference_id, "Transfer reference not found.")

react_agent = dspy.ReAct(
    signature="query -> answer",     # receive query, return answer
    tools=[lookup_transfer_status],  # allow transfer lookup
    max_iters=3,                     # stop after 3 iterations
)

response = react_agent(
    query="Did transfer TRX-1042 reach the recipient?"
)

print(response.answer)

Output
Yes, transfer TRX-1042 has reached the recipient.

Notice that the model answers using the lookup result instead of guessing from the prompt alone.
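Because the tool is an ordinary Python function, it can be checked on its own before it is handed to the agent. The snippet below repeats the function from above and calls it directly, with no LLM involved.

```python
def lookup_transfer_status(reference_id: str) -> str:
    """Return transfer status for a reference ID."""
    transfers = {
        "TRX-1042": "Completed on March 12. Recipient bank confirmed receipt.",
        "TRX-2048": "Pending review. Expected completion within 1 business day.",
    }
    return transfers.get(reference_id, "Transfer reference not found.")


# Known reference: the status string the agent would see as the tool result.
print(lookup_transfer_status("TRX-1042"))

# Unknown reference: the fallback message is returned instead of an exception.
print(lookup_transfer_status("TRX-9999"))
```

Testing the tool separately makes it easier to tell a tool bug from a model mistake when the agent gives a wrong answer.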
Now that the modules are defined, the next step is to score the classifier versions and optimize one of them.
Evaluate the Baseline
Before optimizing the classifier, we need to measure the baseline. Here, the metric is simple: a prediction is correct when the predicted intent matches the expected label.
def intent_exact_match(example: dspy.Example, prediction: dspy.Prediction, trace=None) -> bool:
    return example.intent == prediction.intent

Now create a DSPy evaluator:
evaluate = dspy.Evaluate(
    devset=devset,              # examples to score
    metric=intent_exact_match,  # scoring function
    num_threads=4,              # parallel model calls
    display_progress=True,      # show progress bar
    display_table=5,            # show sample predictions
)

Use the evaluator to score Predict on the dev set first:
predict_score = evaluate(predict_classifier)
print(f"Predict score: {predict_score.score}")

Because display_table=5 is set, the evaluator prints a sample of predictions before the score:

| query | example_intent | pred_intent | intent_exact_match |
| --- | --- | --- | --- |
| My card still hasn’t arrived after 2 weeks. Is it lost? | card_arrival | card_arrival | ✅ True |
| I’ve been waiting longer than expected for my card. | card_arrival | card_delivery_estimate | ❌ False |
| I have been waiting longer than expected for my bank card, could you provide information on when it will arrive? | card_arrival | card_delivery_estimate | ❌ False |
| I think something went wrong with my card delivery as I haven’t received it yet. | card_arrival | card_delivery_estimate | ❌ False |
| My card has not arrived yet. | card_arrival | card_arrival | ✅ True |

… 35 more rows not displayed …

Output
Predict score: 77.5

Run the same evaluator on ChainOfThought:
cot_score = evaluate(cot_classifier)
print(f"ChainOfThought score: {cot_score.score}")

| query | example_intent | pred_intent | intent_exact_match |
| --- | --- | --- | --- |
| My card still hasn’t arrived after 2 weeks. Is it lost? | card_arrival | card_arrival | ✅ True |
| I’ve been waiting longer than expected for my card. | card_arrival | card_arrival | ✅ True |
| I have been waiting longer than expected for my bank card, could you provide information on when it will arrive? | card_arrival | card_delivery_estimate | ❌ False |
| I think something went wrong with my card delivery as I haven’t received it yet. | card_arrival | card_delivery_estimate | ❌ False |
| My card has not arrived yet. | card_arrival | card_arrival | ✅ True |

… 35 more rows not displayed …

Output
ChainOfThought score: 80.0

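The displayed rows make the comparison easier to inspect: you can see which examples matched the expected intent and which ones failed. In this run, ChainOfThought scores slightly higher (80.0 versus 77.5, a difference of one example on the 40-example dev set). In the rows shown, the reasoning step fixes one ambiguous delivery query, but a gap this small could also be run-to-run variation.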
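It helps to translate the percentages back into counts. The sketch below assumes the score is the percentage of correct predictions on the 40-example dev set (4 intents × 10 examples each) and adds a rough binomial standard error as a reminder that small dev sets give noisy scores. The helper names are made up for this example.

```python
import math


def correct_count(score_pct: float, n_examples: int) -> int:
    """Number of correct predictions implied by a percentage score."""
    return round(score_pct / 100 * n_examples)


def standard_error_pct(score_pct: float, n_examples: int) -> float:
    """Binomial standard error of an accuracy estimate, in percentage points."""
    p = score_pct / 100
    return 100 * math.sqrt(p * (1 - p) / n_examples)


N_DEV = 40  # 4 intents x 10 sampled examples each

for name, score in [("Predict", 77.5), ("ChainOfThought", 80.0)]:
    print(name, correct_count(score, N_DEV), round(standard_error_pct(score, N_DEV), 1))
```

The two baselines differ by one example out of 40, which is well inside a standard error of roughly six points, so treat the gap as suggestive rather than conclusive.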
Optimize the Classifier with Examples
Once the metric shows where the baseline fails, DSPy can use training examples to search for a better version of the program.
DSPy provides several optimizer options depending on how much search you want:

BootstrapFewShot improves the prompt by adding better examples.
MIPROv2 improves the prompt by tuning both instructions and examples.
GEPA improves the prompt by using feedback from previous attempts.

This article uses BootstrapFewShot because it is the simplest optimizer for this setup. It uses the training set and metric to choose examples that make the prompt stronger.
Few-shot examples are useful when the label name alone is not enough.
For example, card_arrival could sound like a successful delivery, but this example shows what it means in the dataset:
Query: My card has not arrived yet.
Intent: card_arrival

The label refers to questions or problems about card delivery. BootstrapFewShot helps find examples like this and add them to the prompt:

from dspy.teleprompt import BootstrapFewShot

optimizer = BootstrapFewShot(
    metric=intent_exact_match,  # score each candidate
    max_bootstrapped_demos=4,   # generated examples to keep
    max_labeled_demos=8,        # labeled examples to include
    max_rounds=1,               # bootstrap attempts per example
)

optimized_classifier = optimizer.compile(
    student=cot_classifier,
    trainset=trainset,
)
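Conceptually, bootstrapping runs the program on training examples and keeps the attempts that pass the metric as demonstrations. The sketch below illustrates that idea with plain Python. It is a simplification under stated assumptions, not DSPy's implementation: `toy_teacher` is a hypothetical keyword rule standing in for an LLM call, and the three training rows are illustrative.

```python
from typing import Callable


def bootstrap_demos(
    trainset: list[dict],
    teacher: Callable[[str], str],
    metric: Callable[[dict, str], bool],
    max_demos: int,
) -> list[dict]:
    """Keep teacher outputs that pass the metric, up to max_demos."""
    demos = []
    for example in trainset:
        if len(demos) >= max_demos:
            break
        predicted = teacher(example["query"])
        if metric(example, predicted):  # only successful attempts become demos
            demos.append({"query": example["query"], "intent": predicted})
    return demos


def toy_teacher(query: str) -> str:
    # Hypothetical stand-in for an LLM call.
    return "card_swallowed" if "atm" in query.lower() else "card_arrival"


def exact_match(example: dict, predicted: str) -> bool:
    return example["intent"] == predicted


toy_trainset = [
    {"query": "My card has not arrived yet.", "intent": "card_arrival"},
    {"query": "How long does delivery take?", "intent": "card_delivery_estimate"},
    {"query": "The ATM kept my card.", "intent": "card_swallowed"},
]

demos = bootstrap_demos(toy_trainset, toy_teacher, exact_match, max_demos=4)
print(demos)
```

The second row is dropped because the stand-in teacher mislabels it, which mirrors why a metric is required: only attempts that score as correct are allowed into the prompt.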

Inspect the examples added to the optimized prompt:
predictor = optimized_classifier.predictors()[0]

bootstrapped_demos = [
    demo for demo in predictor.demos
    if getattr(demo, "augmented", False)
]

for i, demo in enumerate(bootstrapped_demos, start=1):
    print(f"Bootstrapped demo {i}")
    print("Query:", demo.query)
    print("Reasoning:", demo.reasoning)
    print("Intent:", demo.intent)
    print()

Output
Bootstrapped demo 1
Query: If I ordered my new card last week, how much longer should I wait to receive it?
Reasoning: The query asks about the expected waiting time for a newly ordered card, which suggests that the customer is inquiring about when it will arrive.
Intent: card_arrival

Bootstrapped demo 2
Query: Is there a reason my new card hasn't arrived?
Reasoning: The query is asking about the status of a new card that has not been received yet, indicating concern over the arrival of the card.
Intent: card_arrival

Bootstrapped demo 3
Query: I still haven't gotten my new card. When will it get here?
Reasoning: The query expresses concern about not receiving a new card yet and asks for information on its arrival. This indicates a focus on the status of the card's delivery.
Intent: card_arrival

Bootstrapped demo 4
Query: My card hasn't arrived in the mail yet. I ordered it two weeks ago. What can I do?
Reasoning: The customer is inquiring about the status of their card, which they have not received yet after ordering it two weeks ago. This indicates they are concerned about the arrival of their card.
Intent: card_arrival

The demos teach a consistent pattern: when the customer asks whether a new card has arrived, where it is, or what to do after waiting, the expected intent is card_arrival.
Compare Before vs. After
Evaluate the optimized classifier on the same dev set:
optimized_score = evaluate(optimized_classifier)

scores = pd.DataFrame(
    [
        {"program": "Predict", "score": predict_score.score},
        {"program": "ChainOfThought", "score": cot_score.score},
        {"program": "BootstrapFewShot + ChainOfThought", "score": optimized_score.score},
    ]
)

print(scores)

Output
program score
0 Predict 77.5
1 ChainOfThought 80.0
2 BootstrapFewShot + ChainOfThought 87.5

Nice! The optimized classifier performs best in this run, improving from 80.0 with ChainOfThought to 87.5 after adding optimized few-shot examples.
You can also inspect individual misses to understand what still fails:
for example in devset:
    prediction = optimized_classifier(query=example.query)

    if prediction.intent != example.intent:
        print("Query:", example.query)
        print("Expected:", example.intent)
        print("Predicted:", prediction.intent)
        print()

Output
Query: Is there tracking info available?
Expected: card_arrival
Predicted: card_delivery_estimate

Query: Where is the tracking number for the card you sent me?
Expected: card_arrival
Predicted: card_delivery_estimate

Query: Do you know if there is a tracking number for the new card you sent me?
Expected: card_arrival
Predicted: card_delivery_estimate

Query: I'm just wondering when my card will get here.
Expected: card_delivery_estimate
Predicted: card_arrival

Query: I am waiting for my card to arrive.
Expected: card_delivery_estimate
Predicted: card_arrival

All five misses are confusions between card_arrival and card_delivery_estimate. That makes sense: both intents mention waiting for a card, tracking, or delivery timing.
To improve this, we could add more labeled examples that separate “my card has not arrived” from “how long does delivery take?”
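A quick way to summarize misses like these is to count (expected, predicted) pairs. The sketch below hard-codes the five misses printed above so it runs without an LLM; in practice you would collect the pairs inside the loop.

```python
from collections import Counter

# (expected, predicted) pairs copied from the five misses in the output above.
miss_pairs = [
    ("card_arrival", "card_delivery_estimate"),
    ("card_arrival", "card_delivery_estimate"),
    ("card_arrival", "card_delivery_estimate"),
    ("card_delivery_estimate", "card_arrival"),
    ("card_delivery_estimate", "card_arrival"),
]

confusions = Counter(miss_pairs)

# Most frequent confusion first.
for (expected, predicted), count in confusions.most_common():
    print(f"{expected} -> {predicted}: {count}")
```

Three of the five misses are tracking questions labeled card_arrival, which suggests where extra labeled examples would help most.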
Save and Reuse the Optimized Program
Optimization can take time and spend LLM tokens, so you do not want to run it every time you classify a query. Instead, save the optimized classifier once so it can be loaded later:
save_path = "optimized_banking_classifier.json"
optimized_classifier.save(save_path)

When you need the classifier again, rebuild the same DSPy module and load the saved file:
loaded_classifier = dspy.ChainOfThought(ClassifyBankingIntent)
loaded_classifier.load(path=save_path)

prediction = loaded_classifier(
    query="I have been waiting two weeks and my new card still has not arrived."
)

print(prediction.intent)

Output
card_arrival

This skips optimization and reuses the same saved prompt, so you avoid the optimization cost and every run starts from an identical program.
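One way to wire this into a script is a load-or-optimize helper: reuse the saved file when it exists, otherwise optimize once and save. The sketch below is an assumption about how you might structure it, not a DSPy API. `FakeProgram` is a stand-in with `save` and `load` methods so the example runs without an LLM; with DSPy you would pass functions that build the `ChainOfThought` module and call `optimizer.compile`.

```python
import json
import tempfile
from pathlib import Path


class FakeProgram:
    """Stand-in for a DSPy module: holds demos and can save/load them as JSON."""

    def __init__(self):
        self.demos = []

    def save(self, path):
        Path(path).write_text(json.dumps({"demos": self.demos}))

    def load(self, path):
        self.demos = json.loads(Path(path).read_text())["demos"]


def load_or_optimize(path, build_program, optimize_program):
    """Return (program, source), where source is 'loaded' or 'optimized'."""
    program = build_program()
    if Path(path).exists():
        program.load(path)  # skip optimization entirely
        return program, "loaded"
    optimized = optimize_program(program)
    optimized.save(path)
    return optimized, "optimized"


def fake_optimize(program):
    # Hypothetical optimizer: just attaches one demo.
    program.demos = [{"query": "My card has not arrived yet.", "intent": "card_arrival"}]
    return program


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "classifier.json"
    _, first = load_or_optimize(path, FakeProgram, fake_optimize)
    reloaded, second = load_or_optimize(path, FakeProgram, fake_optimize)
    print(first, second, len(reloaded.demos))
```

The first call optimizes and writes the file; the second call finds the file and only loads it.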
Final Thoughts
DSPy is worth using when an LLM workflow will run repeatedly and quality matters. It is especially useful when you have:

Labeled examples
A metric
Several prompt or module versions to compare
A task that will evolve over time

It is probably too much for one-off prompts, quick brainstorming, or tasks where you do not have examples to evaluate against.
In this article, we followed the core DSPy workflow:

Define the task
Run the task with different strategies
Evaluate each version
Optimize the workflow with examples
Save the optimized result for reuse

Once this workflow is familiar, you can extend it with larger dev sets, more intent labels, and richer metrics. For more advanced optimization, explore DSPy’s MIPROv2 and GEPA docs.
Related Tutorials

Structured Output Tools for LLMs: Instructor, PydanticAI, LangChain, Outlines, and Guidance Compared: Compares libraries that force LLMs to return valid, typed outputs, the same problem DSPy signatures solve.
Build Production-Ready RAG Systems with MLflow Quality Metrics: Measures LLM output quality with metrics, complementing DSPy’s evaluate-and-optimize loop.

Build Production-Ready RAG Systems with MLflow Quality Metrics

Table of Contents

What is MLflow GenAI?
Article Overview
Quick Setup
Installation
Environment Configuration
Importing Libraries
RAG System with Ollama Llama3.2
Evaluation Dataset

Core RAG Metrics
Faithfulness Evaluation
Answer Relevance Evaluation

Running and Interpreting Results
Comprehensive Evaluation with MLflow
Viewing Results in MLflow Dashboard

Interpreting the Results
Next Steps

How do you know if your AI model actually works? AI model outputs can be inconsistent – sometimes providing inaccurate responses, irrelevant information, or answers that don’t align with the input context. Manual evaluation of these issues is time-consuming and doesn’t scale as your system grows.
MLflow for GenAI solves this problem by automating evaluation across two critical areas:

Faithfulness: Ensuring responses match retrieved context
Answer Relevance: Verifying outputs address user questions

Key Takeaways
Here’s what you’ll learn:

Automate RAG quality assessment with faithfulness and relevance scoring using MLflow
Build production-ready evaluation pipelines that scale from prototype to enterprise
Track experiment results in interactive MLflow dashboards with zero manual scoring
Implement AI judges powered by GPT-4 for consistent evaluation at scale
Identify low-performing questions with scores below 3.0 for targeted improvements

What is MLflow GenAI?
MLflow is an open-source platform for managing machine learning lifecycles – tracking experiments, packaging models, and managing deployments. Traditional MLflow focuses on numerical metrics like accuracy and loss.
MLflow for GenAI extends this foundation specifically for generative AI applications. It evaluates subjective qualities that numerical metrics can’t capture:

Response relevance: Measures whether outputs address user questions
Factual accuracy: Checks if responses stay truthful to source material
Context adherence: Evaluates whether answers stick to retrieved information
Automated scoring: Uses AI judges instead of manual evaluation
Scalable assessment: Handles large datasets without human reviewers

Article Overview
This article walks you through a complete AI evaluation workflow. You’ll build a RAG (Retrieval-Augmented Generation) system, test it with real data, and measure its performance using automated tools. For comprehensive RAG fundamentals, see our LangChain and Ollama guide.
What you’ll build:

RAG system: Create a question-answering system using Ollama’s Llama3.2
Test dataset: Design evaluation data that reveals system strengths and weaknesses
Automated evaluation: Use OpenAI-powered metrics to score response quality
MLflow interface: Track experiments and visualize results in an interactive dashboard
Results analysis: Interpret scores and identify areas for improvement

Quick Setup
Installation
Start by installing the necessary packages for this guide.
pip install 'mlflow>=3.0.0rc0' langchain-ollama pandas

Environment Configuration
We’ll use Ollama to run Llama3.2 locally for our RAG system. Ollama lets you download and run AI models on your computer, keeping your question-answering data private while eliminating API costs.
Ensure you have Ollama installed locally and the Llama3.2 model downloaded.
# Install Ollama (if not already installed)
# Visit https://ollama.ai for installation instructions

# Pull the Llama3.2 model
ollama pull llama3.2

Importing Libraries
Import the necessary libraries for our RAG system and MLflow evaluation.
import os
import pandas as pd
import mlflow
from mlflow.metrics.genai import faithfulness, answer_relevance, make_genai_metric
from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Note: Ensure Ollama is installed and llama3.2 model is available
# Run: ollama pull llama3.2

RAG System with Ollama Llama3.2
We’ll create a real RAG (Retrieval-Augmented Generation) system using Ollama’s Llama3.2 model that retrieves context and generates answers.
This function creates a question-answering system that:

Takes a question and available documents as input
Uses up to two of the supplied documents as context
Generates accurate answers using the Llama3.2 model
Returns both the answer and the sources used

def ollama_rag_system(question, context_docs):
    """Real RAG system using Ollama Llama3.2"""
    # Use the first two supplied documents as context (no relevance ranking here)
    retrieved_context = "\n".join(context_docs[:2])

    # Create prompt template
    prompt = ChatPromptTemplate.from_template(
        """Answer the question based on the provided context.
        Be concise and accurate.

        Context: {context}
        Question: {question}

        Answer:"""
    )

    # Initialize Llama3.2 model
    llm = ChatOllama(model="llama3.2", temperature=0)

    # Create chain and get response
    chain = prompt | llm | StrOutputParser()
    answer = chain.invoke({"context": retrieved_context, "question": question})

    return {
        "answer": answer,
        "retrieved_context": retrieved_context,
        "retrieved_docs": context_docs[:2],
    }
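The function above takes the first two documents it is given, so it relies on the caller to pass relevant context. If you want a lightweight ranking step before that slice, one option is word overlap between the question and each document. This is a toy stand-in for a real retriever (for example, embedding search); `rank_by_overlap` is a hypothetical helper, and the documents are shortened from the knowledge base used later in the article.

```python
import re


def rank_by_overlap(question: str, docs: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k documents sharing the most words with the question."""
    question_words = set(re.findall(r"\w+", question.lower()))

    def overlap(doc: str) -> int:
        return len(question_words & set(re.findall(r"\w+", doc.lower())))

    # sorted() is stable, so ties keep their original order.
    return sorted(docs, key=overlap, reverse=True)[:top_k]


docs = [
    "MLflow is an open-source platform for managing the end-to-end machine learning lifecycle.",
    "RAG systems combine retrieval and generation to provide accurate, contextual responses.",
    "Vector databases store document embeddings for efficient similarity search.",
]

best = rank_by_overlap("What are vector databases used for?", docs, top_k=1)
print(best[0])
```

Word overlap ignores meaning and is easily fooled by common words, so treat it as a placeholder until a proper retrieval step is in place.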

For implementing vector databases with Pinecone, see our Pinecone and Ollama semantic search guide.
Evaluation Dataset
An evaluation dataset helps you measure system quality systematically. It reveals how well your RAG system handles different question types and identifies areas for improvement.
To create an evaluation dataset, start with a knowledge base of documents that answer questions. Build the dataset with questions, expected answers, and context from this knowledge base.

For processing complex PDFs into RAG-ready data, explore our Docling document processing guide.

knowledge_base = [
    "MLflow is an open-source platform for managing the end-to-end machine learning lifecycle. It provides experiment tracking, model packaging, versioning, and deployment capabilities.",
    "RAG systems combine retrieval and generation to provide accurate, contextual responses. They first retrieve relevant documents then generate answers.",
    "Vector databases store document embeddings for efficient similarity search. They enable fast retrieval of relevant information.",
]

eval_data = pd.DataFrame({
    "question": [
        "What is MLflow?",
        "How does RAG work?",
        "What are vector databases used for?",
    ],
    "expected_answer": [
        "MLflow is an open-source platform for managing machine learning workflows",
        "RAG combines retrieval and generation for contextual responses",
        "Vector databases store embeddings for similarity search",
    ],
    "context": [
        knowledge_base[0],
        knowledge_base[1],
        knowledge_base[2],
    ],
})

eval_data

| Index | Question | Expected Answer | Context |
|---|---|---|---|
| 0 | What is MLflow? | Open-source ML workflow platform | MLflow manages ML lifecycles with tracking, packaging… |
| 1 | How does RAG work? | Combines retrieval and generation | RAG systems retrieve documents then generate answers… |
| 2 | What are vector databases used for? | Store embeddings for similarity search | Vector databases enable fast retrieval of information… |
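
Before generating answers, it can help to sanity-check the dataset itself. The sketch below is one hypothetical way to do that with plain Python records: it flags rows with empty fields and rows whose context does not come from the knowledge base. The function name validate_eval_records and the sample records are illustrative assumptions, not part of MLflow.

```python
def validate_eval_records(records, knowledge_base):
    """Return a list of human-readable problems found in the evaluation records."""
    required = ("question", "expected_answer", "context")
    problems = []
    for i, record in enumerate(records):
        # Every row needs a non-empty value for each required field
        for field in required:
            value = record.get(field)
            if value is None or not str(value).strip():
                problems.append(f"row {i}: missing or empty '{field}'")
        # The context should be one of the knowledge base documents
        context = record.get("context")
        if context and context not in knowledge_base:
            problems.append(f"row {i}: context not found in knowledge base")
    return problems


knowledge_base = [
    "MLflow is an open-source platform for managing the end-to-end machine learning lifecycle.",
]
records = [
    {
        "question": "What is MLflow?",
        "expected_answer": "An open-source ML platform",
        "context": knowledge_base[0],
    },
    {
        # Deliberately broken row to show what the check reports
        "question": "How does RAG work?",
        "expected_answer": "",
        "context": "An unrelated passage",
    },
]
problems = validate_eval_records(records, knowledge_base)
for problem in problems:
    print(problem)
```

With a pandas DataFrame, eval_data.to_dict("records") produces a list of records in this shape.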

Generate answers for each question using the RAG system. This creates the responses we’ll evaluate for quality and accuracy.
# Generate answers for evaluation
def generate_answers(row):
    result = ollama_rag_system(row['question'], [row['context']])
    return result['answer']

eval_data['generated_answer'] = eval_data.apply(generate_answers, axis=1)

Print the first row to see the question, context, and generated answer.
# Display the first row to see question, context, and answer
print(f"Question: {eval_data.iloc[0]['question']}")
print(f"Context: {eval_data.iloc[0]['context']}")
print(f"Generated Answer: {eval_data.iloc[0]['generated_answer']}")

The output displays three key components:

The question shows what we asked.
The context shows which documents the system used to generate the answer.
The answer contains the RAG system’s response.

Question: What is MLflow?
Context: MLflow is an open-source platform for managing the end-to-end machine learning lifecycle. It provides experiment tracking, model packaging, versioning, and deployment capabilities.
Generated Answer: MLflow is an open-source platform for managing the end-to-end machine learning lifecycle, providing features such as experiment tracking, model packaging, versioning, and deployment capabilities.

Core RAG Metrics
Faithfulness Evaluation
Faithfulness measures whether the generated answer stays true to the retrieved context, which makes it a direct check for hallucination.
In the code below, we define the function evaluate_faithfulness that:

Creates an AI judge using GPT-4 to evaluate faithfulness.
Takes the generated answer, question, and context as input.
Returns a score from 1-5, where 5 indicates perfect faithfulness.

We then apply this function to the evaluation dataset to get the faithfulness score for each question.
# Evaluate faithfulness for each answer
def evaluate_faithfulness(row):
    # Initialize faithfulness metric with OpenAI GPT-4 as judge
    faithfulness_metric = faithfulness(model="openai:/gpt-4")
    score = faithfulness_metric(
        predictions=[row['generated_answer']],
        inputs=[row['question']],
        context=[row['context']],
    )
    return score.scores[0]

eval_data['faithfulness_score'] = eval_data.apply(evaluate_faithfulness, axis=1)
print("Faithfulness Evaluation Results:")
print(eval_data[['question', 'faithfulness_score']])

Faithfulness Evaluation Results:

| Question | Faithfulness Score |
|---|---|
| What is MLflow? | 5 |
| How does RAG work? | 5 |
| What are vector databases used for? | 5 |

Perfect scores of 5 show that the RAG system's answers remain faithful to the source material. The judge detected no hallucination or unsupported claims.
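
To build intuition for what faithfulness rewards, here is a rough lexical proxy: the fraction of answer words that also appear in the context. This is not how MLflow's faithfulness metric works (that relies on an LLM judge), and the function name token_support_ratio is made up for this sketch. It only illustrates the idea that a faithful answer is grounded in its context.

```python
import re


def token_support_ratio(answer, context):
    """Fraction of answer words that also appear in the context (0.0 to 1.0)."""

    def tokenize(text):
        # Lowercase and split into alphanumeric word tokens
        return re.findall(r"[a-z0-9]+", text.lower())

    answer_tokens = tokenize(answer)
    if not answer_tokens:
        return 0.0
    context_tokens = set(tokenize(context))
    supported = sum(1 for token in answer_tokens if token in context_tokens)
    return supported / len(answer_tokens)


context = "MLflow is an open-source platform for managing the machine learning lifecycle."
grounded = token_support_ratio("MLflow is an open-source platform", context)
ungrounded = token_support_ratio("MLflow was created by NASA in 1995", context)
print(f"grounded: {grounded:.2f}, ungrounded: {ungrounded:.2f}")
```

A word-overlap score like this misses paraphrases and cannot tell a contradiction from a supported claim, which is why an LLM judge is used for the actual metric.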
Answer Relevance Evaluation
Answer relevance measures whether the response actually addresses the question asked:
# Evaluate answer relevance
def evaluate_relevance(row):
    # Initialize answer relevance metric
    relevance_metric = answer_relevance(model="openai:/gpt-4")
    score = relevance_metric(
        predictions=[row['generated_answer']],
        inputs=[row['question']],
    )
    return score.scores[0]

eval_data['relevance_score'] = eval_data.apply(evaluate_relevance, axis=1)
print("Answer Relevance Results:")
print(eval_data[['question', 'relevance_score']])

Answer Relevance Results:

| Question | Relevance Score |
|---|---|
| What is MLflow? | 5 |
| How does RAG work? | 5 |
| What are vector databases used for? | 5 |

Perfect scores of 5 show the RAG system’s responses directly address the questions asked. No irrelevant or off-topic answers were generated.
Running and Interpreting Results
We’ll now combine individual metrics into a comprehensive MLflow evaluation. This creates detailed reports, tracks experiments, and enables result comparison. Finally, we’ll analyze the scores to identify areas for improvement.
Comprehensive Evaluation with MLflow
Start by using MLflow’s evaluation framework to run all metrics together.
The following code:

Defines a model function that MLflow can evaluate systematically
Takes a DataFrame of questions and processes them through the RAG system
Converts results to a list format required by MLflow
Combines all metrics into a single evaluation run for comprehensive reporting

# Define the judge metrics at module level so mlflow.evaluate can use them
faithfulness_metric = faithfulness(model="openai:/gpt-4")
relevance_metric = answer_relevance(model="openai:/gpt-4")

# Prepare data for MLflow evaluation
def rag_model_function(input_df):
    """Model function for MLflow evaluation"""
    def process_row(row):
        result = ollama_rag_system(row["question"], [row["context"]])
        return result["answer"]

    return input_df.apply(process_row, axis=1).tolist()

# Run comprehensive evaluation
with mlflow.start_run() as run:
    evaluation_results = mlflow.evaluate(
        model=rag_model_function,
        data=eval_data[
            ["question", "context", "expected_answer"]
        ],  # Include expected_answer column
        targets="expected_answer",
        extra_metrics=[faithfulness_metric, relevance_metric],
        evaluator_config={
            "col_mapping": {
                "inputs": "question",
                "context": "context",
                "predictions": "predictions",
                "targets": "expected_answer",
            }
        },
    )

After running the code, the evaluation results get stored in MLflow’s tracking system. You can now compare different runs and analyze performance metrics through the dashboard.
Viewing Results in MLflow Dashboard
Launch the MLflow UI to explore evaluation results interactively:
mlflow ui

Navigate to http://localhost:5000 to access the dashboard.
The MLflow dashboard shows the Experiments table with two evaluation runs. Each run displays the run name (like “bold-slug-816”), creation time, dataset information, and duration. You can select runs to compare their performance metrics.

Click on any experiment to see the details of the evaluation. When you scroll down to the Metrics section, you will see detailed evaluation metrics including faithfulness and relevance scores for each question.

Clicking on “Traces” will show you the detailed request-response pairs for each evaluation question for debugging and analysis.

Clicking on “Artifacts” reveals the evaluation results table containing the complete evaluation data, metric scores, and a downloadable format for external analysis.

Interpreting the Results
Raw scores need interpretation to drive improvements. Use MLflow’s evaluation data to identify specific areas for enhancement.
The analysis:

Extracts performance metrics from comprehensive evaluation results
Calculates mean scores across all questions for both metrics
Identifies underperforming questions that require attention
Generates targeted feedback for systematic improvement

def interpret_evaluation_results(evaluation_results):
    """Analyze MLflow evaluation results"""

    # Extract metrics and data
    metrics = evaluation_results.metrics
    eval_table = evaluation_results.tables['eval_results_table']

    # Overall performance
    avg_faithfulness = metrics.get('faithfulness/v1/mean', 0)
    avg_relevance = metrics.get('answer_relevance/v1/mean', 0)

    print("Average Scores:")
    print(f"Faithfulness: {avg_faithfulness:.2f}")
    print(f"Answer Relevance: {avg_relevance:.2f}")

    # Identify problematic questions
    low_performing = eval_table[
        (eval_table['faithfulness/v1/score'] < 3) |
        (eval_table['answer_relevance/v1/score'] < 3)
    ]

    if not low_performing.empty:
        print(f"\nQuestions needing improvement: {len(low_performing)}")
        for _, row in low_performing.iterrows():
            print(f"- {row['inputs']}")
    else:
        print("\nAll questions performing well!")

# Usage
interpret_evaluation_results(evaluation_results)

Average Scores:
Faithfulness: 5.00
Answer Relevance: 5.00

All questions performing well!

Perfect scores indicate the RAG system generates accurate, contextual responses without hallucination. This baseline establishes a benchmark for future system modifications and more complex evaluation datasets.
Next Steps
This evaluation framework provides the foundation for systematically improving your RAG system:

Regular Evaluation: Run these metrics on your test dataset with each system change
Threshold Setting: Establish minimum acceptable scores for each metric based on your requirements
Automated Monitoring: Integrate these evaluations into your CI/CD pipeline
Iterative Improvement: Use the insights to guide retrieval improvements, prompt engineering, and model selection
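
As one possible way to apply threshold setting and automated monitoring, the sketch below compares aggregate scores against minimum values and reports any metric that falls short. The metric keys match the ones read from evaluation_results.metrics earlier in this article. The function name check_thresholds, the score values, and the 4.0 minimums are assumptions for illustration.

```python
def check_thresholds(metrics, thresholds):
    """Return (metric_name, actual, minimum) for every metric below its minimum."""
    failures = []
    for name, minimum in thresholds.items():
        actual = metrics.get(name)
        # A missing metric counts as a failure so gaps are not silently ignored
        if actual is None or actual < minimum:
            failures.append((name, actual, minimum))
    return failures


# Hypothetical aggregate scores, shaped like evaluation_results.metrics
metrics = {"faithfulness/v1/mean": 5.0, "answer_relevance/v1/mean": 3.5}
# Hypothetical minimum acceptable scores on the 1-5 scale
thresholds = {"faithfulness/v1/mean": 4.0, "answer_relevance/v1/mean": 4.0}

failures = check_thresholds(metrics, thresholds)
for name, actual, minimum in failures:
    print(f"{name}: {actual} is below the minimum of {minimum}")
```

In a CI job, a non-empty failures list could be turned into a non-zero exit code so that a regression blocks the change.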

For versioning your ML experiments and models systematically, see our DVC version control guide.
The combination of faithfulness and answer relevance metrics gives you a clear view of your RAG system's generation quality, enabling data-driven improvements and reliable quality assurance.
