Table of Contents
- What is MLflow GenAI?
- Article Overview
- Quick Setup
- Core RAG Metrics
- Running and Interpreting Results
- Interpreting the Results
- Next Steps
Build Production-Ready RAG Systems with MLflow Quality Metrics
How do you know if your AI model actually works? AI model outputs can be inconsistent – sometimes providing inaccurate responses, irrelevant information, or answers that don’t align with the input context. Manual evaluation of these issues is time-consuming and doesn’t scale as your system grows.
MLflow for GenAI automates evaluation across two critical dimensions: faithfulness (whether responses stay grounded in the retrieved context) and answer relevance (whether outputs address the user's question). This guide teaches you to implement these evaluations and systematically improve your AI system’s performance.
What is MLflow GenAI?
MLflow is an open-source platform for managing machine learning lifecycles – tracking experiments, packaging models, and managing deployments. Traditional MLflow focuses on numerical metrics like accuracy and loss.
MLflow for GenAI extends this foundation specifically for generative AI applications. It evaluates subjective qualities that numerical metrics can’t capture:
- Response relevance: Measures whether outputs address user questions
- Factual accuracy: Checks if responses stay truthful to source material
- Context adherence: Evaluates whether answers stick to retrieved information
- Automated scoring: Uses AI judges instead of manual evaluation
- Scalable assessment: Handles large datasets without human reviewers
Article Overview
This guide walks you through a complete AI evaluation workflow. You’ll build a RAG (Retrieval-Augmented Generation) system, test it with real data, and measure its performance using automated tools. For comprehensive RAG fundamentals, see our LangChain and Ollama guide.
What you’ll build:
- RAG system: Create a question-answering system using Ollama’s Llama3.2 model
- Test dataset: Design evaluation data that reveals system strengths and weaknesses
- Automated evaluation: Use OpenAI-powered metrics to score response quality
- MLflow interface: Track experiments and visualize results in an interactive dashboard
- Results analysis: Interpret scores and identify areas for improvement
Quick Setup
Installation
Start by installing the necessary packages for this guide.
pip install 'mlflow>=3.0.0rc0' langchain-ollama pandas
Environment Configuration
We’ll use Ollama to run Llama3.2 locally for our RAG system. Ollama lets you download and run AI models on your own machine, keeping your question-answering data private while eliminating API costs for generation.
Ensure you have Ollama installed locally and the Llama3.2 model downloaded. The faithfulness and answer relevance metrics later in this guide use OpenAI’s GPT-4 as an LLM judge, so you will also need an OpenAI API key available in the OPENAI_API_KEY environment variable.
# Install Ollama (if not already installed)
# Visit https://ollama.ai for installation instructions
# Pull the Llama3.2 model
ollama pull llama3.2
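If you want to confirm the local model is reachable before building the pipeline, a quick sanity check like the one below works. It assumes the Ollama service is running on its default local endpoint with the model already pulled.
# Optional sanity check: confirm the local Llama3.2 model responds
from langchain_ollama import ChatOllama

llm = ChatOllama(model="llama3.2", temperature=0)
print(llm.invoke("Reply with a single word: ready").content)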
Importing Libraries
Import the necessary libraries for our RAG system and MLflow evaluation.
import os
import pandas as pd
import mlflow
from mlflow.metrics.genai import faithfulness, answer_relevance, make_genai_metric
from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
# Note: Ensure Ollama is installed and llama3.2 model is available
# Run: ollama pull llama3.2
RAG System with Ollama Llama3.2
We’ll build a RAG (Retrieval-Augmented Generation) pipeline around Ollama’s Llama3.2 model that takes retrieved context and generates answers.
This function creates a question-answering system that:
- Takes a question and a list of candidate documents as input
- Uses the first two documents as context (a simple stand-in for a similarity-based retriever)
- Generates an answer with the Llama3.2 model
- Returns the answer along with the retrieved context and sources used
def ollama_rag_system(question, context_docs):
    """RAG pipeline using Ollama Llama3.2."""
    # Use the first two documents as the retrieved context
    retrieved_context = "\n".join(context_docs[:2])

    # Create the prompt template
    prompt = ChatPromptTemplate.from_template(
        """Answer the question based on the provided context.
Be concise and accurate.

Context: {context}

Question: {question}

Answer:"""
    )

    # Initialize the Llama3.2 model
    llm = ChatOllama(model="llama3.2", temperature=0)

    # Create the chain and generate the answer
    chain = prompt | llm | StrOutputParser()
    answer = chain.invoke({"context": retrieved_context, "question": question})

    return {
        "answer": answer,
        "retrieved_context": retrieved_context,
        "retrieved_docs": context_docs[:2],
    }
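As a quick smoke test, you can call the helper directly with a sample question and a couple of documents (the documents below are illustrative):
# Quick check of the RAG helper (assumes Ollama and llama3.2 are available)
sample_docs = [
    "MLflow is an open-source platform for managing the machine learning lifecycle.",
    "It provides experiment tracking, model packaging, and deployment capabilities.",
]
result = ollama_rag_system("What is MLflow?", sample_docs)
print(result["answer"])
print(result["retrieved_docs"])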
Evaluation Dataset
An evaluation dataset helps you measure system quality systematically. It reveals how well your RAG system handles different question types and identifies areas for improvement.
To create an evaluation dataset, start with a knowledge base of documents that answer questions. Build the dataset with questions, expected answers, and context from this knowledge base.
knowledge_base = [
    "MLflow is an open-source platform for managing the end-to-end machine learning lifecycle. It provides experiment tracking, model packaging, versioning, and deployment capabilities.",
    "RAG systems combine retrieval and generation to provide accurate, contextual responses. They first retrieve relevant documents then generate answers.",
    "Vector databases store document embeddings for efficient similarity search. They enable fast retrieval of relevant information.",
]

eval_data = pd.DataFrame({
    "question": [
        "What is MLflow?",
        "How does RAG work?",
        "What are vector databases used for?",
    ],
    "expected_answer": [
        "MLflow is an open-source platform for managing machine learning workflows",
        "RAG combines retrieval and generation for contextual responses",
        "Vector databases store embeddings for similarity search",
    ],
    "context": [
        knowledge_base[0],
        knowledge_base[1],
        knowledge_base[2],
    ],
})
eval_data
Index | Question | Expected Answer | Context |
---|---|---|---|
0 | What is MLflow? | Open-source ML workflow platform | MLflow manages ML lifecycles with tracking, packaging… |
1 | How does RAG work? | Combines retrieval and generation | RAG systems retrieve documents then generate answers… |
2 | What are vector databases used for? | Store embeddings for similarity search | Vector databases enable fast retrieval of information… |
Generate answers for each question using the RAG system. This creates the responses we’ll evaluate for quality and accuracy.
# Generate answers for evaluation
def generate_answers(row):
    result = ollama_rag_system(row['question'], [row['context']])
    return result['answer']

eval_data['generated_answer'] = eval_data.apply(generate_answers, axis=1)
Print the first row to see the question, context, and generated answer.
# Display the first row to see question, context, and answer
print(f"Question: {eval_data.iloc[0]['question']}")
print(f"Context: {eval_data.iloc[0]['context']}")
print(f"Generated Answer: {eval_data.iloc[0]['generated_answer']}")
The output displays three key components:
- The question shows what we asked.
- The context shows which documents the system used to generate the answer.
- The answer contains the RAG system’s response.
Question: What is MLflow?
Context: MLflow is an open-source platform for managing the end-to-end machine learning lifecycle. It provides experiment tracking, model packaging, versioning, and deployment capabilities.
Generated Answer: MLflow is an open-source platform for managing the end-to-end machine learning lifecycle, providing features such as experiment tracking, model packaging, versioning, and deployment capabilities.
Core RAG Metrics
Faithfulness Evaluation
Faithfulness measures whether the generated answer stays true to the retrieved context, preventing hallucination:
In the code below, we initialize a faithfulness metric that uses GPT-4 as an AI judge, then define the function evaluate_faithfulness that:
- Takes the generated answer, question, and context as input
- Scores how well the answer is supported by the retrieved context
- Returns a score from 1 to 5, where 5 indicates perfect faithfulness
We then apply this function to the evaluation dataset to get a faithfulness score for each question. The metric object is created at the top level so it can be reused in the comprehensive MLflow evaluation later.
# Initialize the faithfulness metric with OpenAI GPT-4 as the judge
# (top-level so it can be reused in the comprehensive evaluation later)
faithfulness_metric = faithfulness(model="openai:/gpt-4")

# Evaluate faithfulness for each answer
def evaluate_faithfulness(row):
    score = faithfulness_metric(
        predictions=[row['generated_answer']],
        inputs=[row['question']],
        context=[row['context']],
    )
    return score.scores[0]

eval_data['faithfulness_score'] = eval_data.apply(evaluate_faithfulness, axis=1)

print("Faithfulness Evaluation Results:")
print(eval_data[['question', 'faithfulness_score']])
Faithfulness Evaluation Results:
Question | Faithfulness Score |
---|---|
What is MLflow? | 5 |
How does RAG work? | 5 |
What are vector databases used for? | 5 |
Perfect scores of 5 show the RAG system answers remain faithful to the source material. No hallucination or unsupported claims were detected.
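The judge also returns a short justification for each score, which helps when a score is lower than expected. Below is a minimal sketch of inspecting it for a single row, reusing the faithfulness_metric defined above; it assumes the returned value exposes per-row justifications, as MLflow’s MetricValue does.
# Inspect the judge's reasoning for the first row
row = eval_data.iloc[0]
result = faithfulness_metric(
    predictions=[row['generated_answer']],
    inputs=[row['question']],
    context=[row['context']],
)
print("Score:", result.scores[0])
print("Justification:", result.justifications[0])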
Answer Relevance Evaluation
Answer relevance measures whether the response actually addresses the question asked:
# Initialize the answer relevance metric with OpenAI GPT-4 as the judge
relevance_metric = answer_relevance(model="openai:/gpt-4")

# Evaluate answer relevance for each answer
def evaluate_relevance(row):
    score = relevance_metric(
        predictions=[row['generated_answer']],
        inputs=[row['question']],
    )
    return score.scores[0]

eval_data['relevance_score'] = eval_data.apply(evaluate_relevance, axis=1)

print("Answer Relevance Results:")
print(eval_data[['question', 'relevance_score']])
Answer Relevance Results:
Question | Relevance Score |
---|---|
What is MLflow? | 5 |
How does RAG work? | 5 |
What are vector databases used for? | 5 |
Perfect scores of 5 show the RAG system’s responses directly address the questions asked. No irrelevant or off-topic answers were generated.
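The built-in judges cover the most common RAG failure modes, but make_genai_metric (imported earlier) lets you define additional LLM-scored criteria of your own. Below is a hedged sketch of a custom conciseness metric; the name, definition, and grading prompt are illustrative, not part of MLflow.
# Illustrative custom metric: the definition and grading prompt are examples only
conciseness_metric = make_genai_metric(
    name="conciseness",
    definition="Conciseness measures whether the answer avoids unnecessary detail and repetition.",
    grading_prompt=(
        "Score 1 if the answer is verbose or repetitive, 3 if it is somewhat wordy, "
        "and 5 if it is brief while still answering the question."
    ),
    model="openai:/gpt-4",
    greater_is_better=True,
)
Custom metrics created this way can be passed to mlflow.evaluate through extra_metrics alongside the built-in ones.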
Running and Interpreting Results
We’ll now combine individual metrics into a comprehensive MLflow evaluation. This creates detailed reports, tracks experiments, and enables result comparison. Finally, we’ll analyze the scores to identify areas for improvement.
Comprehensive Evaluation with MLflow
Start by using MLflow’s evaluation framework to run all metrics together.
The following code:
- Defines a model function that MLflow can evaluate systematically
- Takes a DataFrame of questions and processes them through the RAG system
- Converts results to a list format required by MLflow
- Combines all metrics into a single evaluation run for comprehensive reporting
# Prepare data for MLflow evaluation
def rag_model_function(input_df):
    """Model function for MLflow evaluation."""
    def process_row(row):
        result = ollama_rag_system(row["question"], [row["context"]])
        return result["answer"]

    return input_df.apply(process_row, axis=1).tolist()

# Run the comprehensive evaluation, reusing the metric objects defined earlier
with mlflow.start_run() as run:
    evaluation_results = mlflow.evaluate(
        model=rag_model_function,
        data=eval_data[["question", "context", "expected_answer"]],  # include the expected_answer column
        targets="expected_answer",
        extra_metrics=[faithfulness_metric, relevance_metric],
        evaluator_config={
            "col_mapping": {
                "inputs": "question",
                "context": "context",
                "predictions": "predictions",
                "targets": "expected_answer",
            }
        },
    )
After running the code, the evaluation results get stored in MLflow’s tracking system. You can now compare different runs and analyze performance metrics through the dashboard.
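Before opening the UI, you can also inspect the aggregated scores directly from the returned EvaluationResult. The keys follow MLflow’s metric/version/aggregation naming, so they may differ slightly depending on the metric versions you use:
# Print the aggregated metrics and the run ID for reference
for name, value in evaluation_results.metrics.items():
    print(f"{name}: {value}")
print("Run ID:", run.info.run_id)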
Viewing Results in MLflow Dashboard
Launch the MLflow UI to explore evaluation results interactively:
mlflow ui
Navigate to http://localhost:5000 to access the dashboard.
The MLflow dashboard shows the Experiments table with two evaluation runs. Each run displays the run name (like “bold-slug-816”), creation time, dataset information, and duration. You can select runs to compare their performance metrics.
Click on any experiment to see the details of the evaluation. When you scroll down to the Metrics section, you will see detailed evaluation metrics including faithfulness and relevance scores for each question.
Clicking on “Traces” will show you the detailed request-response pairs for each evaluation question for debugging and analysis.
Clicking on “Artifacts” reveals the evaluation results table containing the complete evaluation data, metric scores, and a downloadable format for external analysis.
Interpreting the Results
Raw scores need interpretation to drive improvements. Use MLflow’s evaluation data to identify specific areas for enhancement.
The analysis:
- Extracts performance metrics from comprehensive evaluation results
- Calculates mean scores across all questions for both metrics
- Identifies underperforming questions that require attention
- Generates targeted feedback for systematic improvement
def interpret_evaluation_results(evaluation_results):
    """Analyze MLflow evaluation results."""
    # Extract the aggregated metrics and the per-row results table
    metrics = evaluation_results.metrics
    eval_table = evaluation_results.tables['eval_results_table']

    # Overall performance
    avg_faithfulness = metrics.get('faithfulness/v1/mean', 0)
    avg_relevance = metrics.get('answer_relevance/v1/mean', 0)

    print("Average Scores:")
    print(f"Faithfulness: {avg_faithfulness:.2f}")
    print(f"Answer Relevance: {avg_relevance:.2f}")

    # Identify problematic questions
    low_performing = eval_table[
        (eval_table['faithfulness/v1/score'] < 3) |
        (eval_table['answer_relevance/v1/score'] < 3)
    ]

    if not low_performing.empty:
        print(f"\nQuestions needing improvement: {len(low_performing)}")
        for _, row in low_performing.iterrows():
            print(f"- {row['inputs']}")
    else:
        print("\nAll questions performing well!")

# Usage
interpret_evaluation_results(evaluation_results)
Average Scores:
Faithfulness: 5.00
Answer Relevance: 5.00
All questions performing well!
Perfect scores indicate the RAG system generates accurate, contextual responses without hallucination. This baseline establishes a benchmark for future system modifications and more complex evaluation datasets.
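One way to build a more demanding dataset is to add questions whose answers are only partially covered by the retrieved context; a drop in faithfulness on such rows is an early hallucination signal. Below is a hedged sketch, where the added row is purely illustrative.
# Illustrative harder row: the context intentionally does not cover the question
harder_row = pd.DataFrame({
    "question": ["Does MLflow include a built-in vector database?"],
    "expected_answer": ["The provided context does not answer this question."],
    "context": [knowledge_base[0]],
})
eval_data_extended = pd.concat(
    [eval_data[["question", "expected_answer", "context"]], harder_row],
    ignore_index=True,
)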
Next Steps
This evaluation framework provides the foundation for systematically improving your RAG system:
- Regular Evaluation: Run these metrics on your test dataset with each system change
- Threshold Setting: Establish minimum acceptable scores for each metric based on your requirements
- Automated Monitoring: Integrate these evaluations into your CI/CD pipeline (a minimal gating sketch follows this list)
- Iterative Improvement: Use the insights to guide retrieval improvements, prompt engineering, and model selection
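For threshold setting and automated monitoring, a simple gate in a test suite or CI job can compare the aggregated scores against a minimum you choose. A hedged sketch follows; the 4.0 threshold is arbitrary and the metric keys mirror the ones used above, so adjust both to your requirements.
# Fail the pipeline when the average judge scores drop below a chosen threshold
MIN_SCORE = 4.0

avg_faithfulness = evaluation_results.metrics.get("faithfulness/v1/mean", 0)
avg_relevance = evaluation_results.metrics.get("answer_relevance/v1/mean", 0)

assert avg_faithfulness >= MIN_SCORE, f"Faithfulness {avg_faithfulness:.2f} is below {MIN_SCORE}"
assert avg_relevance >= MIN_SCORE, f"Answer relevance {avg_relevance:.2f} is below {MIN_SCORE}"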
Together, the faithfulness and answer relevance metrics give you a clear view of your RAG system’s response quality, enabling data-driven improvements and reliable quality assurance.