Table of Contents
- What is MLflow GenAI?
- Article Overview
- Quick Setup
- Core RAG Metrics
- Running and Interpreting Results
- Interpreting the Results
- Next Steps
Build Production-Ready RAG Systems with MLflow Quality Metrics
How do you know if your AI model actually works? AI model outputs can be inconsistent – sometimes providing inaccurate responses, irrelevant information, or answers that don’t align with the input context. Manual evaluation of these issues is time-consuming and doesn’t scale as your system grows.
MLflow for GenAI automates evaluation across two critical dimensions: faithfulness (whether responses stay grounded in the retrieved context) and answer relevance (whether outputs address the user's question). This guide teaches you to implement these evaluations and systematically improve your AI system’s performance.
What is MLflow GenAI?
MLflow is an open-source platform for managing machine learning lifecycles – tracking experiments, packaging models, and managing deployments. Traditional MLflow focuses on numerical metrics like accuracy and loss.
MLflow for GenAI extends this foundation specifically for generative AI applications. It evaluates subjective qualities that numerical metrics can’t capture:
- Response relevance: Measures whether outputs address user questions
- Factual accuracy: Checks if responses stay truthful to source material
- Context adherence: Evaluates whether answers stick to retrieved information
- Automated scoring: Uses AI judges instead of manual evaluation
- Scalable assessment: Handles large datasets without human reviewers
Article Overview
This guide walks you through a complete AI evaluation workflow. You’ll build a RAG (Retrieval-Augmented Generation) system, test it with real data, and measure its performance using automated tools. For comprehensive RAG fundamentals, see our LangChain and Ollama guide.
What you’ll build:
- RAG system: Create a question-answering system using Ollama’s Llama3.2 model
- Test dataset: Design evaluation data that reveals system strengths and weaknesses
- Automated evaluation: Use OpenAI-powered metrics to score response quality
- MLflow interface: Track experiments and visualize results in an interactive dashboard
- Results analysis: Interpret scores and identify areas for improvement
Quick Setup
Installation
Start by installing the necessary packages for this guide.
pip install 'mlflow>=3.0.0rc0' langchain-ollama pandas
Environment Configuration
We’ll use Ollama to run Llama3.2 locally for our RAG system. Ollama lets you download and run AI models on your own machine, keeping your question-answering data private while eliminating API costs for generation.
Ensure you have Ollama installed locally and the Llama3.2 model downloaded. The faithfulness and answer relevance metrics later in this guide use OpenAI’s GPT-4 as an LLM judge, so you will also need an OpenAI API key available in the OPENAI_API_KEY environment variable.
# Install Ollama (if not already installed)
# Visit https://ollama.ai for installation instructions
# Pull the Llama3.2 model
ollama pull llama3.2
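If you want to confirm the local model is reachable before building the pipeline, a quick sanity check like the one below works. It assumes the Ollama service is running on its default local endpoint with the model already pulled.
# Optional sanity check: confirm the local Llama3.2 model responds
from langchain_ollama import ChatOllama

llm = ChatOllama(model="llama3.2", temperature=0)
print(llm.invoke("Reply with a single word: ready").content)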
Importing Libraries
Import the necessary libraries for our RAG system and MLflow evaluation.
import os
import pandas as pd
import mlflow
from mlflow.metrics.genai import faithfulness, answer_relevance, make_genai_metric
from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
# Note: Ensure Ollama is installed and llama3.2 model is available
# Run: ollama pull llama3.2
RAG System with Ollama Llama3.2
We’ll build a RAG (Retrieval-Augmented Generation) pipeline around Ollama’s Llama3.2 model that takes retrieved context and generates answers.
This function creates a question-answering system that:
- Takes a question and a list of candidate documents as input
- Uses the first two documents as context (a simple stand-in for a similarity-based retriever)
- Generates an answer with the Llama3.2 model
- Returns the answer along with the retrieved context and sources used
def ollama_rag_system(question, context_docs):
    """RAG pipeline using Ollama Llama3.2."""
    # Use the first two documents as the retrieved context
    retrieved_context = "\n".join(context_docs[:2])

    # Create the prompt template
    prompt = ChatPromptTemplate.from_template(
        """Answer the question based on the provided context.
Be concise and accurate.

Context: {context}

Question: {question}

Answer:"""
    )

    # Initialize the Llama3.2 model
    llm = ChatOllama(model="llama3.2", temperature=0)

    # Create the chain and generate the answer
    chain = prompt | llm | StrOutputParser()
    answer = chain.invoke({"context": retrieved_context, "question": question})

    return {
        "answer": answer,
        "retrieved_context": retrieved_context,
        "retrieved_docs": context_docs[:2],
    }
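As a quick smoke test, you can call the helper directly with a sample question and a couple of documents (the documents below are illustrative):
# Quick check of the RAG helper (assumes Ollama and llama3.2 are available)
sample_docs = [
    "MLflow is an open-source platform for managing the machine learning lifecycle.",
    "It provides experiment tracking, model packaging, and deployment capabilities.",
]
result = ollama_rag_system("What is MLflow?", sample_docs)
print(result["answer"])
print(result["retrieved_docs"])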
Evaluation Dataset
An evaluation dataset helps you measure system quality systematically. It reveals how well your RAG system handles different question types and identifies areas for improvement.
To create an evaluation dataset, start with a knowledge base of documents that answer questions. Build the dataset with questions, expected answers, and context from this knowledge base.
knowledge_base = [
    "MLflow is an open-source platform for managing the end-to-end machine learning lifecycle. It provides experiment tracking, model packaging, versioning, and deployment capabilities.",
    "RAG systems combine retrieval and generation to provide accurate, contextual responses. They first retrieve relevant documents then generate answers.",
    "Vector databases store document embeddings for efficient similarity search. They enable fast retrieval of relevant information.",
]

eval_data = pd.DataFrame({
    "question": [
        "What is MLflow?",
        "How does RAG work?",
        "What are vector databases used for?",
    ],
    "expected_answer": [
        "MLflow is an open-source platform for managing machine learning workflows",
        "RAG combines retrieval and generation for contextual responses",
        "Vector databases store embeddings for similarity search",
    ],
    "context": [
        knowledge_base[0],
        knowledge_base[1],
        knowledge_base[2],
    ],
})
eval_data
Index | Question | Expected Answer | Context |
---|---|---|---|
0 | What is MLflow? | Open-source ML workflow platform | MLflow manages ML lifecycles with tracking, packaging… |
1 | How does RAG work? | Combines retrieval and generation | RAG systems retrieve documents then generate answers… |
2 | What are vector databases used for? | Store embeddings for similarity search | Vector databases enable fast retrieval of information… |
Generate answers for each question using the RAG system. This creates the responses we’ll evaluate for quality and accuracy.
# Generate answers for evaluation
def generate_answers(row):
    result = ollama_rag_system(row['question'], [row['context']])
    return result['answer']

eval_data['generated_answer'] = eval_data.apply(generate_answers, axis=1)
Print the first row to see the question, context, and generated answer.
# Display the first row to see question, context, and answer
print(f"Question: {eval_data.iloc[0]['question']}")
print(f"Context: {eval_data.iloc[0]['context']}")
print(f"Generated Answer: {eval_data.iloc[0]['generated_answer']}")
The output displays three key components:
- The question shows what we asked.
- The context shows which documents the system used to generate the answer.
- The answer contains the RAG system’s response.
Question: What is MLflow?
Context: MLflow is an open-source platform for managing the end-to-end machine learning lifecycle. It provides experiment tracking, model packaging, versioning, and deployment capabilities.
Generated Answer: MLflow is an open-source platform for managing the end-to-end machine learning lifecycle, providing features such as experiment tracking, model packaging, versioning, and deployment capabilities.
Core RAG Metrics
Faithfulness Evaluation
Faithfulness measures whether the generated answer stays true to the retrieved context, preventing hallucination:
In the code below, we initialize a faithfulness metric that uses GPT-4 as an AI judge, then define the function evaluate_faithfulness that:
- Takes the generated answer, question, and context as input
- Scores how well the answer is supported by the retrieved context
- Returns a score from 1 to 5, where 5 indicates perfect faithfulness
We then apply this function to the evaluation dataset to get a faithfulness score for each question. The metric object is created at the top level so it can be reused in the comprehensive MLflow evaluation later.
# Initialize the faithfulness metric with OpenAI GPT-4 as the judge
# (top-level so it can be reused in the comprehensive evaluation later)
faithfulness_metric = faithfulness(model="openai:/gpt-4")

# Evaluate faithfulness for each answer
def evaluate_faithfulness(row):
    score = faithfulness_metric(
        predictions=[row['generated_answer']],
        inputs=[row['question']],
        context=[row['context']],
    )
    return score.scores[0]

eval_data['faithfulness_score'] = eval_data.apply(evaluate_faithfulness, axis=1)

print("Faithfulness Evaluation Results:")
print(eval_data[['question', 'faithfulness_score']])
Faithfulness Evaluation Results:
Question | Faithfulness Score |
---|---|
What is MLflow? | 5 |
How does RAG work? | 5 |
What are vector databases used for? | 5 |
Perfect scores of 5 show the RAG system answers remain faithful to the source material. No hallucination or unsupported claims were detected.
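The judge also returns a short justification for each score, which helps when a score is lower than expected. Below is a minimal sketch of inspecting it for a single row, reusing the faithfulness_metric defined above; it assumes the returned value exposes per-row justifications, as MLflow’s MetricValue does.
# Inspect the judge's reasoning for the first row
row = eval_data.iloc[0]
result = faithfulness_metric(
    predictions=[row['generated_answer']],
    inputs=[row['question']],
    context=[row['context']],
)
print("Score:", result.scores[0])
print("Justification:", result.justifications[0])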
Answer Relevance Evaluation
Answer relevance measures whether the response actually addresses the question asked:
# Initialize the answer relevance metric with OpenAI GPT-4 as the judge
relevance_metric = answer_relevance(model="openai:/gpt-4")

# Evaluate answer relevance for each answer
def evaluate_relevance(row):
    score = relevance_metric(
        predictions=[row['generated_answer']],
        inputs=[row['question']],
    )
    return score.scores[0]

eval_data['relevance_score'] = eval_data.apply(evaluate_relevance, axis=1)

print("Answer Relevance Results:")
print(eval_data[['question', 'relevance_score']])
Answer Relevance Results:
Question | Relevance Score |
---|---|
What is MLflow? | 5 |
How does RAG work? | 5 |
What are vector databases used for? | 5 |
Perfect scores of 5 show the RAG system’s responses directly address the questions asked. No irrelevant or off-topic answers were generated.
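The built-in judges cover the most common RAG failure modes, but make_genai_metric (imported earlier) lets you define additional LLM-scored criteria of your own. Below is a hedged sketch of a custom conciseness metric; the name, definition, and grading prompt are illustrative, not part of MLflow.
# Illustrative custom metric: the definition and grading prompt are examples only
conciseness_metric = make_genai_metric(
    name="conciseness",
    definition="Conciseness measures whether the answer avoids unnecessary detail and repetition.",
    grading_prompt=(
        "Score 1 if the answer is verbose or repetitive, 3 if it is somewhat wordy, "
        "and 5 if it is brief while still answering the question."
    ),
    model="openai:/gpt-4",
    greater_is_better=True,
)
Custom metrics created this way can be passed to mlflow.evaluate through extra_metrics alongside the built-in ones.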
Running and Interpreting Results
We’ll now combine individual metrics into a comprehensive MLflow evaluation. This creates detailed reports, tracks experiments, and enables result comparison. Finally, we’ll analyze the scores to identify areas for improvement.
Comprehensive Evaluation with MLflow
Start by using MLflow’s evaluation framework to run all metrics together.
The following code:
- Defines a model function that MLflow can evaluate systematically
- Takes a DataFrame of questions and processes them through the RAG system
- Converts results to a list format required by MLflow
- Combines all metrics into a single evaluation run for comprehensive reporting
# Prepare data for MLflow evaluation
def rag_model_function(input_df):
    """Model function for MLflow evaluation."""
    def process_row(row):
        result = ollama_rag_system(row["question"], [row["context"]])
        return result["answer"]

    return input_df.apply(process_row, axis=1).tolist()

# Run the comprehensive evaluation, reusing the metric objects defined earlier
with mlflow.start_run() as run:
    evaluation_results = mlflow.evaluate(
        model=rag_model_function,
        data=eval_data[["question", "context", "expected_answer"]],  # include the expected_answer column
        targets="expected_answer",
        extra_metrics=[faithfulness_metric, relevance_metric],
        evaluator_config={
            "col_mapping": {
                "inputs": "question",
                "context": "context",
                "predictions": "predictions",
                "targets": "expected_answer",
            }
        },
    )
After running the code, the evaluation results get stored in MLflow’s tracking system. You can now compare different runs and analyze performance metrics through the dashboard.
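Before opening the UI, you can also inspect the aggregated scores directly from the returned EvaluationResult. The keys follow MLflow’s metric/version/aggregation naming, so they may differ slightly depending on the metric versions you use:
# Print the aggregated metrics and the run ID for reference
for name, value in evaluation_results.metrics.items():
    print(f"{name}: {value}")
print("Run ID:", run.info.run_id)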
Viewing Results in MLflow Dashboard
Launch the MLflow UI to explore evaluation results interactively:
mlflow ui
Navigate to http://localhost:5000 to access the dashboard.
The MLflow dashboard shows the Experiments table with two evaluation runs. Each run displays the run name (like “bold-slug-816”), creation time, dataset information, and duration. You can select runs to compare their performance metrics.
Click on any experiment to see the details of the evaluation. When you scroll down to the Metrics section, you will see detailed evaluation metrics including faithfulness and relevance scores for each question.
Clicking on “Traces” will show you the detailed request-response pairs for each evaluation question for debugging and analysis.
Clicking on “Artifacts” reveals the evaluation results table containing the complete evaluation data, metric scores, and a downloadable format for external analysis.
Interpreting the Results
Raw scores need interpretation to drive improvements. Use MLflow’s evaluation data to identify specific areas for enhancement.
The analysis:
- Extracts performance metrics from comprehensive evaluation results
- Calculates mean scores across all questions for both metrics
- Identifies underperforming questions that require attention
- Generates targeted feedback for systematic improvement
def interpret_evaluation_results(evaluation_results):
    """Analyze MLflow evaluation results."""
    # Extract the aggregated metrics and the per-row results table
    metrics = evaluation_results.metrics
    eval_table = evaluation_results.tables['eval_results_table']

    # Overall performance
    avg_faithfulness = metrics.get('faithfulness/v1/mean', 0)
    avg_relevance = metrics.get('answer_relevance/v1/mean', 0)

    print("Average Scores:")
    print(f"Faithfulness: {avg_faithfulness:.2f}")
    print(f"Answer Relevance: {avg_relevance:.2f}")

    # Identify problematic questions
    low_performing = eval_table[
        (eval_table['faithfulness/v1/score'] < 3) |
        (eval_table['answer_relevance/v1/score'] < 3)
    ]

    if not low_performing.empty:
        print(f"\nQuestions needing improvement: {len(low_performing)}")
        for _, row in low_performing.iterrows():
            print(f"- {row['inputs']}")
    else:
        print("\nAll questions performing well!")

# Usage
interpret_evaluation_results(evaluation_results)
Average Scores:
Faithfulness: 5.00
Answer Relevance: 5.00
All questions performing well!
Perfect scores indicate the RAG system generates accurate, contextual responses without hallucination. This baseline establishes a benchmark for future system modifications and more complex evaluation datasets.
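One way to build a more demanding dataset is to add questions whose answers are only partially covered by the retrieved context; a drop in faithfulness on such rows is an early hallucination signal. Below is a hedged sketch, where the added row is purely illustrative.
# Illustrative harder row: the context intentionally does not cover the question
harder_row = pd.DataFrame({
    "question": ["Does MLflow include a built-in vector database?"],
    "expected_answer": ["The provided context does not answer this question."],
    "context": [knowledge_base[0]],
})
eval_data_extended = pd.concat(
    [eval_data[["question", "expected_answer", "context"]], harder_row],
    ignore_index=True,
)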
Next Steps
This evaluation framework provides the foundation for systematically improving your RAG system:
- Regular Evaluation: Run these metrics on your test dataset with each system change
- Threshold Setting: Establish minimum acceptable scores for each metric based on your requirements
- Automated Monitoring: Integrate these evaluations into your CI/CD pipeline (a minimal gating sketch follows this list)
- Iterative Improvement: Use the insights to guide retrieval improvements, prompt engineering, and model selection
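For threshold setting and automated monitoring, a simple gate in a test suite or CI job can compare the aggregated scores against a minimum you choose. A hedged sketch follows; the 4.0 threshold is arbitrary and the metric keys mirror the ones used above, so adjust both to your requirements.
# Fail the pipeline when the average judge scores drop below a chosen threshold
MIN_SCORE = 4.0

avg_faithfulness = evaluation_results.metrics.get("faithfulness/v1/mean", 0)
avg_relevance = evaluation_results.metrics.get("answer_relevance/v1/mean", 0)

assert avg_faithfulness >= MIN_SCORE, f"Faithfulness {avg_faithfulness:.2f} is below {MIN_SCORE}"
assert avg_relevance >= MIN_SCORE, f"Answer relevance {avg_relevance:.2f} is below {MIN_SCORE}"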
Together, the faithfulness and answer relevance metrics give you a clear view of your RAG system’s response quality, enabling data-driven improvements and reliable quality assurance.