Build Production-Ready RAG Systems with MLflow Quality Metrics

How do you know if your AI model actually works? AI model outputs can be inconsistent – sometimes providing inaccurate responses, irrelevant information, or answers that don’t align with the input context. Manual evaluation of these issues is time-consuming and doesn’t scale as your system grows.

MLflow for GenAI automates evaluation across two critical areas: faithfulness (responses match the retrieved context) and answer relevance (outputs address the user’s question). This guide teaches you to implement these evaluations and systematically improve your AI system’s performance.

What is MLflow GenAI?

MLflow is an open-source platform for managing machine learning lifecycles – tracking experiments, packaging models, and managing deployments. Traditional MLflow focuses on numerical metrics like accuracy and loss.

MLflow for GenAI extends this foundation specifically for generative AI applications. It evaluates subjective qualities that numerical metrics can’t capture:

  • Response relevance: Measures whether outputs address user questions
  • Factual accuracy: Checks if responses stay truthful to source material
  • Context adherence: Evaluates whether answers stick to retrieved information
  • Automated scoring: Uses AI judges instead of manual evaluation
  • Scalable assessment: Handles large datasets without human reviewers
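
For example, the “automated scoring” bullet above corresponds to a one-line judge definition in code. This is only a preview (the full setup appears later in this guide), and the OpenAI judge assumes an OPENAI_API_KEY environment variable is set.

from mlflow.metrics.genai import answer_relevance

# An "AI judge": an LLM (here OpenAI's GPT-4) scores each response on a 1-5
# scale instead of a human reviewer (requires OPENAI_API_KEY).
relevance_judge = answer_relevance(model="openai:/gpt-4")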

Article Overview

This guide walks you through a complete AI evaluation workflow. You’ll build a RAG (Retrieval-Augmented Generation) system, test it with real data, and measure its performance using automated tools. For comprehensive RAG fundamentals, see our LangChain and Ollama guide.

What you’ll build:

  • RAG system: Create a question-answering system using Ollama’s Llama3.2
  • Test dataset: Design evaluation data that reveals system strengths and weaknesses
  • Automated evaluation: Use OpenAI-powered metrics to score response quality
  • MLflow interface: Track experiments and visualize results in an interactive dashboard
  • Results analysis: Interpret scores and identify areas for improvement

Quick Setup

Installation

Start by installing the necessary packages for this guide.

pip install 'mlflow>=3.0.0rc0' langchain-ollama pandas
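
To confirm the install, you can print the MLflow version (it should be 3.0.0rc0 or newer):

python -c "import mlflow; print(mlflow.__version__)"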

Environment Configuration

We’ll use Ollama to run Llama3.2 locally for our RAG system. Ollama lets you download and run AI models on your computer, keeping your question-answering data private while eliminating API costs.

Ensure you have Ollama installed locally and the Llama3.2 model downloaded.

# Install Ollama (if not already installed)
# Visit https://ollama.ai for installation instructions

# Pull the Llama3.2 model
ollama pull llama3.2
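
Before moving on, it helps to confirm the local model responds. The quick check below assumes the Ollama server is running on its default port:

from langchain_ollama import ChatOllama

# Quick sanity check that the local llama3.2 model answers requests
llm = ChatOllama(model="llama3.2", temperature=0)
print(llm.invoke("Reply with the single word: ready").content)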

Importing Libraries

Import the necessary libraries for our RAG system and MLflow evaluation.

import os
import pandas as pd
import mlflow
from mlflow.metrics.genai import faithfulness, answer_relevance, make_genai_metric
from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Note: Ensure Ollama is installed and llama3.2 model is available
# Run: ollama pull llama3.2

RAG System with Ollama Llama3.2

We’ll create a real RAG (Retrieval-Augmented Generation) system using Ollama’s Llama3.2 model that retrieves context and generates answers.

This function creates a question-answering system that:

  • Takes a question and available documents as input
  • Uses the first two documents as the retrieved context
  • Generates answers using the Llama3.2 model
  • Returns both the answer and the sources used

def ollama_rag_system(question, context_docs):
    """Real RAG system using Ollama Llama3.2"""
    # Take the first two documents as the retrieved context (simplified retrieval)
    retrieved_context = "\n".join(context_docs[:2])

    # Create prompt template
    prompt = ChatPromptTemplate.from_template(
        """Answer the question based on the provided context.
        Be concise and accurate.

        Context: {context}
        Question: {question}

        Answer:"""
    )

    # Initialize Llama3.2 model
    llm = ChatOllama(model="llama3.2", temperature=0)

    # Create chain and get response
    chain = prompt | llm | StrOutputParser()
    answer = chain.invoke({"context": retrieved_context, "question": question})

    return {
        "answer": answer,
        "retrieved_context": retrieved_context,
        "retrieved_docs": context_docs[:2],
    }
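
To sanity-check the function before building the evaluation dataset, you can call it directly with a question and a small list of documents (the documents below are just placeholders):

# Quick manual test of the RAG function
sample_docs = [
    "MLflow is an open-source platform for managing the ML lifecycle.",
    "RAG systems combine retrieval and generation to answer questions.",
]
result = ollama_rag_system("What is MLflow?", sample_docs)
print(result["answer"])
print(result["retrieved_docs"])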

Evaluation Dataset

An evaluation dataset helps you measure system quality systematically. It reveals how well your RAG system handles different question types and identifies areas for improvement.

To create an evaluation dataset, start with a knowledge base of documents that answer questions. Build the dataset with questions, expected answers, and context from this knowledge base.

knowledge_base = [
    "MLflow is an open-source platform for managing the end-to-end machine learning lifecycle. It provides experiment tracking, model packaging, versioning, and deployment capabilities.",
    "RAG systems combine retrieval and generation to provide accurate, contextual responses. They first retrieve relevant documents then generate answers.",
    "Vector databases store document embeddings for efficient similarity search. They enable fast retrieval of relevant information."
]

eval_data = pd.DataFrame({
    "question": [
        "What is MLflow?",
        "How does RAG work?",
        "What are vector databases used for?"
    ],
    "expected_answer": [
        "MLflow is an open-source platform for managing machine learning workflows",
        "RAG combines retrieval and generation for contextual responses",
        "Vector databases store embeddings for similarity search"
    ],
    "context": [
        knowledge_base[0],
        knowledge_base[1],
        knowledge_base[2]
    ]
})

eval_data

Index  Question                             Expected Answer                         Context
0      What is MLflow?                      Open-source ML workflow platform        MLflow manages ML lifecycles with tracking, packaging…
1      How does RAG work?                   Combines retrieval and generation       RAG systems retrieve documents then generate answers…
2      What are vector databases used for?  Store embeddings for similarity search  Vector databases enable fast retrieval of information…

Generate answers for each question using the RAG system. This creates the responses we’ll evaluate for quality and accuracy.

# Generate answers for evaluation
def generate_answers(row):
    result = ollama_rag_system(row['question'], [row['context']])
    return result['answer']

eval_data['generated_answer'] = eval_data.apply(generate_answers, axis=1)

Print the first row to see the question, context, and generated answer.

# Display the first row to see question, context, and answer
print(f"Question: {eval_data.iloc[0]['question']}")
print(f"Context: {eval_data.iloc[0]['context']}")
print(f"Generated Answer: {eval_data.iloc[0]['generated_answer']}")

The output displays three key components:

  • The question shows what we asked.
  • The context shows which documents the system used to generate the answer.
  • The answer contains the RAG system’s response.

Question: What is MLflow?
Context: MLflow is an open-source platform for managing the end-to-end machine learning lifecycle. It provides experiment tracking, model packaging, versioning, and deployment capabilities.
Generated Answer: MLflow is an open-source platform for managing the end-to-end machine learning lifecycle, providing features such as experiment tracking, model packaging, versioning, and deployment capabilities.

Core RAG Metrics

Faithfulness Evaluation

Faithfulness measures whether the generated answer stays true to the retrieved context, preventing hallucination:

In the code below, we define the function evaluate_faithfulness that:

  • Creates an AI judge using GPT-4 to evaluate faithfulness.
  • Takes the generated answer, question, and context as input.
  • Returns a score from 1-5, where 5 indicates perfect faithfulness.

We then apply this function to the evaluation dataset to get the faithfulness score for each question.

# Evaluate faithfulness for each answer
def evaluate_faithfulness(row):
    # Initialize faithfulness metric with OpenAI GPT-4 as judge (requires OPENAI_API_KEY)
    faithfulness_metric = faithfulness(model="openai:/gpt-4")
    score = faithfulness_metric(
        predictions=[row['generated_answer']],
        inputs=[row['question']],
        context=[row['context']],
    )
    return score.scores[0]

eval_data['faithfulness_score'] = eval_data.apply(evaluate_faithfulness, axis=1)
print("Faithfulness Evaluation Results:")
print(eval_data[['question', 'faithfulness_score']])

Faithfulness Evaluation Results:

Question Faithfulness Score
What is MLflow? 5
How does RAG work? 5
What are vector databases used for? 5

Perfect scores of 5 show the RAG system answers remain faithful to the source material. No hallucination or unsupported claims were detected.

Answer Relevance Evaluation

Answer relevance measures whether the response actually addresses the question asked:

# Evaluate answer relevance
def evaluate_relevance(row):
    # Initialize answer relevance metric with OpenAI GPT-4 as judge (requires OPENAI_API_KEY)
    relevance_metric = answer_relevance(model="openai:/gpt-4")
    score = relevance_metric(
        predictions=[row['generated_answer']],
        inputs=[row['question']]
    )
    return score.scores[0]

eval_data['relevance_score'] = eval_data.apply(evaluate_relevance, axis=1)
print("Answer Relevance Results:")
print(eval_data[['question', 'relevance_score']])

Answer Relevance Results:

Question Relevance Score
What is MLflow? 5
How does RAG work? 5
What are vector databases used for? 5

Perfect scores of 5 show the RAG system’s responses directly address the questions asked. No irrelevant or off-topic answers were generated.
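
The built-in metrics cover faithfulness and relevance, but the make_genai_metric helper imported earlier lets you define your own judged criteria. Below is a minimal sketch of a hypothetical “conciseness” metric; the definition and grading prompt are illustrative, and the resulting metric can be passed to mlflow.evaluate via extra_metrics just like the built-in ones.

# A hypothetical custom judge metric (illustrative definition and prompt)
conciseness = make_genai_metric(
    name="conciseness",
    definition=(
        "Conciseness measures whether the answer conveys the required "
        "information without unnecessary detail."
    ),
    grading_prompt=(
        "Give 1 if the answer is extremely verbose, 3 if it is somewhat "
        "wordy, and 5 if it is brief and directly to the point."
    ),
    model="openai:/gpt-4",
    greater_is_better=True,
)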

Running and Interpreting Results

We’ll now combine individual metrics into a comprehensive MLflow evaluation. This creates detailed reports, tracks experiments, and enables result comparison. Finally, we’ll analyze the scores to identify areas for improvement.

Comprehensive Evaluation with MLflow

Start by using MLflow’s evaluation framework to run all metrics together.

The following code:

  • Defines a model function that MLflow can evaluate systematically
  • Takes a DataFrame of questions and processes them through the RAG system
  • Converts results to a list format required by MLflow
  • Combines all metrics into a single evaluation run for comprehensive reporting

# Prepare data for MLflow evaluation
def rag_model_function(input_df):
    """Model function for MLflow evaluation"""
    def process_row(row):
        result = ollama_rag_system(row["question"], [row["context"]])
        return result["answer"]

    return input_df.apply(process_row, axis=1).tolist()


# Define the judge metrics used in this evaluation (requires OPENAI_API_KEY)
faithfulness_metric = faithfulness(model="openai:/gpt-4")
relevance_metric = answer_relevance(model="openai:/gpt-4")

# Run comprehensive evaluation
with mlflow.start_run() as run:
    evaluation_results = mlflow.evaluate(
        model=rag_model_function,
        data=eval_data[
            ["question", "context", "expected_answer"]
        ],  # Include expected_answer column
        targets="expected_answer",
        extra_metrics=[faithfulness_metric, relevance_metric],
        evaluator_config={
            "col_mapping": {
                "inputs": "question",
                "context": "context",
                "predictions": "predictions",
                "targets": "expected_answer",
            }
        },
    )

After running the code, the evaluation results get stored in MLflow’s tracking system. You can now compare different runs and analyze performance metrics through the dashboard.
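
Tip: if you want evaluation runs grouped under a named experiment in the UI, call mlflow.set_experiment before mlflow.start_run (the experiment name below is just an example):

# Optional: group evaluation runs under a named experiment
mlflow.set_experiment("rag-evaluation")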

Viewing Results in MLflow Dashboard

Launch the MLflow UI to explore evaluation results interactively:

mlflow ui

Navigate to http://localhost:5000 to access the dashboard.

The MLflow dashboard shows the Experiments table with two evaluation runs. Each run displays the run name (like “bold-slug-816”), creation time, dataset information, and duration. You can select runs to compare their performance metrics.

MLflow evaluation comparison interface showing side-by-side model performance metrics with scores for faithfulness and relevance

Click on any experiment to see the details of the evaluation. When you scroll down to the Metrics section, you will see detailed evaluation metrics including faithfulness and relevance scores for each question.

MLflow metrics dashboard displaying aggregated evaluation scores with average faithfulness and answer relevance ratings

Clicking on “Traces” will show you the detailed request-response pairs for each evaluation question for debugging and analysis.

MLflow traces view showing detailed execution logs and evaluation chain for RAG model responses

Clicking on “Artifacts” reveals the evaluation results table containing the complete evaluation data, metric scores, and a downloadable format for external analysis.

MLflow artifacts panel displaying saved model outputs, evaluation datasets, and generated reports

Interpreting the Results

Raw scores need interpretation to drive improvements. Use MLflow’s evaluation data to identify specific areas for enhancement.

The analysis:

  • Extracts performance metrics from comprehensive evaluation results
  • Calculates mean scores across all questions for both metrics
  • Identifies underperforming questions that require attention
  • Generates targeted feedback for systematic improvement

def interpret_evaluation_results(evaluation_results):
    """Analyze MLflow evaluation results"""

    # Extract metrics and data
    metrics = evaluation_results.metrics
    eval_table = evaluation_results.tables['eval_results_table']

    # Overall performance
    avg_faithfulness = metrics.get('faithfulness/v1/mean', 0)
    avg_relevance = metrics.get('answer_relevance/v1/mean', 0)

    print("Average Scores:")
    print(f"Faithfulness: {avg_faithfulness:.2f}")
    print(f"Answer Relevance: {avg_relevance:.2f}")

    # Identify problematic questions
    low_performing = eval_table[
        (eval_table['faithfulness/v1/score'] < 3) |
        (eval_table['answer_relevance/v1/score'] < 3)
    ]

    if not low_performing.empty:
        print(f"\nQuestions needing improvement: {len(low_performing)}")
        for _, row in low_performing.iterrows():
            print(f"- {row['inputs']}")
    else:
        print("\nAll questions performing well!")

# Usage
interpret_evaluation_results(evaluation_results)

Average Scores:
Faithfulness: 5.00
Answer Relevance: 5.00

All questions performing well!

Perfect scores indicate the RAG system generates accurate, contextual responses without hallucination. This baseline establishes a benchmark for future system modifications and more complex evaluation datasets.

Next Steps

This evaluation framework provides the foundation for systematically improving your RAG system:

  1. Regular Evaluation: Run these metrics on your test dataset with each system change
  2. Threshold Setting: Establish minimum acceptable scores for each metric based on your requirements
  3. Automated Monitoring: Integrate these evaluations into your CI/CD pipeline
  4. Iterative Improvement: Use the insights to guide retrieval improvements, prompt engineering, and model selection

The combination of faithfulness and answer relevance metrics gives you a clear view of your RAG system’s performance, enabling data-driven improvements and reliable quality assurance.
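
As a concrete starting point for steps 2 and 3 above, a CI job could fail whenever the average judge scores drop below a minimum. The sketch below reuses the evaluation_results object returned by mlflow.evaluate; the threshold values are illustrative:

# Quality gate for CI: fail the job if average scores fall below thresholds
MIN_FAITHFULNESS = 4.0  # illustrative threshold
MIN_RELEVANCE = 4.0     # illustrative threshold

metrics = evaluation_results.metrics
avg_faithfulness = metrics.get("faithfulness/v1/mean", 0)
avg_relevance = metrics.get("answer_relevance/v1/mean", 0)

if avg_faithfulness < MIN_FAITHFULNESS or avg_relevance < MIN_RELEVANCE:
    raise SystemExit(
        f"Quality gate failed: faithfulness={avg_faithfulness:.2f}, "
        f"relevance={avg_relevance:.2f}"
    )
print("Quality gate passed.")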
