
Run Private AI Workflows with LangChain and Ollama


Why Local AI Matters

AI models are changing data science projects by automating feature engineering, summarizing datasets, generating reports, and even writing code to examine or clean data.

However, using popular APIs like OpenAI or Anthropic can introduce serious privacy risks, especially when handling regulated data such as medical records, legal documents, or internal company knowledge. These services transmit user inputs to remote servers, making it difficult to guarantee confidentiality or data residency compliance.

When data privacy is important, running models locally ensures full control. Nothing leaves your machine, so you manage all inputs, outputs, and processing securely.

That’s where LangChain and Ollama come in. LangChain provides the framework to build AI applications. Ollama lets you run open-source models locally. This guide shows you how to combine both tools to create privacy-preserving AI workflows that process sensitive data exclusively on your own machine.

The source code of this article can be found here:

Introduction to Ollama and LangChain

Before diving into integration steps, let’s understand both tools.

What is Ollama?

Ollama is an open-source tool that makes it easy to run large language models locally. It offers a simple CLI and REST API for downloading and interacting with popular models like Llama, Mistral, DeepSeek, and Gemma—no complex setup required.

Since Ollama doesn’t depend on external APIs, it is ideal for sensitive data or limited-connectivity environments.

What is LangChain?

LangChain is a framework for creating AI applications using language models.

Rather than writing custom code for model interactions, response handling, and error management, you can use LangChain’s ready-made components to build applications, which saves time and reduces boilerplate.

Now that we understand the core technology, let’s see how to integrate LangChain with Ollama to run models locally.

Installation and Setup

Installing Libraries

To run local AI models, start by installing the LangChain packages, including the Ollama integration:

pip install langchain langchain-community langchain-ollama

Ollama needs to be installed separately since it’s a standalone service that runs locally:

  • For macOS: Download from ollama.com
  • For Linux: curl -fsSL https://ollama.com/install.sh | sh
  • For Windows: Download Windows (Preview) from ollama.com

Start the Ollama server:

ollama serve

The server will run in the background, handling model loading and inference requests.
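If you want to confirm that the server is reachable before moving on, you can query its REST API directly. The snippet below is a minimal sketch that assumes the default address http://localhost:11434 and uses the /api/tags endpoint, which lists the models already pulled to your machine:

import json
from urllib.request import urlopen

# Ask the local Ollama server which models are already available
# (assumes the default address http://localhost:11434)
with urlopen("http://localhost:11434/api/tags") as response:
    local_models = json.load(response)

for model in local_models.get("models", []):
    print(model["name"])

If the request fails with a connection error, the server isn’t running yet.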

Pulling Models with Ollama

Before using any model with LangChain, you need to pull it to your local machine with Ollama. Here, we will use Qwen2.5:0.5b, a small, efficient language model well-suited for fast local inference and prototyping:

ollama pull qwen2.5:0.5b

Once the download finishes, you can start an interactive session with the model using the following command:

ollama run qwen2.5:0.5b

For a full list of models you can serve locally, check out the Ollama model library. Before pulling a model that may be too large for your hardware, check out the VRAM calculator, which tells you whether you can run a specific model on your machine:

Basic Chat Integration

Once you have a model downloaded, you need to connect LangChain to Ollama for actual AI interactions. LangChain uses dedicated classes that handle the communication between your Python code and the Ollama service:

from langchain_core.messages import HumanMessage, SystemMessage
from langchain_ollama import ChatOllama

# Initialize the chat model with specific configurations
chat_model = ChatOllama(
    model="qwen2.5:0.5b",
    temperature=0.3,  # Lower temperature for more deterministic outputs
    base_url="http://localhost:11434",
)

# Define a prompt for generating a basic function in a data science project
messages = [
    SystemMessage(
        content="You are a data scientist who writes efficient Python code."
    ),
    HumanMessage(
        content=(
            "Given a DataFrame with columns 'product', 'year', and 'sales', "
            "write a function that calculates the total sales for each product over the specified years."
        )
    ),
]

# Invoke the model and print the generated function
response = chat_model.invoke(messages)
print(response.content)

Output:

```python
import pandas as pd

def calculate_total_sales(df):
    """
    This function calculates the total sales for each product over the specified years.
    
    Parameters:
    - df: A pandas DataFrame containing columns 'product', 'year', and 'sales'.
    
    Returns:
    - A pandas DataFrame with the same structure as input, but with a new column 'total_sales' 
      representing the sum of 'sales' for each product over their respective years.
    """
    # Calculate total sales for each product
    df['total_sales'] = df.groupby('product')['sales'].sum()
    
    return df
```

Under the hood, ChatOllama:

  1. Converts LangChain message objects into Ollama API format
  2. Makes HTTP POST requests to the /api/chat endpoint
  3. Handles streaming responses when streaming is enabled
  4. Parses the response back into LangChain message objects
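To make step 2 in the list above concrete, here is roughly the request ChatOllama issues on your behalf. This is a minimal sketch using only the standard library, assuming the default server address and a non-streaming call:

import json
from urllib.request import Request, urlopen

# Roughly the payload ChatOllama builds from the message objects
payload = {
    "model": "qwen2.5:0.5b",
    "messages": [
        {"role": "system", "content": "You are a data scientist who writes efficient Python code."},
        {"role": "user", "content": "Write a one-line docstring for a data-cleaning function."},
    ],
    "stream": False,
}

request = Request(
    "http://localhost:11434/api/chat",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urlopen(request) as response:
    result = json.load(response)

# The assistant's reply is nested under the "message" key
print(result["message"]["content"])

Using ChatOllama instead of raw requests means you don’t have to maintain this plumbing yourself.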

The ChatOllama class also supports asynchronous operations, allowing data scientists to run multiple model calls in parallel—ideal for building responsive, non-blocking applications like dashboards or chat interfaces:

async def generate_async():
    response = await chat_model.ainvoke(messages)
    return response.content

# In async context
result = await generate_async()
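Building on that, here is a minimal sketch of running several prompts concurrently with asyncio.gather. It reuses the chat_model defined earlier and assumes it runs as a regular script (in a notebook, you would await generate_all(prompts) directly instead of calling asyncio.run):

import asyncio

prompts = [
    "Explain a rolling average in one sentence.",
    "Explain one-hot encoding in one sentence.",
    "Explain train/test splitting in one sentence.",
]

async def generate_all(prompts):
    # Send all requests concurrently instead of one after another
    tasks = [chat_model.ainvoke(prompt) for prompt in prompts]
    responses = await asyncio.gather(*tasks)
    return [response.content for response in responses]

results = asyncio.run(generate_all(prompts))
for result in results:
    print(result)

How much true parallelism you get depends on how the Ollama server is configured, but the calls no longer block each other in your Python code.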

Using Completion Models

Chat models are great for conversation, but data science tasks like code generation, doc completion, and creative writing often benefit from text completion instead.

The OllamaLLM class supports this mode, letting the model continue from a given input:

from langchain_ollama import OllamaLLM

llm = OllamaLLM(model="qwen2.5:0.5b")
text = """
Write a function that takes a DataFrame with columns 'product', 'year', and 'sales' and calculates the total sales for each product over the specified years.

```python
def calculate_total_sales(df):
"""
completion_response = llm.invoke(text)
print(completion_response)

Output:

```python
import pandas as pd

def calculate_total_sales(df):
    """
    Calculate the total sales for each product over the specified years.
    
    Args:
    df (pd.DataFrame): DataFrame with columns 'product', 'year', and 'sales'.
    
    Returns:
    pd.DataFrame: A DataFrame containing the products, their total sales, and the average sales per year.
    """

    # Convert the 'sales' column to integer type if it's not already
    df['sales'] = df['sales'].astype(int)

    # Group by 'product', sum the 'sales' for each group
    grouped_sales = df.groupby('product')['sales'].sum().reset_index()

    return grouped_sales
```

The key differences between the ChatOllama and OllamaLLM classes:

  • OllamaLLM uses the /api/generate endpoint for text completion
  • ChatOllama uses the /api/chat endpoint for chat-style interactions
  • Completion is better for code continuation, creative writing, and single-turn prompts
  • Chat is better for multi-turn conversations and when using system prompts

For streaming responses (showing tokens as they’re generated), use llm.stream:

for chunk in llm.stream(text):
    print(chunk, end="", flush=True)

This displays output in real time, making interactive apps like chatbots feel faster and more responsive.
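The chat interface streams in the same way; the only difference is that chat chunks are message objects, so you print their content attribute. A minimal sketch reusing the chat_model and messages defined earlier:

# Stream the chat response and print tokens as they arrive
for chunk in chat_model.stream(messages):
    print(chunk.content, end="", flush=True)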

Customizing Model Parameters

Both completion and chat models use default settings that work reasonably well, but data science tasks often need more tailored model behavior. For example:

  • Scientific analysis needs precise, factual responses
  • Creative tasks benefit from more randomness and variety

Ollama offers fine-grained control over generation parameters:

llm_1 = OllamaLLM(
    model="qwen2.5:0.5b", 
    temperature=0.7, 
    repeat_penalty=1.1,
)

Details about these parameters:

  • model: Specifies the language model to use.
  • temperature: Controls randomness; lower = more focused, higher = more creative.
  • repeat_penalty: Penalizes repeated tokens to reduce redundancy. Values greater than 1.0 discourage the model from repeating itself.

Parameter recommendations:

  • For factual or technical responses: Lower temperature (0.1-0.3) and higher repeat_penalty (1.1-1.2)
  • For creative writing: Higher temperature (0.7-0.9)
  • For code generation: Medium temperature (0.3-0.6)

The model behavior changes dramatically with these settings. For example:

# Scientific writing with precise output
scientific_llm = OllamaLLM(model="qwen2.5:0.5b", temperature=0.1, repeat_penalty=1.2)

# Creative storytelling
creative_llm = OllamaLLM(model="qwen2.5:0.5b", temperature=0.9, repeat_penalty=1.0)

# Code generation
code_llm = OllamaLLM(model="codellama:7b", temperature=0.3)

Creating LangChain Chains

Create a simple chain

In a typical AI workflow, you often want to experiment with different combinations of prompts, models, and output formats. But stitching together these components leads to messy, inflexible code.

LangChain solves this by letting you chain together interchangeable components into a clean, composable pipeline:

from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

model = OllamaLLM(model="codellama:7b")
function_prompt = PromptTemplate.from_template(
    """
    Write a Python function using pandas that takes a DataFrame with columns '{date_col}', '{group_col}', and '{value_col}'.
    The function should return a new DataFrame that includes a {window}-day rolling average of {value_col} for each {group_col}.
    """
)

# Build the chain
code_chain = function_prompt | model | StrOutputParser()

# Run the chain with specific variable values
chain_response = code_chain.invoke({
    "date_col": "date",
    "group_col": "store_id",
    "value_col": "sales",
    "window": 7
})

print(chain_response)

Output:

def get_rolling_average(df):
    return df.groupby('store_id')['sales'].rolling(window=7).mean()

The code above:

  • Uses PromptTemplate to define a flexible prompt that accepts variables like date_col, group_col, value_col, and window. This makes it easy to reuse the same template for different inputs without rewriting the prompt text (a small batch example follows this list).
  • Initializes the codellama:7b model using OllamaLLM, which runs the model locally and handles inference with low latency.
  • Uses StrOutputParser to extract and clean the raw string output from the model, ensuring it returns just the generated code.
  • Uses the | operator to send the prompt output into the model (function_prompt | model) and then the model output into the parser (| StrOutputParser()).
  • Uses the invoke() method to run the entire chain with specific inputs, triggering each component in sequence and returning the final result.
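Because the prompt is parameterized, the same chain can serve other datasets by changing only the inputs. As a small illustration, LangChain runnables also expose a batch() method that runs the chain over several input dictionaries in one call (the column names below are made up for the example):

# Reuse the same chain for different datasets by changing only the inputs
requests = [
    {"date_col": "date", "group_col": "store_id", "value_col": "sales", "window": 7},
    {"date_col": "timestamp", "group_col": "sensor_id", "value_col": "temperature", "window": 30},
]

generated_functions = code_chain.batch(requests)

for inputs, generated_code in zip(requests, generated_functions):
    print(f"# Rolling {inputs['window']}-day average of {inputs['value_col']} per {inputs['group_col']}")
    print(generated_code)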

Use one chain as input to another

You can extend this pipeline by combining the code generation chain with additional runnables. For instance, by feeding the output of the code chain into another prompt, you can automatically generate unit tests for the generated function:

test_model = OllamaLLM(model="codellama:7b", temperature=0.3)

test_prompt = PromptTemplate.from_template(
    """
    Given the following Python function:
    ```python
    {code}
    ```
    Write 1–2 simple unit tests for this function using pytest.
    """
)

test_chain = (
    {"code": code_chain}
    | test_prompt
    | test_model
    | StrOutputParser()
)

# Invoke the test chain
test_response = test_chain.invoke({
    "date_col": "date",
    "group_col": "store_id",
    "value_col": "sales",
    "window": 7
})

print(test_response)

Output:

```python
import pandas as pd
from pytest import mark

@mark.parametrize('df', [pd.DataFrame({'store_id': ['A', 'B'], 'sales': [10, 20]})])
def test_get_rolling_average(df):
    result = get_rolling_average(df)
    assert result['store_id'].tolist() == ['A', 'B']
    assert result['sales'].tolist() == [15, 20]

@mark.parametrize('df', [pd.DataFrame({'store_id': ['A', 'B'], 'sales': [10, 20]})])
def test_get_rolling_average_with_window(df):
    result = get_rolling_average(df, window=5)
    assert result['store_id'].tolist() == ['A', 'B']
    assert result['sales'].tolist() == [12.5, 15]
```

The test chain:

  • Initializes a second LLM instance (test_model).
  • Uses a new PromptTemplate to request unit tests that validate the logic of the generated function.
  • Reuses the code_chain output by passing it as the {code} input into a new prompt.
  • Chains everything together using the | operator.
  • Uses invoke() to generate valid pytest test cases, automatically tailored to the generated function based on the chain’s input values.

Answer Questions with Your Data Using RAG

Have you ever been asked questions like “What preprocessing steps were used in the customer churn analysis?” or “Which machine learning models performed best for fraud detection?”—only to find yourself digging through endless documentation? Data scientists often work with large collections of research papers, project notes, and dataset descriptions, making manual search time-consuming and error-prone.

Standard language models can’t answer these domain-specific questions because they lack access to your particular data and documentation. You need a system that searches your documents and generates accurate, source-backed answers.

In this section, you’ll learn how embeddings power semantic search and how to use them in a Retrieval-Augmented Generation (RAG) pipeline to answer questions grounded in your data.

Working with Semantic Search and Embeddings

Semantic search enables more intuitive, context-aware search experiences—for example, matching the question “How do neural networks learn?” to documents that mention backpropagation.

Embeddings convert text into numerical vectors that capture semantic meaning, allowing computers to understand relationships between words and documents mathematically.

Ollama supports specialized embedding models that excel at this conversion.

First, pull the embedding model with ollama pull nomic-embed-text, then set it up and see what we’re working with:

import numpy as np
from langchain_ollama import OllamaEmbeddings

# Initialize the embedding model
embedder = OllamaEmbeddings(
    model="nomic-embed-text",  # An embedding model available through Ollama
)

The nomic-embed-text model is designed specifically for creating high-quality text embeddings.

Now let’s create an embedding for a sample query and examine its properties:

# Create embeddings for a query
example_query = "How do neural networks learn?"
example_query_embedding = embedder.embed_query(example_query)
print(f"Embedding dimension: {len(example_query_embedding)}"))

Output:

Embedding dimension: 768

The 768-dimensional vector represents our query in mathematical space. Words with similar meanings will have vectors that point in similar directions.

Next, we’ll create embeddings for multiple documents to demonstrate similarity matching:

documents = [
    "Python is a high-level programming language known for its simplicity and readability.",
    "Machine learning algorithms can automatically learn patterns from data without explicit programming.",
    "Data preprocessing involves cleaning, changing, and organizing raw data for analysis.",
    "Neural networks are computational models inspired by biological brain networks.",
]

doc_embeddings = embedder.embed_documents(documents)

print(f"Generated {len(doc_embeddings)} embeddings for the input documents.")

Output:

Generated 4 embeddings for the input documents.

The embed_documents() method processes multiple texts at once, which is more efficient than calling embed_query() repeatedly. This batch processing saves time when working with large document collections.
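If you want to see the difference on your own machine, a quick timing sketch like the one below compares the two approaches; the exact numbers depend on your hardware and the embedding model:

import time

# Embed all documents in a single batched call
start = time.perf_counter()
embedder.embed_documents(documents)
batch_seconds = time.perf_counter() - start

# Embed the documents one at a time
start = time.perf_counter()
for doc in documents:
    embedder.embed_query(doc)
loop_seconds = time.perf_counter() - start

print(f"Batched: {batch_seconds:.2f}s, one at a time: {loop_seconds:.2f}s")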

Next, let’s write a similarity_search function that takes a user query and returns the most relevant documents based on cosine similarity between embeddings:

# Calculate similarity between vectors
def compute_cosine_similarity(query_vec, document_vec):
    """Compute cosine similarity between two vectors."""
    return np.dot(query_vec, document_vec) / (
        np.linalg.norm(query_vec) * np.linalg.norm(document_vec)
    )
    
def get_most_similar_indices(similarities, num_documents):
    """Return indices of top `num_documents` highest similarity scores."""
    return np.argsort(similarities)[-num_documents:][::-1]


def similarity_search(query, documents, embedder, num_documents=2):
    """Return the most relevant documents for the given query."""
    query_embedding = embedder.embed_query(query)
    document_embeddings = embedder.embed_documents(documents)
    similarities = [
        compute_cosine_similarity(query_embedding, doc_embedding)
        for doc_embedding in document_embeddings
    ]
    top_indices = get_most_similar_indices(similarities, num_documents)
    return [documents[i] for i in top_indices]

Now let’s test the similarity_search function with a sample question to see how it retrieves the most relevant documents:

question = "What makes Python popular for data science?"

relevant_docs = similarity_search(query=question, documents=documents, embedder=embedder)

print(f"Top {len(relevant_docs)} relevant documents retrieved:\n")
for i, doc in enumerate(relevant_docs, 1):
    print(f"Document {i}:\n{doc}\n")

Output:

Top 2 relevant documents retrieved:

Document 1:
Python is a high-level programming language known for its simplicity and readability.

Document 2:
Data preprocessing involves cleaning, changing, and organizing raw data for analysis.

The output shows that the semantic search function successfully identifies the most relevant documents—first selecting a Python overview, then a related topic about data preprocessing.

Building a RAG Pipeline

Next, let’s connect the semantic search function with a language model to build a Retrieval-Augmented Generation (RAG) pipeline. This approach takes the most relevant documents retrieved by semantic search and uses them as context for generating source-backed answers.

Start with creating a prompt template that instructs the language model to base answers on the provided documents rather than general knowledge:

rag_prompt = PromptTemplate.from_template("""
    Use the following context to answer the question. If the answer isn't in the context, say so.
    Context:
    {context}
    
    Question: {question}
    
    Answer:
""")

Combine the retrieved documents into a single string so they can be passed as context to the language model:

context = "\n".join(relevant_docs)

Now, use the embeddings, prompt, and chat model together to construct a RAG system that retrieves context and generates grounded answers:

rag_chat_model = ChatOllama(model="qwen2.5:0.5b", temperature=0.3)

rag_chain = rag_prompt | rag_chat_model | StrOutputParser()

# Run the chain with specific variable values
rag_response = rag_chain.invoke({"context": context, "question": question})

print(rag_response)

Output:

Python is popular for data science due to several key reasons:

1. **Simplicity and Readability**: The language's syntax is designed to be straightforward and easy to understand, which makes it ideal for beginners and those who are just starting with programming.

2. **Ease of Use**: Python has a large standard library that includes many useful tools and packages for data analysis, machine learning, and scientific computing. This makes it accessible for users without extensive coding experience.

3. **Versatility**: Python is highly versatile, allowing developers to build applications across various domains such as web development, game development, artificial intelligence, and more.
...

The output shows Python’s simplicity, readability, and versatility—details clearly drawn from the retrieved content.

RAG systems address several common data science workflow challenges:

  • Project handoffs: New team members can query past work to understand methodologies and results
  • Literature review: Researchers can search large paper collections for relevant techniques and findings
  • Data documentation: Teams can build searchable knowledge bases about datasets, features, and processing steps
  • Reproducibility: Stakeholders can find detailed information about how analyses were conducted

RAG blends the precision of semantic search with the fluency of language generation to deliver answers grounded in your own data. Rather than relying on generic LLM responses or sifting through documents manually, you get source-backed insights tailored to your queries.

Conclusion

This tutorial demonstrated how to integrate LangChain with Ollama for local LLM execution. You learned to set up Ollama, download models, and use ChatOllama and OllamaLLM for various tasks. We also covered customizing model parameters, building LangChain chains, and working with embeddings. By running models locally, you maintain data privacy and control, which is suitable for many data science applications. For further learning, refer to the LangChain and Ollama official documentation.
