
# Run Private AI Workflows with LangChain and Ollama


## Why Local AI Matters

AI models are transforming data science projects by automating feature engineering, summarizing datasets, generating reports, and even writing code to explore or clean data.

However, using popular APIs like OpenAI or Anthropic can introduce serious privacy risks, especially when handling regulated data such as medical records, legal documents, or internal company knowledge. These services transmit user inputs to remote servers, making it difficult to guarantee confidentiality or data residency compliance.

When data privacy is important, running models locally ensures full control. Nothing leaves your machine, so you manage all inputs, outputs, and processing securely.

That’s where LangChain and Ollama come in. LangChain provides the framework to build AI applications. Ollama lets you run open-source models locally. This guide shows you how to combine both tools to create privacy-preserving AI workflows that process sensitive data exclusively on your own machine.

## Introduction to Ollama and LangChain

Before diving into integration steps, let’s understand both tools.

### What is Ollama?

Ollama is an open-source tool that makes it easy to run large language models locally. It offers a simple CLI and REST API for downloading and interacting with popular models like Llama, Mistral, DeepSeek, and Gemma—no complex setup required.

Since Ollama doesn’t depend on external APIs, it is ideal for sensitive data or limited-connectivity environments.

### What is LangChain?

LangChain is a framework for creating AI applications using language models.

Rather than writing custom code for model interactions, response handling, and error management, you can use LangChain’s ready-made components to build applications, which saves time and reduces boilerplate.

## LangChain + Ollama: Integration Tutorial

Now that we understand the core technology, let’s see how to integrate LangChain with Ollama to run models locally.

### Installation and Setup

To run local AI models, let’s install both LangChain and Ollama:

```bash
pip install langchain langchain-community langchain-ollama
```

Ollama needs to be installed separately since it’s a standalone service that runs locally:

  • For macOS: Download from ollama.com – this installs both the CLI tool and service
  • For Linux: curl -fsSL https://ollama.com/install.sh | sh – this script sets up both the binary and system service
  • For Windows: Download Windows (Preview) from ollama.com – still in preview mode with some limitations

Start the Ollama server:

```bash
ollama serve
```

The server will run in the background, handling model loading and inference requests.
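
If you want to confirm the server is reachable before wiring up LangChain, one option is to query Ollama's `/api/tags` endpoint, which lists the models already pulled to your machine. A minimal sketch, assuming the default port 11434 and the `requests` package:

```python
import requests

# Quick health check against the local Ollama server (default port 11434)
try:
    response = requests.get("http://localhost:11434/api/tags", timeout=5)
    response.raise_for_status()
    models = [m["name"] for m in response.json().get("models", [])]
    print(f"Ollama is running. Local models: {models}")
except requests.exceptions.ConnectionError:
    print("Ollama is not reachable. Start it with `ollama serve`.")
```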

### Pulling Models with Ollama

Before using any model with LangChain, you need to pull it to your local machine with Ollama:

```bash
ollama pull qwen3:0.6b
```

Once it is downloaded, you can run the model interactively with the following command:

```bash
ollama run qwen3:0.6b
```

The model size has a large impact on performance and resource requirements:

  • Smaller models (7B-8B) run well on most modern computers with 16GB+ RAM
  • Medium models (13B-34B) need more RAM or GPU acceleration
  • Large models (70B+) typically require a dedicated GPU with 24GB+ VRAM

For a full list of models you can serve locally, check out the Ollama model library. Before pulling a model and potentially wasting hardware resources, consult the VRAM calculator, which tells you whether you can run a specific model on your machine (a rough back-of-the-envelope estimate is also sketched below the figure):

*VRAM calculator showing memory requirements for different LLM models across various quantization levels.*
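
If you just want a rough number before downloading anything, a common back-of-the-envelope heuristic is parameter count × bytes per parameter (set by the quantization level), plus some headroom for the KV cache and activations. A minimal sketch of that heuristic (the 20% overhead factor is an assumption, not an exact figure):

```python
def estimate_memory_gb(params_billions: float, bits_per_param: int = 4, overhead: float = 1.2) -> float:
    """Rough memory estimate: parameters * bytes per parameter * overhead factor."""
    return params_billions * (bits_per_param / 8) * overhead

# Example: common model sizes at 4-bit quantization vs. 16-bit precision
print(f"7B  @ 4-bit : ~{estimate_memory_gb(7, 4):.1f} GB")    # ~4.2 GB
print(f"7B  @ 16-bit: ~{estimate_memory_gb(7, 16):.1f} GB")   # ~16.8 GB
print(f"70B @ 4-bit : ~{estimate_memory_gb(70, 4):.1f} GB")   # ~42.0 GB
```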

### Basic Chat Integration

Once you have a model downloaded, you need to connect LangChain to Ollama for actual AI interactions. LangChain uses dedicated classes that handle the communication between your Python code and the Ollama service:

```python
from langchain_ollama import ChatOllama
from langchain_core.messages import HumanMessage, SystemMessage

# Initialize the chat model with specific configurations
chat_model = ChatOllama(
    model="qwen3:0.6b",
    temperature=0.5,
    base_url="http://localhost:11434",  # Can be changed for remote Ollama instances
)

# Create a conversation with system and user messages
messages = [
    SystemMessage(content="You are a helpful coding assistant specialized in Python."),
    HumanMessage(content="Write a recursive Fibonacci function with memoization.")
]

# Invoke the model
response = chat_model.invoke(messages)
print(response.content[:200])
```

```plaintext
Okay, I need to write a recursive Fibonacci function with memoization. Let me think about how to approach this.

First, the recursive approach usually involves a function that calculates the …
```


This snippet:

* Imports `ChatOllama` from `langchain_ollama` to interface with a local or remote Ollama model
* Imports `HumanMessage` and `SystemMessage` for structured message creation
* Initializes the `ChatOllama` instance with the model name, temperature, and base URL
* Constructs a conversation consisting of a `SystemMessage` and a `HumanMessage`
* Sends the message list to the model using `invoke()`

Under the hood, `ChatOllama`:

1. Converts LangChain message objects into Ollama API format
2. Makes HTTP POST requests to the `/api/chat` endpoint
3. Processes streaming responses when activated
4. Parses the response back into LangChain message objects
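
To make that concrete, here is roughly the request `ChatOllama` issues for the conversation above, written directly against the REST API with `requests` (a sketch, assuming the default local server and non-streaming mode):

```python
import requests

# Roughly what ChatOllama sends to Ollama's /api/chat endpoint under the hood
payload = {
    "model": "qwen3:0.6b",
    "messages": [
        {"role": "system", "content": "You are a helpful coding assistant specialized in Python."},
        {"role": "user", "content": "Write a recursive Fibonacci function with memoization."},
    ],
    "stream": False,
    "options": {"temperature": 0.5},
}

response = requests.post("http://localhost:11434/api/chat", json=payload)
print(response.json()["message"]["content"][:200])
```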

The `ChatOllama` class also supports asynchronous operations, allowing data scientists to run multiple model calls in parallel—ideal for building responsive, non-blocking applications like dashboards or chat interfaces:

```python
async def generate_async():
    response = await chat_model.ainvoke(messages)
    return response.content

# In async context
result = await generate_async()
print(result[:200])
```

```plaintext
Okay, I need to write a recursive Fibonacci function with memoization. Let me think about how to approach this.

First, the Fibonacci sequence is defined such that each number is the sum of …
```

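To actually run several prompts concurrently, combine `ainvoke()` with `asyncio.gather`. A short sketch reusing the `chat_model` defined above (the questions are arbitrary placeholders):

```python
import asyncio
from langchain_core.messages import HumanMessage

async def ask_many(questions):
    # Send all requests concurrently and wait for every response
    tasks = [chat_model.ainvoke([HumanMessage(content=q)]) for q in questions]
    responses = await asyncio.gather(*tasks)
    return [r.content for r in responses]

# In a notebook, `await ask_many([...])` directly instead of asyncio.run()
answers = asyncio.run(ask_many([
    "What is a Python decorator?",
    "Explain list comprehensions in one sentence.",
    "What does the walrus operator do?",
]))
print(answers[0][:100])
```

Keep in mind that overall throughput still depends on how many requests your local Ollama server is configured to handle in parallel.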

### Using Completion Models {#using-completion-models}

Chat models are great for conversation, but data science tasks like code generation, doc completion, and creative writing often benefit from text completion instead. 

The OllamaLLM class supports this mode, letting the model continue from a given input:

```python
from langchain_ollama import OllamaLLM

# Initialize the LLM with specific options
llm = OllamaLLM(
    model="qwen3:0.6b",
)

# Generate text from a prompt
text = """
Write a quick sort algorithm in Python with detailed comments:
```python
def quicksort(
"""

response = llm.invoke(text)
print(response[:500])
```

```plaintext
Okay, I need to write a quicksort algorithm in Python with detailed comments. Let me start by recalling how quicksort works. The basic idea is to choose a pivot element, partition the array into elements less than the pivot and greater than it, and then recursively sort each partition. The pivot can be chosen in different ways, like the first element, middle element, or random element.

First, I should define the function signature. The parameters are the array, and maybe a left and …
```


The key differences between the `ChatOllama` and `OllamaLLM` classes:

- `OllamaLLM` uses the `/api/generate` endpoint for text completion
- `ChatOllama` uses the `/api/chat` endpoint for chat-style interactions
- Completion is better for code continuation, creative writing, and single-turn prompts
- Chat is better for multi-turn conversations and when using system prompts

For streaming responses (showing tokens as they're generated):

```python
for chunk in llm.stream("Explain quantum computing in three sentences:"):
    print(chunk, end="", flush=True)
```

```plaintext
Okay, the user wants me to explain quantum computing in three sentences. Let me start by recalling what I know. Quantum computing uses qubits instead of classical bits. So first sentence should mention qubits and the difference from classical bits. Maybe say “Quantum computing uses qubits, which can exist in multiple states at once, unlike classical bits that are either 0 or 1.”
```


Use streaming responses to display output in real time, making interactive apps like chatbots feel faster and more responsive.
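
Chat models stream the same way; with `ChatOllama` each chunk is a message chunk, so you read its `.content` attribute. A short sketch reusing the `chat_model` from earlier:

```python
# Stream a chat response token by token
for chunk in chat_model.stream([HumanMessage(content="Explain overfitting in two sentences.")]):
    print(chunk.content, end="", flush=True)
```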

### Customizing Model Parameters {#customizing-model-parameters}

Both completion and chat models use default settings that work reasonably well, but data science tasks often need more tailored model behavior. For example:

- Scientific analysis needs precise, factual responses
- Creative tasks benefit from more randomness and variety 

Ollama offers fine-grained control over generation parameters:

```python
llm = OllamaLLM(
    model="qwen3:0.6b", # Example model, can be any model supported by Ollama
    temperature=0.7,      # Controls randomness (0.0 = deterministic, 1.0 = creative)
    stop=["```", "###"],  # Stop sequences to end generation
    repeat_penalty=1.1,   # Penalizes repetition (>1.0 reduces repetition)
)
```

Details about these parameters:

  • model: Specifies the language model to use.
  • temperature: Controls randomness; lower = more focused, higher = more creative.
  • stop: Defines stop sequences that terminate generation early. Once one of these sequences is produced, the model stops generating further tokens.
  • repeat_penalty: Penalizes repeated tokens to reduce redundancy. Values greater than 1.0 discourage the model from repeating itself.

Parameter recommendations:

  • For factual or technical responses: Lower temperature (0.1-0.3) and higher repeat_penalty (1.1-1.2)
  • For creative writing: Higher temperature (0.7-0.9)
  • For code generation: Medium temperature (0.3-0.6) with specific stop sequences such as `` ``` ``

The model behavior changes dramatically with these settings. For example:

```python
# Scientific writing with precise output
scientific_llm = OllamaLLM(model="qwen3:0.6b", temperature=0.1, repeat_penalty=1.2)

# Creative storytelling
creative_llm = OllamaLLM(model="qwen3:0.6b", temperature=0.9, repeat_penalty=1.0)

# Code generation
code_llm = OllamaLLM(model="codellama", temperature=0.3, stop=["```", "def "])
```

### Creating LangChain Chains

AI workflows often involve multiple steps: data validation, prompt formatting, model inference, and output processing. Running these steps manually for each request becomes repetitive and error-prone.

LangChain addresses this by chaining steps into sequences to create end-to-end applications:

```python
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
import json

# Create a structured prompt template
prompt = PromptTemplate.from_template("""
You are an expert educator.
Explain the following concept in simple terms that a beginner would understand.
Make sure to provide:
1. A clear definition
2. A real-world analogy
3. A practical example

Concept: {concept}
""")

First, we import the required LangChain components and create a prompt template. The `PromptTemplate.from_template()` method creates a reusable template with placeholder variables (like `{concept}`) that get filled in at runtime.
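
To see exactly what the model will receive, you can render the template yourself; `format()` returns the final prompt string with the placeholder filled in:

```python
# Inspect the rendered prompt before sending it to the model
print(prompt.format(concept="Recursive neural networks"))
```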

```python
# Create a parser that extracts structured data
class JsonOutputParser:
    def parse(self, text):
        try:
            # Find JSON blocks in the text
            if "```json" in text and "```" in text.split("```json")[1]:
                json_str = text.split("```json")[1].split("```")[0].strip()
                return json.loads(json_str)
            # Try to parse the whole text as JSON
            return json.loads(text)
        except json.JSONDecodeError:
            # Fall back to returning the raw text
            return {"raw_output": text}

# Initialize a model instance to be used in the chain
llm = OllamaLLM(model="qwen3:0.6b")
```

Next, we define a custom output parser. This class attempts to extract JSON from the model’s response, handling both code-block format and raw JSON. If parsing fails, it returns the original text wrapped in a dictionary.
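
A quick check of the parser on its own shows the two paths it handles, a fenced JSON block and the raw-text fallback:

```python
parser = JsonOutputParser()

# A response wrapped in a ```json code block is parsed into a dict
print(parser.parse('```json\n{"definition": "a toy example"}\n```'))
# -> {'definition': 'a toy example'}

# Anything that isn't valid JSON falls back to the raw-text wrapper
print(parser.parse("The model answered in prose instead."))
# -> {'raw_output': 'The model answered in prose instead.'}
```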

```python
# Build a more complex chain
chain = (
    {"concept": RunnablePassthrough()} 
    | prompt 
    | llm 
    | StrOutputParser()
)

# Execute the chain with detailed tracking
result = chain.invoke("Recursive neural networks")
print(result[:500])
```

Finally, we build the chain using LangChain’s pipe operator (`|`). The `RunnablePassthrough()` passes input directly to the prompt template, which formats it and sends it to the LLM. The `StrOutputParser()` converts the response to a string. Here is the output:

```output
Okay, so the user is asking for a simple explanation of recursive neural networks. Let me start by breaking down the concept. First, I need to define it clearly. Recursive neural networks… Hmm, I remember they’re a type of neural network that can process data in multiple steps. Wait, maybe I should explain it as networks that can be broken down into smaller parts. Like, they can have multiple layers or multiple levels of processing.

Now, the user wants a real-world analogy…
```


The chain architecture allows you to:

1. Pre-process inputs before sending to the model
2. Transform model outputs into structured data
3. Chain multiple models together
4. Add memory and context management
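
The same chain also scales to many inputs at once: every LangChain Runnable exposes `batch()`, which maps the chain across a list. A small sketch reusing the `chain` defined above (the concepts are arbitrary examples):

```python
# Run the explain-a-concept chain over several inputs in one call
concepts = ["Gradient descent", "Cross-validation", "Word embeddings"]
explanations = chain.batch(concepts)

for concept, explanation in zip(concepts, explanations):
    print(f"--- {concept} ---")
    print(explanation[:150])
```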

### Working with Embeddings {#working-with-embeddings}

Embeddings transform text into numerical vectors that capture semantic meaning, allowing computers to understand relationships between words and documents mathematically. Ollama supports specialized embedding models that excel at this conversion.

First, let's set up the embedding model and understand what we're working with:

```python
from langchain_ollama import OllamaEmbeddings
import numpy as np

# Initialize embeddings model with specific parameters
embeddings = OllamaEmbeddings(
    model="nomic-embed-text",  # Specialized embedding model that is also supported by Ollama
    base_url="http://localhost:11434",
)
```

The nomic-embed-text model is designed specifically for creating high-quality text embeddings. Unlike general language models that generate text, embedding models focus solely on converting text into meaningful vector representations.

Now let’s create an embedding for a sample query and examine its properties:

```python
# Create embeddings for a query
query = "How do neural networks learn?"
query_embedding = embeddings.embed_query(query)
print(f"Embedding dimension: {len(query_embedding)}")
```

```plaintext
Embedding dimension: 768
```

The 768-dimensional vector represents our query in mathematical space. Each dimension captures different semantic features – some might relate to technical concepts, others to question patterns, and so on. Words with similar meanings will have vectors that point in similar directions.

*Diagram: text queries converted to vector embeddings, with similar concepts clustering together in vector space.*

Next, we’ll create embeddings for multiple documents to demonstrate similarity matching:

```python
# Create embeddings for multiple documents
documents = [
    "Neural networks learn through backpropagation",
    "Transformers use attention mechanisms",
    "LLMs are trained on text data"
]

doc_embeddings = embeddings.embed_documents(documents)
```

The `embed_documents()` method processes multiple texts at once, which is more efficient than calling `embed_query()` repeatedly. This batch processing saves time when working with large document collections.

To find which document best matches our query, we need to measure similarity between vectors. Cosine similarity is the standard approach:

```python
# Calculate similarity between vectors
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Find most similar document to query
similarities = [cosine_similarity(query_embedding, doc_emb) for doc_emb in doc_embeddings]
most_similar_idx = np.argmax(similarities)
print(f"Most similar document: {documents[most_similar_idx]}")
print(f"Similarity score: {similarities[most_similar_idx]:.3f}")
Most similar document: Neural networks learn through backpropagation
Similarity score: 0.847

Cosine similarity returns values between -1 and 1, where 1 means identical meaning, 0 means unrelated, and -1 means opposite meanings. Our score of 0.847 indicates strong semantic similarity between the query about neural network learning and the document about backpropagation.

These embeddings support several data science applications:

  1. Semantic search: Find documents by meaning rather than exact keyword matches
  2. Document clustering: Group related research papers, reports, or code documentation (see the sketch after this list)
  3. Retrieval-Augmented Generation (RAG): Retrieve relevant context before generating responses
  4. Anomaly detection: Identify unusual or outlier documents in large collections
  5. Content recommendation: Suggest similar articles, datasets, or code examples
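
As a quick illustration of the clustering use case, here is a minimal sketch that groups the document vectors with scikit-learn's KMeans (scikit-learn is an extra dependency, and the three sample documents above are far too few for meaningful clusters, so treat this purely as a shape-of-the-code example):

```python
import numpy as np
from sklearn.cluster import KMeans

# Group the document embeddings into two clusters
X = np.array(doc_embeddings)
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(X)

for doc, label in zip(documents, labels):
    print(f"cluster {label}: {doc}")
```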

When choosing embedding models for your projects, consider these factors:

  • Dimension size: Larger dimensions (1024+) capture more nuance but require more storage and computation
  • Domain specialization: Some models work better for scientific text, others for general content
  • Processing speed: Smaller models like nomic-embed-text balance quality with performance
  • Language support: Multilingual models handle multiple languages but may sacrifice quality for any single language

The quality of your embeddings directly impacts downstream tasks like search relevance and clustering accuracy. Always test different models with your specific data to find the best fit.

## Building a Question-Answering System for Your Data

Data scientists work with extensive collections of research papers, project documentation, and dataset descriptions. When stakeholders ask questions like “What preprocessing steps were used in the customer churn analysis?” or “Which machine learning models performed best for fraud detection?”, manual document search becomes time-consuming and error-prone.

Standard language models can’t answer these domain-specific questions because they lack access to your particular data and documentation. You need a system that searches your documents and generates accurate, source-backed answers.

Retrieval-Augmented Generation (RAG) solves this problem by combining the semantic search capabilities we just built with text generation. RAG retrieves relevant information from your documents and uses it to answer questions with proper attribution.

Here’s how to build a RAG system using the embeddings and chat models we’ve already configured:

```python
from langchain_ollama import OllamaEmbeddings, ChatOllama
from langchain_core.documents import Document
from langchain_core.prompts import PromptTemplate
import numpy as np

# Initialize components
embeddings = OllamaEmbeddings(model="nomic-embed-text")
chat_model = ChatOllama(model="qwen3:0.6b", temperature=0.3)

# Sample knowledge base representing project documentation
documents = [
    Document(page_content="Python is a high-level programming language known for its simplicity and readability."),
    Document(page_content="Machine learning algorithms can automatically learn patterns from data without explicit programming."),
    Document(page_content="Data preprocessing involves cleaning, changing, and organizing raw data for analysis."),
    Document(page_content="Neural networks are computational models inspired by biological brain networks."),
]

# Create embeddings for all documents
doc_embeddings = embeddings.embed_documents([doc.page_content for doc in documents])
```

This setup creates a searchable knowledge base from your documents. In production systems, these documents would contain sections from research papers, methodology descriptions, data analysis reports, or code documentation. The embeddings convert each document into vectors that support semantic search.
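
In a real project those documents are usually much longer than one sentence, so you would typically split them into chunks before embedding. A sketch using LangChain's `RecursiveCharacterTextSplitter` from the `langchain-text-splitters` package (the chunk sizes are arbitrary starting points):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Split long documents into overlapping chunks before embedding them
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(documents)

chunk_embeddings = embeddings.embed_documents([chunk.page_content for chunk in chunks])
print(f"{len(documents)} documents -> {len(chunks)} chunks")
```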

```python
def similarity_search(query, top_k=2):
    """Find the most relevant documents for a query"""
    query_embedding = embeddings.embed_query(query)

    # Calculate cosine similarities
    similarities = []
    for doc_emb in doc_embeddings:
        similarity = np.dot(query_embedding, doc_emb) / (
            np.linalg.norm(query_embedding) * np.linalg.norm(doc_emb)
        )
        similarities.append(similarity)

    # Get top-k most similar documents
    top_indices = np.argsort(similarities)[-top_k:][::-1]
    return [documents[i] for i in top_indices]

# Create RAG prompt template
rag_prompt = PromptTemplate.from_template("""
Use the following context to answer the question. If the answer isn't in the context, say so.

Context:
{context}

Question: {question}

Answer:
""")

The `similarity_search` function finds the documents most relevant to a question using the embeddings we created earlier. The prompt template structures how we present retrieved context to the language model, instructing it to base answers on the provided documents rather than general knowledge.

```python
def answer_question(question):
    """Generate an answer using retrieved context"""
    # Retrieve relevant documents
    relevant_docs = similarity_search(question, top_k=2)
    context = "\n".join([doc.page_content for doc in relevant_docs])

    # Generate answer using context
    prompt_text = rag_prompt.format(context=context, question=question)
    response = chat_model.invoke([{"role": "user", "content": prompt_text}])

    return response.content, relevant_docs

# Test the RAG system
question = "What makes Python popular for data science?"
answer, sources = answer_question(question)

print(f"Question: {question}")
print(f"Answer: {answer}")
print(f"Sources: {[doc.page_content[:50] + '...' for doc in sources]}")
<think>
...
</think>

Python is popular for data science because of its simplicity and readability, which make it easy to learn and use for tasks like data preprocessing.
Sources: ['Python is a high-level programming language known ...', 'Data preprocessing involves cleaning, changing, an...']

The complete RAG system retrieves relevant documents, presents them as context to the language model, and generates answers based on that specific information. This approach grounds responses in your actual documentation rather than the model’s general training data.

RAG systems address several common data science workflow challenges:

  • Project handoffs: New team members can query past work to understand methodologies and results
  • Literature review: Researchers can search large paper collections for relevant techniques and findings
  • Data documentation: Teams can build searchable knowledge bases about datasets, features, and processing steps
  • Reproducibility: Stakeholders can find detailed information about how analyses were conducted

The RAG approach combines semantic search precision with natural language generation fluency. Instead of manually searching through documents or receiving generic answers from language models, you get accurate responses backed by specific sources from your knowledge base.
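
As the knowledge base grows, the hand-rolled cosine-similarity loop is the first thing worth replacing. LangChain's vector store abstraction wraps the same embed-and-search pattern; here is a sketch using the in-memory store from `langchain-core` (for larger collections you would swap in a persistent store such as Chroma or FAISS):

```python
from langchain_core.vectorstores import InMemoryVectorStore

# Index the same documents in a LangChain vector store
vector_store = InMemoryVectorStore.from_documents(documents, embedding=embeddings)

# Retrieval replaces the manual similarity_search function
relevant_docs = vector_store.similarity_search("What makes Python popular for data science?", k=2)
for doc in relevant_docs:
    print(doc.page_content)
```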

## Conclusion

This tutorial demonstrated how to integrate LangChain with Ollama for local LLM execution. You learned to set up Ollama, download models, and use ChatOllama and OllamaLLM for various tasks. We also covered customizing model parameters, building LangChain chains, and working with embeddings. By running models locally, you maintain data privacy and control, which is suitable for many data science applications. For further learning, refer to the LangChain and Ollama official documentation.
