Table of Contents
- Why Local AI Matters
- Introduction to Ollama and LangChain
- LangChain + Ollama: Integration Tutorial
- Conclusion
Why Local AI Matters
AI models are transforming data science projects by automating feature engineering, summarizing datasets, generating reports, and even writing code to examine or clean data.
However, using popular APIs like OpenAI or Anthropic can introduce serious privacy risks, especially when handling regulated data such as medical records, legal documents, or internal company knowledge. These services transmit user inputs to remote servers, making it difficult to guarantee confidentiality or data residency compliance.
When data privacy is important, running models locally ensures full control. Nothing leaves your machine, so you manage all inputs, outputs, and processing securely.
That’s where LangChain and Ollama come in. LangChain provides the framework to build AI applications. Ollama lets you run open-source models locally. This guide shows you how to combine both tools to create privacy-preserving AI workflows that process sensitive data exclusively on your own machine.
Introduction to Ollama and LangChain
Before diving into integration steps, let’s understand both tools.
What is Ollama?
Ollama is an open-source tool that makes it easy to run large language models locally. It offers a simple CLI and REST API for downloading and interacting with popular models like Llama, Mistral, DeepSeek, and Gemma—no complex setup required.
Since Ollama doesn’t depend on external APIs, it is ideal for sensitive data or limited-connectivity environments.
What is LangChain?
LangChain is a framework for creating AI applications using language models.
Rather than writing custom code for model interactions, response handling, and error management, you can use LangChain’s ready-made components to build applications, which saves time and reduces boilerplate.
LangChain + Ollama: Integration Tutorial
Now that we understand the core technology, let’s see how to integrate LangChain with Ollama to run models locally.
Installation and Setup
To run local AI models, first install the LangChain packages, including the Ollama integration:
pip install langchain langchain-community langchain-ollama
Ollama needs to be installed separately since it’s a standalone service that runs locally:
- For macOS: Download the installer from ollama.com; it sets up both the CLI tool and the background service.
- For Linux: Run the install script, which sets up both the binary and the system service:
curl -fsSL https://ollama.com/install.sh | sh
- For Windows: Download the Windows (Preview) installer from ollama.com; it is still in preview and has some limitations.
Start the Ollama server:
ollama serve
The server will run in the background, handling model loading and inference requests.
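Before wiring anything into LangChain, you can confirm the server is reachable by querying Ollama's REST API directly. A minimal sketch using only the Python standard library, assuming the default port 11434:

```python
# List locally available models via Ollama's REST API (default local endpoint)
import json
import urllib.request

with urllib.request.urlopen("http://localhost:11434/api/tags") as resp:
    data = json.load(resp)

print([m["name"] for m in data.get("models", [])])  # empty list until you pull a model
```

An empty list simply means no models have been pulled yet, which the next step takes care of.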
Pulling Models with Ollama
Before using any model with LangChain, you need to pull it to your local machine with Ollama:
ollama pull qwen3:0.6b
Once the download finishes, you can run the model interactively with the following command:
ollama run qwen3:0.6b
The model size has a large impact on performance and resource requirements:
- Smaller models (7B-8B) run well on most modern computers with 16GB+ RAM
- Medium models (13B-34B) need more RAM or GPU acceleration
- Large models (70B+) typically require a dedicated GPU with 24GB+ VRAM
For a full list of models you can serve locally, check out the Ollama model library. Before pulling a model and potentially wasting disk space and bandwidth, check the VRAM calculator, which tells you whether your machine can run a specific model.
Basic Chat Integration
Once you have a model downloaded, you need to connect LangChain to Ollama for actual AI interactions. LangChain uses dedicated classes that handle the communication between your Python code and the Ollama service:
```python
from langchain_ollama import ChatOllama
from langchain_core.messages import HumanMessage, SystemMessage

# Initialize the chat model with specific configurations
chat_model = ChatOllama(
    model="qwen3:0.6b",
    temperature=0.5,
    base_url="http://localhost:11434",  # Can be changed for remote Ollama instances
)

# Create a conversation with system and user messages
messages = [
    SystemMessage(content="You are a helpful coding assistant specialized in Python."),
    HumanMessage(content="Write a recursive Fibonacci function with memoization."),
]

# Invoke the model
response = chat_model.invoke(messages)
print(response.content[:200])
```

```plaintext
First, the recursive approach usually involves a function that calculates the …
```
This snippet:
* Imports `ChatOllama` from `langchain_ollama` to interface with a local or remote Ollama model
* Imports `HumanMessage` and `SystemMessage` for structured message creation
* Initializes the `ChatOllama` instance
* Constructs a conversation consisting of a `SystemMessage` and a `HumanMessage`
* Sends the message list to the model using `invoke()`
Under the hood, `ChatOllama`:
1. Converts LangChain message objects into Ollama API format
2. Makes HTTP POST requests to the `/api/chat` endpoint (see the raw-request sketch after this list)
3. Processes streaming responses when activated
4. Parses the response back into LangChain message objects
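To make the HTTP step concrete, here is a rough, standard-library-only equivalent of the request `ChatOllama` sends. This is a sketch assuming the default local endpoint and the `qwen3:0.6b` model pulled earlier; exact response fields may vary across Ollama versions:

```python
# Send a chat request straight to Ollama's /api/chat endpoint
import json
import urllib.request

payload = {
    "model": "qwen3:0.6b",
    "messages": [
        {"role": "system", "content": "You are a helpful coding assistant specialized in Python."},
        {"role": "user", "content": "Write a recursive Fibonacci function with memoization."},
    ],
    "stream": False,  # return a single JSON response instead of a token stream
}

request = urllib.request.Request(
    "http://localhost:11434/api/chat",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(request) as response:
    body = json.load(response)

# The assistant's reply lives under the "message" key
print(body["message"]["content"][:200])
```

`ChatOllama` hides this plumbing, plus retries, message conversion, and streaming, behind a single `invoke()` call.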
The `ChatOllama` class also supports asynchronous operations, allowing data scientists to run multiple model calls in parallel—ideal for building responsive, non-blocking applications like dashboards or chat interfaces:
```python
async def generate_async():
    response = await chat_model.ainvoke(messages)
    return response.content

# In async context
result = await generate_async()
print(result[:200])
```

```plaintext
First, the Fibonacci sequence is defined such that each number is the sum of …
```
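Building on the async call above, several prompts can be dispatched concurrently with `asyncio.gather`. A minimal sketch with illustrative questions; actual parallelism depends on how your Ollama server is configured:

```python
import asyncio
from langchain_core.messages import HumanMessage

async def ask(question: str) -> str:
    response = await chat_model.ainvoke([HumanMessage(content=question)])
    return response.content

async def main():
    questions = [
        "Summarize memoization in one sentence.",
        "Name two common uses of recursion.",
    ]
    # Dispatch both requests without waiting for the first to finish
    answers = await asyncio.gather(*(ask(q) for q in questions))
    for question, answer in zip(questions, answers):
        print(f"{question} -> {answer[:80]}")

# In a script; inside a notebook with a running event loop, use `await main()` instead
asyncio.run(main())
```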
Using Completion Models
Chat models are great for conversation, but data science tasks like code generation, document completion, and creative writing often benefit from text completion instead.
The `OllamaLLM` class supports this mode, letting the model continue from a given input:
```python
from langchain_ollama import OllamaLLM

# Initialize the LLM with specific options
llm = OllamaLLM(
    model="qwen3:0.6b",
)

# Generate text from a prompt
text = """
Write a quick sort algorithm in Python with detailed comments:
```python
def quicksort(
"""

response = llm.invoke(text)
print(response[:500])
```

```plaintext
First, I should define the function signature. The parameters are the array, and maybe a left and …
```
The key differences between the `ChatOllama` and `OllamaLLM` classes:
- `OllamaLLM` uses the `/api/generate` endpoint for text completion
- `ChatOllama` uses the `/api/chat` endpoint for chat-style interactions
- Completion is better for code continuation, creative writing, and single-turn prompts
- Chat is better for multi-turn conversations and when using system prompts
For streaming responses (showing tokens as they're generated):
```python
for chunk in llm.stream("Explain quantum computing in three sentences:"):
    print(chunk, end="", flush=True)
```

```plaintext
…
```
Use streaming responses to display output in real time, making interactive apps like chatbots feel faster and more responsive.
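Streaming works the same way with the chat model; each chunk is a message fragment whose `content` attribute holds the newly generated text. A short sketch reusing the `chat_model` defined earlier:

```python
# Stream a chat response token by token
for chunk in chat_model.stream("Explain quantum computing in three sentences:"):
    print(chunk.content, end="", flush=True)
print()  # final newline once the stream ends
```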
Customizing Model Parameters
Both completion and chat models use default settings that work reasonably well, but data science tasks often need more tailored model behavior. For example:
- Scientific analysis needs precise, factual responses
- Creative tasks benefit from more randomness and variety
Ollama offers fine-grained control over generation parameters:
```python
llm = OllamaLLM(
    model="qwen3:0.6b",   # Example model; can be any model supported by Ollama
    temperature=0.7,      # Controls randomness (0.0 = deterministic, 1.0 = creative)
    stop=["```", "###"],  # Stop sequences to end generation
    repeat_penalty=1.1,   # Penalizes repetition (>1.0 reduces repetition)
)
```
Details about these parameters:
- `model`: Specifies the language model to use.
- `temperature`: Controls randomness; lower = more focused, higher = more creative.
- `stop`: Defines stop sequences that terminate generation early. Once one of these sequences is produced, the model stops generating further tokens.
- `repeat_penalty`: Penalizes repeated tokens to reduce redundancy. Values greater than 1.0 discourage the model from repeating itself.
Parameter recommendations:
- For factual or technical responses: Lower `temperature` (0.1-0.3) and higher `repeat_penalty` (1.1-1.2)
- For creative writing: Higher `temperature` (0.7-0.9)
- For code generation: Medium `temperature` (0.3-0.6) with specific `stop` sequences like "```"
The model behavior changes dramatically with these settings. For example:
```python
# Scientific writing with precise output
scientific_llm = OllamaLLM(model="qwen3:0.6b", temperature=0.1, repeat_penalty=1.2)

# Creative storytelling
creative_llm = OllamaLLM(model="qwen3:0.6b", temperature=0.9, repeat_penalty=1.0)

# Code generation
code_llm = OllamaLLM(model="codellama", temperature=0.3, stop=["```", "def "])
```
Creating LangChain Chains
AI workflows often involve multiple steps: data validation, prompt formatting, model inference, and output processing. Running these steps manually for each request becomes repetitive and error-prone.
LangChain addresses this by chaining steps into sequences to create end-to-end applications:
```python
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
import json

# Create a structured prompt template
prompt = PromptTemplate.from_template("""
You are an expert educator.
Explain the following concept in simple terms that a beginner would understand.
Make sure to provide:
1. A clear definition
2. A real-world analogy
3. A practical example

Concept: {concept}
""")
```
First, we import the required LangChain components and create a prompt template. The `PromptTemplate.from_template()` method creates a reusable template with placeholder variables (like `{concept}`) that get filled in at runtime.
```python
# Create a parser that extracts structured data
class JsonOutputParser:
    def parse(self, text):
        try:
            # Find JSON blocks in the text
            if "```json" in text and "```" in text.split("```json")[1]:
                json_str = text.split("```json")[1].split("```")[0].strip()
                return json.loads(json_str)
            # Try to parse the whole text as JSON
            return json.loads(text)
        except (json.JSONDecodeError, IndexError):
            # Fall back to returning the raw text
            return {"raw_output": text}

# Initialize a model instance to be used in the chain
llm = OllamaLLM(model="qwen3:0.6b")
```
Next, we define a custom output parser. This class attempts to extract JSON from the model’s response, handling both code-block format and raw JSON. If parsing fails, it returns the original text wrapped in a dictionary.
```python
# Build a more complex chain
chain = (
    {"concept": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Execute the chain
result = chain.invoke("Recursive neural networks")
print(result[:500])
```
Finally, we build the chain using LangChain’s pipe operator (`|`). The `RunnablePassthrough()` passes input directly to the prompt template, which formats it and sends it to the LLM. The `StrOutputParser()` converts the response to a string. Here is the output:
```output
Now, the user wants a real-world analogy…
```
The chain architecture allows you to:
1. Pre-process inputs before sending to the model
2. Transform model outputs into structured data (see the sketch after this list)
3. Chain multiple models together
4. Add memory and context management
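For example, the `JsonOutputParser` defined earlier can be dropped into a chain by wrapping its `parse` method in a `RunnableLambda`. This is a sketch that assumes the prompt explicitly asks the model for JSON; small models may still return free-form text, in which case the parser falls back to `raw_output`:

```python
from langchain_core.runnables import RunnableLambda

# A prompt that asks for JSON so the custom parser has something to extract
json_prompt = PromptTemplate.from_template(
    "Return a JSON object with the keys 'definition' and 'example' for the concept: {concept}"
)

# Wrap the plain Python parser so it can participate in the chain
structured_chain = (
    {"concept": RunnablePassthrough()}
    | json_prompt
    | llm
    | RunnableLambda(JsonOutputParser().parse)
)

structured_result = structured_chain.invoke("Gradient descent")
print(type(structured_result))  # dict on success, {'raw_output': ...} otherwise
```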
Working with Embeddings
Embeddings convert text into numerical vectors that capture semantic meaning, allowing computers to understand relationships between words and documents mathematically. Ollama supports specialized embedding models that excel at this conversion.
First, let's set up the embedding model and understand what we're working with:
```python
from langchain_ollama import OllamaEmbeddings
import numpy as np

# Initialize embeddings model with specific parameters
embeddings = OllamaEmbeddings(
    model="nomic-embed-text",  # Specialized embedding model also supported by Ollama
    base_url="http://localhost:11434",
)
```
The `nomic-embed-text` model is designed specifically for creating high-quality text embeddings. Unlike general language models that generate text, embedding models focus solely on converting text into meaningful vector representations.
Now let’s create an embedding for a sample query and examine its properties:
```python
# Create embeddings for a query
query = "How do neural networks learn?"
query_embedding = embeddings.embed_query(query)
print(f"Embedding dimension: {len(query_embedding)}")
```

```plaintext
Embedding dimension: 768
```
The 768-dimensional vector represents our query in mathematical space. Each dimension captures different semantic features – some might relate to technical concepts, others to question patterns, and so on. Words with similar meanings will have vectors that point in similar directions.
Next, we’ll create embeddings for multiple documents to demonstrate similarity matching:
```python
# Create embeddings for multiple documents
documents = [
    "Neural networks learn through backpropagation",
    "Transformers use attention mechanisms",
    "LLMs are trained on text data",
]
doc_embeddings = embeddings.embed_documents(documents)
```
The `embed_documents()` method processes multiple texts at once, which is more efficient than calling `embed_query()` repeatedly. This batch processing saves time when working with large document collections.
To find which document best matches our query, we need to measure similarity between vectors. Cosine similarity is the standard approach:
```python
# Calculate similarity between vectors
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Find the most similar document to the query
similarities = [cosine_similarity(query_embedding, doc_emb) for doc_emb in doc_embeddings]
most_similar_idx = np.argmax(similarities)

print(f"Most similar document: {documents[most_similar_idx]}")
print(f"Similarity score: {similarities[most_similar_idx]:.3f}")
```

```plaintext
Most similar document: Neural networks learn through backpropagation
Similarity score: 0.847
```
Cosine similarity returns values between -1 and 1, where 1 means identical meaning, 0 means unrelated, and -1 means opposite meanings. Our score of 0.847 indicates strong semantic similarity between the query about neural network learning and the document about backpropagation.
These embeddings support several data science applications:
- Semantic search: Find documents by meaning rather than exact keyword matches
- Document clustering: Group related research papers, reports, or code documentation (see the clustering sketch after this list)
- Retrieval-Augmented Generation (RAG): Retrieve relevant context before generating responses
- Anomaly detection: Identify unusual or outlier documents in large collections
- Content recommendation: Suggest similar articles, datasets, or code examples
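As an illustration of the clustering use case, the embeddings created above can be grouped directly with k-means. This sketch assumes scikit-learn is installed (`pip install scikit-learn`); with only three sample documents the clusters are trivial, but the same pattern scales to large collections:

```python
from sklearn.cluster import KMeans
import numpy as np

vectors = np.array(doc_embeddings)  # shape: (n_documents, 768)

# Group the documents into two clusters based on embedding similarity
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(vectors)

for doc, label in zip(documents, kmeans.labels_):
    print(label, doc)
```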
When choosing embedding models for your projects, consider these factors:
- Dimension size: Larger dimensions (1024+) capture more nuance but require more storage and computation
- Domain specialization: Some models work better for scientific text, others for general content
- Processing speed: Smaller models like `nomic-embed-text` balance quality with performance
- Language support: Multilingual models handle multiple languages but may sacrifice quality for any single language
The quality of your embeddings directly impacts downstream tasks like search relevance and clustering accuracy. Always test different models with your specific data to find the best fit.
Building a Question-Answering System for Your Data
Data scientists work with extensive collections of research papers, project documentation, and dataset descriptions. When stakeholders ask questions like “What preprocessing steps were used in the customer churn analysis?” or “Which machine learning models performed best for fraud detection?”, manual document search becomes time-consuming and error-prone.
Standard language models can’t answer these domain-specific questions because they lack access to your particular data and documentation. You need a system that searches your documents and generates accurate, source-backed answers.
Retrieval-Augmented Generation (RAG) solves this problem by combining the semantic search capabilities we just built with text generation. RAG retrieves relevant information from your documents and uses it to answer questions with proper attribution.
Here’s how to build a RAG system using the embeddings and chat models we’ve already configured:
```python
from langchain_ollama import OllamaEmbeddings, ChatOllama
from langchain_core.documents import Document
from langchain_core.prompts import PromptTemplate
import numpy as np

# Initialize components
embeddings = OllamaEmbeddings(model="nomic-embed-text")
chat_model = ChatOllama(model="qwen3:0.6b", temperature=0.3)

# Sample knowledge base representing project documentation
documents = [
    Document(page_content="Python is a high-level programming language known for its simplicity and readability."),
    Document(page_content="Machine learning algorithms can automatically learn patterns from data without explicit programming."),
    Document(page_content="Data preprocessing involves cleaning, changing, and organizing raw data for analysis."),
    Document(page_content="Neural networks are computational models inspired by biological brain networks."),
]

# Create embeddings for all documents
doc_embeddings = embeddings.embed_documents([doc.page_content for doc in documents])
```
This setup creates a searchable knowledge base from your documents. In production systems, these documents would contain sections from research papers, methodology descriptions, data analysis reports, or code documentation. The embeddings convert each document into vectors that support semantic search.
```python
def similarity_search(query, top_k=2):
    """Find the most relevant documents for a query."""
    query_embedding = embeddings.embed_query(query)

    # Calculate cosine similarities
    similarities = []
    for doc_emb in doc_embeddings:
        similarity = np.dot(query_embedding, doc_emb) / (
            np.linalg.norm(query_embedding) * np.linalg.norm(doc_emb)
        )
        similarities.append(similarity)

    # Get the top-k most similar documents
    top_indices = np.argsort(similarities)[-top_k:][::-1]
    return [documents[i] for i in top_indices]

# Create the RAG prompt template
rag_prompt = PromptTemplate.from_template("""
Use the following context to answer the question. If the answer isn't in the context, say so.

Context:
{context}

Question: {question}

Answer:
""")
```
The `similarity_search` function finds documents most relevant to a question using the embeddings we created earlier. The prompt template structures how we present retrieved context to the language model, instructing it to base answers on the provided documents rather than general knowledge.
```python
def answer_question(question):
    """Generate an answer using retrieved context."""
    # Retrieve relevant documents
    relevant_docs = similarity_search(question, top_k=2)
    context = "\n".join([doc.page_content for doc in relevant_docs])

    # Generate an answer using the retrieved context
    prompt_text = rag_prompt.format(context=context, question=question)
    response = chat_model.invoke([{"role": "user", "content": prompt_text}])
    return response.content, relevant_docs

# Test the RAG system
question = "What makes Python popular for data science?"
answer, sources = answer_question(question)

print(f"Question: {question}")
print(f"Answer: {answer}")
print(f"Sources: {[doc.page_content[:50] + '...' for doc in sources]}")
```
```plaintext
<think>
...
</think>
Python is popular for data science because of its simplicity and readability, which make it easy to learn and use for tasks like data preprocessing.
Sources: ['Python is a high-level programming language known ...', 'Data preprocessing involves cleaning, changing, an...']
```
The complete RAG system retrieves relevant documents, presents them as context to the language model, and generates answers based on that specific information. This approach grounds responses in your actual documentation rather than the model’s general training data.
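You can also probe how the system behaves when the knowledge base does not contain an answer; the prompt instructs the model to say so rather than guess. A quick check, with the exact wording of the refusal depending on the model:

```python
# Ask something the small sample knowledge base cannot answer
question = "Which database does the project use?"
answer, sources = answer_question(question)
print(answer)  # expected: a statement that the context does not contain the answer
```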
RAG systems address several common data science workflow challenges:
- Project handoffs: New team members can query past work to understand methodologies and results
- Literature review: Researchers can search large paper collections for relevant techniques and findings
- Data documentation: Teams can build searchable knowledge bases about datasets, features, and processing steps
- Reproducibility: Stakeholders can find detailed information about how analyses were conducted
The RAG approach combines semantic search precision with natural language generation fluency. Instead of manually searching through documents or receiving generic answers from language models, you get accurate responses backed by specific sources from your knowledge base.
Conclusion
This tutorial demonstrated how to integrate LangChain with Ollama for local LLM execution. You learned to set up Ollama, download models, and use `ChatOllama` and `OllamaLLM` for various tasks. We also covered customizing model parameters, building LangChain chains, and working with embeddings. By running models locally, you maintain data privacy and control, which is suitable for many data science applications. For further learning, refer to the LangChain and Ollama official documentation.