Pinecone Vector Database Tutorial: Semantic Search with Ollama and Pinecone

Introduction

Have you ever searched for something online and been frustrated when the search engine couldn’t find what you were looking for, even though you knew it existed? Traditional search relies on exact word matching, so searching for “Denim for men” won’t return products labeled “Men’s Jeans 32W” or “Men’s Trendy Jeans.”

Semantic search solves this by understanding meaning and context. It recognizes that “denim” relates to “jeans” and “men” connects to “32W” in clothing contexts, returning relevant results despite different wording.

This tutorial walks through building a complete semantic search pipeline in Python: Ollama generates embeddings locally, Pinecone stores and searches the vectors, and LangChain orchestrates the steps.

By the end, you’ll have a working system that processes documents locally and stores them in a scalable vector database for fast similarity searches.

What Is Pinecone?

Pinecone is a vector database designed specifically for storing and searching high-dimensional data. But what does that mean for data scientists?

Think of traditional databases like spreadsheets: they store text, numbers, and dates in rows and columns. Vector databases like Pinecone store mathematical representations of data called vectors (arrays of numbers that capture meaning).

Here’s a simple comparison:

Traditional Database:

ID   Product Name      Category
1    Men’s Jeans 32W   Clothing
2    Denim Pants       Apparel

Vector Database:

ID   Product Vector
1    [0.2, 0.8, 0.1, 0.9, …]
2    [0.3, 0.7, 0.2, 0.8, …]

By storing data as vectors, Pinecone enables semantic understanding beyond what traditional databases can achieve. Similar products have similar vector representations, allowing Pinecone to find relevant results based on meaning rather than exact keyword matches.

The visualization demonstrates semantic search in action where distance between dots represents similarity. The dashed circle shows how “denim” products (pink and blue dots) cluster close together, while the leather jacket (green dot) sits farther away, indicating lower semantic similarity.
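
To make "similar vectors" concrete, here is a minimal sketch that computes cosine similarity by hand. It uses only the first four components shown in the table above; the leather_jacket vector is a made-up value for illustration and does not come from a real embedding model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# First four components of the two product vectors from the table above
mens_jeans = [0.2, 0.8, 0.1, 0.9]
denim_pants = [0.3, 0.7, 0.2, 0.8]

# Hypothetical vector for an unrelated product (assumption, for illustration only)
leather_jacket = [0.9, 0.1, 0.8, 0.2]

jeans_vs_denim = cosine_similarity(mens_jeans, denim_pants)
jeans_vs_jacket = cosine_similarity(mens_jeans, leather_jacket)

print(f"jeans vs denim:  {jeans_vs_denim:.2f}")
print(f"jeans vs jacket: {jeans_vs_jacket:.2f}")
```

The two denim products score roughly 0.99, while the jacket scores roughly 0.35 against the jeans. A vector database applies the same idea at scale, with indexes that avoid comparing the query against every stored vector.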

Pinecone stands out among vector databases with its cloud-native design that eliminates infrastructure management:

  • Serverless architecture: No need to manage clusters or configure hardware, unlike self-hosted options like Weaviate or Qdrant
  • Automatic scaling: Handles millions of vectors without manual intervention, unlike pgvector which requires PostgreSQL tuning
  • Built-in monitoring: Provides performance metrics and alerts out-of-the-box, reducing operational overhead
  • Optimized indexing: Uses proprietary indexing algorithms designed for low-latency search across very large datasets

What Are Ollama and LangChain?

Ollama runs open models on your own machine for free, so your data never leaves it, unlike paid cloud APIs such as those from OpenAI or Anthropic.

LangChain serves as the orchestration framework for building LLM applications efficiently. For comprehensive LangChain fundamentals, see our LangChain and Ollama guide.

Overview of the Architecture

The architecture follows a straightforward design:

  • Ollama: Generates embeddings locally
  • Pinecone: Stores vectors and performs similarity search
  • LangChain: Orchestrates the entire pipeline

Step-by-Step Implementation

Ollama Setup

To set up Ollama locally, download and install it first. For Linux users, run the installation script:

# For Linux users
curl -fsSL https://ollama.com/install.sh | sh

Once installation completes, start the Ollama server locally.

ollama serve

Next, download a model specifically designed for generating embeddings. We’ll use mxbai-embed-large, which is optimized for semantic search tasks.

ollama pull mxbai-embed-large

With Ollama installed and the embedding model downloaded, we’re ready to start coding.

Loading Text Data

Next, we’ll prepare our text data for embedding generation. For this tutorial, we’ll work with PDF documents stored in a data folder using LangChain’s document loader.

from langchain.schema import Document
from langchain_community.document_loaders import PyPDFDirectoryLoader

# Load every PDF in the given directory as LangChain documents
def read_doc(directory):
    file_loader = PyPDFDirectoryLoader(directory)
    documents = file_loader.load()
    if not documents:
        raise ValueError("No documents found in the specified directory.")
    return documents

docs = read_doc("data/")

The documents used in this tutorial are research papers, chosen for their consistent format. Before generating embeddings, we need to split large documents into smaller chunks through a process called text chunking.

Text Chunking

Text chunking is essential for processing large documents before embedding generation. The process breaks down documents into smaller, manageable pieces that embedding models can process effectively within their token limits.

Key concepts for chunking are:

  • Chunk size (1000 chars): Maximum characters per chunk, ensuring each piece fits within embedding model token limits.
  • Chunk overlap (50 chars): Characters that overlap between adjacent chunks to prevent important information from being split. (Example: Chunk 1 ends with “…model performance” and Chunk 2 starts with “model performance improves…”)
  • Hierarchical separators: Natural text boundaries like paragraphs (\n\n), sentences (.), and words (" ") that preserve meaning when splitting
  • Minimum length filtering (50 chars): Removes chunks too short to contain meaningful semantic information. (Example: Filters out headers like “Introduction” or “Figure 1”)

from langchain.text_splitter import RecursiveCharacterTextSplitter

def chunk_data(docs, chunk_size=1000, chunk_overlap=50, min_length=50):
    """Split documents into smaller chunks for embedding generation."""

    # Configure text splitter with hierarchical separators
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", ".", "!", "?", ",", " ", ""],
        add_start_index=True
    )

    # Split documents into chunks
    chunks = text_splitter.split_documents(docs)

    # Filter out headers and short chunks
    filtered_chunks = [
        chunk for chunk in chunks
        if not chunk.page_content.startswith("NeurIPS 2023") and
           len(chunk.page_content.strip()) >= min_length
    ]

    return filtered_chunks

# Process documents into chunks
chunks = chunk_data(docs)

The hierarchical separators ["\n\n", "\n", ".", "!", "?", ",", " ", ""] ensure text breaks at natural boundaries. The splitter prefers paragraph breaks, then line breaks, then sentence endings, maintaining content integrity while creating optimal chunk sizes for semantic search.
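
To see what chunk size and chunk overlap mean in isolation, here is a minimal sliding-window sketch in plain Python. It is a simplification for illustration: it cuts at fixed character offsets and ignores separators, so it is not how RecursiveCharacterTextSplitter works internally. The function name sliding_chunks and the tiny sizes are assumptions chosen to keep the output readable.

```python
def sliding_chunks(text, chunk_size=10, chunk_overlap=3):
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once this window has reached the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks

demo_chunks = sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
for chunk in demo_chunks:
    print(chunk)
```

Each chunk begins with the last three characters of the previous one, which is the role the 50-character overlap plays in chunk_data. The real splitter additionally tries to move each cut to the nearest preferred separator.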

Embeddings

We handle embeddings in two steps:

  • Initialize OllamaEmbeddings
  • Store the embeddings in Pinecone

Python Vector Database with Ollama Embeddings

We generate embeddings with an Ollama embedding model (mxbai-embed-large in this case). Because the model runs locally, document text is not sent to a third-party embedding API, which helps with data privacy and avoids per-request API costs.

from langchain_ollama.embeddings import OllamaEmbeddings

# Initialize Ollama embeddings for Python vector database
# Using mxbai-embed-large model for high-quality embeddings
embeddings = OllamaEmbeddings(model="mxbai-embed-large")

LangChain Pinecone Integration Setup

Next, we configure the Pinecone vector database for the LangChain Pinecone integration. Create a (free) account on the Pinecone website, then copy your API key from the console. Before running the setup code, store the key in a .env file.

The .env file will look something like this:

PINECONE_API_KEY="your-api-key"
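
Because os.environ.get returns None when a variable is missing, a misconfigured .env file can otherwise surface later as a confusing authentication error. The following is a small sketch of a fail-fast check; require_env is a hypothetical helper name, not part of Pinecone or python-dotenv.

```python
import os

def require_env(name):
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or shell environment."
        )
    return value

# Typical use, after load_dotenv() has run:
# api_key = require_env("PINECONE_API_KEY")
```

Passing the returned value to the Pinecone client gives a clear error message at startup instead of a failure on the first API call.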

Pinecone organizes vector data in indexes, with records partitioned into namespaces within each index. We’ll create an index named semantic-search-local that uses cosine similarity and has 1024 dimensions, matching the size of the vectors that mxbai-embed-large produces.
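
The index dimension has to match the length of the vectors the embedding model returns, otherwise uploads will be rejected. Here is a hedged sketch of a sanity check that works with any object exposing LangChain's embed_query method. The helper name check_embedding_dimension and the FakeEmbedder stand-in are assumptions so the example runs without an Ollama server.

```python
def check_embedding_dimension(embedder, expected_dim=1024, probe_text="dimension probe"):
    """Embed a short probe string and verify the vector length matches the index."""
    vector = embedder.embed_query(probe_text)
    if len(vector) != expected_dim:
        raise ValueError(
            f"Embedding has {len(vector)} dimensions, but the index expects {expected_dim}."
        )
    return len(vector)

class FakeEmbedder:
    """Stand-in for OllamaEmbeddings so the sketch runs without a local server."""

    def __init__(self, dim):
        self.dim = dim

    def embed_query(self, text):
        return [0.0] * self.dim

print(check_embedding_dimension(FakeEmbedder(1024)))  # 1024
```

With the real setup, calling check_embedding_dimension(embeddings) before creating the index would catch a mismatch early, for example after switching to a different embedding model.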

The free tier restricts us to the us-east-1 region, and we’ll enable deletion protection to prevent accidental data loss. Here’s how to set up the index:

import os
from pinecone import Pinecone, ServerlessSpec
from langchain_pinecone import PineconeVectorStore
from dotenv import load_dotenv
import time

load_dotenv()

# Initialize Pinecone client
pc = Pinecone(api_key=os.environ.get("PINECONE_API_KEY"))
index_name = "semantic-search-local"
namespace = f"{index_name}-namespace"  # same namespace the vectors are stored under below

# Check if index already exists
existing_indexes = [index_info["name"] for index_info in pc.list_indexes()]

# Create index if it doesn't exist
if index_name not in existing_indexes:
    print(f"Creating index '{index_name}'...")
    pc.create_index(
        name=index_name,
        dimension=1024,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
        deletion_protection="enabled"
    )

    # Wait for index to be ready
    print("Waiting for index to be ready...")
    while not pc.describe_index(index_name).status["ready"]:
        time.sleep(1)
    print(f"Index '{index_name}' created and ready.")
else:
    print(f"Index '{index_name}' already exists. Connecting...")

To take a quick look at the index, call describe_index_stats().

# Connect to index
index = pc.Index(index_name)
index.describe_index_stats()

On a run where the index already exists and has been populated, the output looks like this:

Index 'semantic-search-local' already exists. Connecting...
{'dimension': 1024,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {'semantic-search-local-namespace': {'vector_count': 139}},
 'total_vector_count': 139,
 'vector_type': 'dense'}

Next, we populate the vector store with the embeddings of our chunks.

vector_store = PineconeVectorStore.from_documents(
    documents=chunks,
    index_name=index_name,
    embedding=embeddings,
    namespace=index_name + "-namespace",
)

Implementing Semantic Search with Python Vector Database

With our Ollama embeddings stored in the Pinecone vector database, we can now perform semantic search using our Python vector database implementation. The process follows three simple steps common to all vector databases.

The search methodology involves converting queries to Ollama embeddings, comparing them with stored vectors in the Pinecone vector database, and retrieving the most similar results. We will:

  • Use k=2 to get the top 2 matches.
  • Set a similarity threshold of 0.5 to filter out low-quality results.

# cosine similarity search
# get the data from the database itself (VectorDB)
def retrieve_query(query, k=2, score_threshold=0.5):
    matching_result = vector_store.similarity_search_with_score(query, k=k)
    filtered = [r for r in matching_result if r[1] >= score_threshold]
    return filtered

Now, we will check some queries and also see the similarity score.

query = "What were the key findings of the NeurIPS 2023 LLM Efficiency Fine-tuning Competition?"
x = retrieve_query(query=query, k=1, score_threshold=0.5)
for match, score in x:
    print(f"Score: {score:.3f}")
    print(match.page_content[:300])  # Show preview
    print("-" * 50)

The above code will generate an output similar to this:

Score: 0.736
for generative models and demonstrate the need for more robust evaluation meth-
ods. Notably, the winning submissions utilized standard open-source libraries and
focused primarily on data curation. To facilitate further research and promote
reproducibility, we release all competition entries, Docker

The similarity score of 0.736 indicates a strong semantic match between our query and the retrieved text. This score exceeds our 0.5 threshold, confirming the result’s relevance.

Since we set k=1, we’re retrieving only the single most similar document. This result represents the best semantic match in our entire document collection for the given query.
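
To show how k and the score threshold interact, here is a brute-force sketch of the same logic in plain Python. It is a conceptual model only: Pinecone uses an index rather than comparing against every vector, and the product IDs and vectors below are made-up illustration values, not real embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec, stored, k=2, score_threshold=0.5):
    """Rank stored vectors by similarity, keep the best k, then drop weak matches."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(doc_id, score) for doc_id, score in scored[:k] if score >= score_threshold]

# Hypothetical stored vectors (assumption, for illustration only)
stored = {
    "mens-jeans": [0.2, 0.8, 0.1, 0.9],
    "denim-pants": [0.3, 0.7, 0.2, 0.8],
    "leather-jacket": [0.9, 0.1, 0.8, 0.2],
}
query_vec = [0.25, 0.75, 0.15, 0.85]

results = top_k(query_vec, stored, k=3)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

Even though k=3 asks for three results, only the two denim items are returned because the jacket falls below the 0.5 threshold. The same can happen with retrieve_query: the threshold may leave you with fewer than k results, or none at all.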

Measuring Basic Performance

Performance monitoring is crucial for production systems, so let’s measure our search latency. This gives us baseline metrics for optimization decisions.

We’ll test with the same research-focused queries from our earlier example. These represent typical academic questions users might ask about NeurIPS papers:

test_queries = [
    "What were the key findings of the NeurIPS 2023 LLM Efficiency Fine-tuning Competition?",
    "How does self-preference bias manifest in LLM evaluators, and what evidence supports this?",
]

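Search times will vary based on dataset size, network conditions, and query complexity. The measurements below provide a starting point for performance expectations. Note that the sample output comes from a run with four queries, of which only the first two are listed above.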

def measure_latency(queries, k=3):
    """Time each query and return the list of per-query latencies in seconds."""
    latencies = []
    for query in queries:
        start = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
        retrieve_query(query, k=k)
        duration = time.perf_counter() - start
        latencies.append(duration)
        print(f"Query: {query[:50]}... | Search Time: {duration:.2f}s")
    avg_latency = sum(latencies) / len(latencies) if latencies else 0
    print(f"Average Search Time: {avg_latency:.2f}s")
    return latencies

latencies = measure_latency(test_queries, k=3)

Query: What were the key findings of the NeurIPS 2023 LLM... | Search Time: 0.74s
Query: How does self-preference bias manifest in LLM eval... | Search Time: 0.48s
Query: What evaluation metrics were used in the NeurIPS 2... | Search Time: 0.45s
Query: How can fine-tuning LLMs for self-recognition affe... | Search Time: 0.47s
Average Search Time: 0.54s

These search times (averaging 0.54 seconds) are reasonable for prototyping but may feel slow for production applications. For comparison, Google search typically returns results in under 0.2 seconds, while users expect sub-second responses for most applications.
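
An average over a handful of queries is easily skewed by one slow call, such as the first query in the run above. As a rough sketch, the standard library can summarize the same numbers with a median and a maximum as well. The helper names time_calls and summarize are assumptions introduced for this example.

```python
import statistics
import time

def time_calls(fn, inputs):
    """Call fn on each input and return the per-call wall-clock latencies in seconds."""
    latencies = []
    for item in inputs:
        start = time.perf_counter()
        fn(item)
        latencies.append(time.perf_counter() - start)
    return latencies

def summarize(latencies):
    """Return mean, median, and worst-case latency for a list of timings."""
    return {
        "mean": statistics.mean(latencies),
        "median": statistics.median(latencies),
        "max": max(latencies),
    }

# Search times reported in the sample output above
reported = [0.74, 0.48, 0.45, 0.47]
summary = summarize(reported)
print(summary)
```

For these four timings the median is lower than the mean because the first query took noticeably longer than the rest. With a larger query set, time_calls(retrieve_query, test_queries) would give a more representative sample to summarize.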

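Each measurement covers two stages: embedding the query locally with Ollama, and the network round trip to Pinecone for the similarity search. Free tier limitations may add to the latency, and a paid tier may reduce response times, though the actual gain depends on your workload and region.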

Conclusion

In this comprehensive Pinecone vector database tutorial, we built a fully functional semantic search pipeline using:

  • Ollama embeddings for local, privacy-focused processing
  • Pinecone vector database as a scalable cloud storage solution
  • LangChain Pinecone integration for seamless orchestration

This implementation balances privacy and scalability: Ollama generates embeddings locally, while Pinecone handles vector storage and search in the cloud. Note that the chunk text and vectors are still uploaded to Pinecone, so only the embedding step stays on your machine. The combination is a practical starting point for semantic search applications.

For complete data privacy and cost control, consider our pgvector PostgreSQL implementation which keeps everything local, including the vector database storage.
