python Archives

Newsletter #211: Secure Database Queries with DuckDB Parameters

Leave a Comment / Newsletter Archive / Khuyen Tran

🤝 COLLABORATION

How to write better DAGs in Airflow
DAGs (Directed Acyclic Graphs) are Airflow’s workflow definition format. They specify how data tasks connect and execute in sequence.
Well-designed DAGs handle edge cases, scale with data volume changes, and remain maintainable as your pipeline complexity grows.
What you’ll learn:

Design DAGs that are easier to read, test, and maintain
Make your pipelines adapt to your data at runtime with dynamic task mapping
Avoid common pitfalls that can cause performance issues
Create data-aware pipelines with XComs and event-driven scheduling
Learn proven DAG writing best practices including Airflow 3’s latest features

This covers practical patterns for building production-ready workflows that handle failures gracefully and scale with your data infrastructure needs.
Speakers:

Kenten Danas – Senior Manager, Developer Relations at Astronomer
Tamara Fingerlin – Developer Advocate at Astronomer

→ Register here

📅 Today’s Picks

Secure Database Queries with DuckDB Parameters

Problem
F-strings create SQL injection vulnerabilities by inserting values directly into queries.
Solution
DuckDB’s parameterized queries use placeholders to safely pass parameters and prevent SQL injection attacks.
Other key features of DuckDB:

In-Process Analytics – No external database needed
Fast Performance – Columnar storage for speed
Zero Setup – Works instantly in Python
DataFrame Integration – Native pandas support

📖 View Full Article

⭐ View GitHub

Build Semantic Text Matching with Sentence Transformers

Problem
RapidFuzz, which I introduced in my previous post, excels at lightning-fast string matching.
However, it cannot understand semantic relationships. It scores ‘running shoes’ vs ‘athletic footwear’ at only 0.267 despite describing similar product categories.
RapidFuzz compares characters, not meaning, so different words describing identical concepts get low scores.
Solution
Sentence Transformers comprehends conceptual similarity by analyzing word meanings.
Sentence Transformers follows this process:

Creates embedding vectors that represent word concepts
Similar meanings produce similar embedding patterns
Compares these concept embeddings to identify semantically similar text
Recognizes synonyms and related terminology automatically

📖 View Full Article

⭐ View GitHub

☕️ Weekly Finds

tenacity
[Testing & Reliability]
– Apache 2.0 licensed general-purpose retrying library for Python to simplify adding retry behavior to just about anything

ParadeDB
[Database & Search]
– Modern Elasticsearch alternative built on Postgres for real-time, update-heavy workloads with full-text search capabilities

responses
[Testing & Mocking]
– Utility library for mocking out the Python Requests library, making it easy to test HTTP API interactions

Looking for a specific tool? Explore 70+ Python tools →

Stay Current with CodeCut

Actionable Python tips, curated for busy data pros. Skim in under 2 minutes, three times a week.

.codecut-subscribe-form .codecut-input {
background: #2F2D2E !important;
border: 1px solid #72BEFA !important;
color: #FFFFFF !important;
}
.codecut-subscribe-form .codecut-input::placeholder {
color: #999999 !important;
}
.codecut-subscribe-form .codecut-subscribe-btn {
background: #72BEFA !important;
color: #2F2D2E !important;
}
.codecut-subscribe-form .codecut-subscribe-btn:hover {
background: #5aa8e8 !important;
}

.codecut-subscribe-form {
max-width: 650px;
display: flex;
flex-direction: column;
gap: 8px;
}
.codecut-input {
-webkit-appearance: none;
-moz-appearance: none;
appearance: none;
background: #FFFFFF;
border-radius: 8px !important;
padding: 8px 12px;
font-family: ‘Comfortaa’, sans-serif !important;
font-size: 14px !important;
color: #333333;
border: none !important;
outline: none;
width: 100%;
box-sizing: border-box;
}
input[type=”email”].codecut-input {
border-radius: 8px !important;
}
.codecut-input::placeholder {
color: #666666;
}
.codecut-email-row {
display: flex;
align-items: stretch;
height: 36px;
gap: 8px;
}
.codecut-email-row .codecut-input {
flex: 1;
}
.codecut-subscribe-btn {
background: #72BEFA;
color: #2F2D2E;
border: none;
border-radius: 8px;
padding: 8px 14px;
font-family: ‘Comfortaa’, sans-serif;
font-size: 14px;
font-weight: 500;
cursor: pointer;
text-decoration: none;
display: flex;
align-items: center;
justify-content: center;
transition: background 0.3s ease;
}
.codecut-subscribe-btn:hover {
background: #5aa8e8;
}
.codecut-subscribe-btn:disabled {
background: #999;
cursor: not-allowed;
}
.codecut-message {
font-family: ‘Comfortaa’, sans-serif;
font-size: 12px;
padding: 8px;
border-radius: 6px;
display: none;
}
.codecut-message.success {
background: #d4edda;
color: #155724;
display: block;
}
@media (max-width: 480px) {
.codecut-email-row {
flex-direction: column;
height: auto;
gap: 8px;
}
.codecut-input {
border-radius: 8px;
height: 36px;
}
.codecut-subscribe-btn {
width: 100%;
text-align: center;
border-radius: 8px;
height: 36px;
}
}

Newsletter #211: Secure Database Queries with DuckDB Parameters Read More »

Build a Complete RAG System with 5 Open-Source Tools

Leave a Comment / Blog, LLM / Khuyen Tran

Table of Contents

Introduction to RAG Systems
Document Ingestion with MarkItDown
Intelligent Chunking with LangChain
Creating Searchable Embeddings with SentenceTransformers
Building Your Knowledge Database with ChromaDB
Enhanced Answer Generation with Open-Source LLMs
Building a Simple Application with Gradio
Conclusion

Introduction
Have you ever spent 30 minutes searching through Slack threads, email attachments, and shared drives just to find that one technical specification your colleague mentioned last week?
It is a common scenario that repeats daily across organizations worldwide. Knowledge workers spend valuable time searching for information that should be instantly accessible, leading to decreased productivity.
Retrieval-Augmented Generation (RAG) systems solve this problem by transforming your documents into an intelligent, queryable knowledge base. Ask questions in natural language and receive instant answers with source citations, eliminating time-consuming manual searches.
In this article, we’ll build a complete RAG pipeline that turns document collections into an AI-powered question-answering system.

Stay Current with CodeCut
Actionable Python tips, curated for busy data pros. Skim in under 2 minutes, three times a week.

.codecut-subscribe-form {
max-width: 650px;
display: flex;
flex-direction: column;
gap: 8px;
}

.codecut-input {
-webkit-appearance: none;
-moz-appearance: none;
appearance: none;
background: #FFFFFF;
border-radius: 8px !important;
padding: 8px 12px;
font-family: ‘Comfortaa’, sans-serif !important;
font-size: 14px !important;
color: #333333;
border: none !important;
outline: none;
width: 100%;
box-sizing: border-box;
}

input[type=”email”].codecut-input {
border-radius: 8px !important;
}

.codecut-input::placeholder {
color: #666666;
}

.codecut-email-row {
display: flex;
align-items: stretch;
height: 36px;
gap: 8px;
}

.codecut-email-row .codecut-input {
flex: 1;
}

.codecut-subscribe-btn {
background: #72BEFA;
color: #2F2D2E;
border: none;
border-radius: 8px;
padding: 8px 14px;
font-family: ‘Comfortaa’, sans-serif;
font-size: 14px;
font-weight: 500;
cursor: pointer;
text-decoration: none;
display: flex;
align-items: center;
justify-content: center;
transition: background 0.3s ease;
}

.codecut-subscribe-btn:hover {
background: #5aa8e8;
}

.codecut-subscribe-btn:disabled {
background: #999;
cursor: not-allowed;
}

.codecut-message {
font-family: ‘Comfortaa’, sans-serif;
font-size: 12px;
padding: 8px;
border-radius: 6px;
display: none;
}

.codecut-message.success {
background: #d4edda;
color: #155724;
display: block;
}

/* Mobile responsive */
@media (max-width: 480px) {
.codecut-email-row {
flex-direction: column;
height: auto;
gap: 8px;
}

.codecut-input {
border-radius: 8px;
height: 36px;
}

.codecut-subscribe-btn {
width: 100%;
text-align: center;
border-radius: 8px;
height: 36px;
}
}

Key Takeaways
Here’s what you’ll learn:

Convert documents with MarkItDown in 3 lines
Chunk text intelligently using LangChain RecursiveCharacterTextSplitter
Generate embeddings locally with SentenceTransformers model
Store vectors in ChromaDB persistent database
Generate answers using Ollama local LLMs
Deploy web interface with Gradio streaming

💻 Get the Code: The complete source code and Jupyter notebook for this tutorial are available on GitHub. Clone it to follow along!

Introduction to RAG Systems
RAG (Retrieval-Augmented Generation) combines document retrieval with language generation to create intelligent Q&A systems. Instead of relying solely on training data, RAG systems search through your documents to find relevant information, then use that context to generate accurate, source-backed responses.
Environment Setup
Install the required libraries for building your RAG pipeline:
pip install markitdown[pdf] sentence-transformers langchain-text-splitters chromadb gradio langchain-ollama ollama

These libraries provide:

markitdown: Microsoft’s document conversion tool that transforms PDFs, Word docs, and other formats into clean markdown
sentence-transformers: Local embedding generation for converting text into searchable vectors
langchain-text-splitters: Intelligent text chunking that preserves semantic meaning
chromadb: Self-hosted vector database for storing and querying document embeddings
gradio: Web interface builder for creating user-friendly Q&A applications
langchain-ollama: LangChain integration for local LLM inference

Install Ollama and download a model:
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.2

Next, create a project directory structure to organize your files:
mkdir processed_docs documents

These directories organize your project:

processed_docs: Stores converted markdown files
documents: Contains original source files (PDFs, Word docs, etc.)

Create these directories in your current working path with appropriate read/write permissions.
Dataset Setup: Python Technical Documentation
To demonstrate the RAG pipeline, we’ll use “Think Python” by Allen Downey, a comprehensive programming guide freely available under Creative Commons.
We’ll download the Python guide and save it in the documents directory.
import requests
from pathlib import Path

# Get the file path
output_folder = "documents"
filename = "think_python_guide.pdf"
url = "https://greenteapress.com/thinkpython/thinkpython.pdf"
file_path = Path(output_folder) / filename

def download_file(url: str, file_path: Path):
response = requests.get(url, stream=True, timeout=30)
response.raise_for_status()
file_path.write_bytes(response.content)

# Download the file if it doesn't exist
if not file_path.exists():
download_file(
url=url,
file_path=file_path,
)

Next, let’s convert this PDF into a format that our RAG system can process and search through.
Document Ingestion with MarkItDown
RAG systems need documents in a structured format that AI models can understand and process effectively.
MarkItDown solves this challenge by converting any document format into clean markdown while preserving the original structure and meaning.
Converting Your Python Guide
Start by converting the Python guide to understand how MarkItDown works:
from markitdown import MarkItDown

# Initialize the converter
md = MarkItDown()

# Convert the Python guide to markdown
result = md.convert(file_path)
python_guide_content = result.text_content

# Display the conversion results
print("First 300 characters:")
print(python_guide_content[:300] + "…")

In this code:

MarkItDown() creates a document converter that handles multiple file formats automatically
convert() processes the PDF and returns a result object containing the extracted text
text_content provides the clean markdown text ready for processing

Output:
First 300 characters:
Think Python

How to Think Like a Computer Scientist

Version 2.0.17

Think Python

How to Think Like a Computer Scientist

Version 2.0.17

Allen Downey

Green Tea Press

Needham, Massachusetts

Green Tea Press
9 Washburn Ave
Needham MA 02492

Permission is granted…

MarkItDown automatically detects the PDF format and extracts clean text while preserving the book’s structure, including chapters, sections, and code examples.
Preparing Document for Processing
Now that you understand the basic conversion, let’s prepare the document content for processing. We’ll store the guide’s content with source information for later use in chunking and retrieval:
# Organize the converted document
processed_document = {
'source': file_path,
'content': python_guide_content
}

# Create a list containing our single document for consistency with downstream processing
documents = [processed_document]

# Document is now ready for chunking and embedding
print(f"Document ready: {len(processed_document['content']):,} characters")

Output:
Document ready: 460,251 characters

With our document successfully converted to markdown, the next step is breaking it into smaller, searchable pieces.
Intelligent Chunking with LangChain
AI models can’t process entire documents due to limited context windows. Chunking breaks documents into smaller, searchable pieces while preserving semantic meaning.
Understanding Text Chunking with a Simple Example
Let’s see how text chunking works with a simple document:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Create a simple example that will be split
sample_text = """
Machine learning transforms data processing. It enables pattern recognition without explicit programming.

Deep learning uses neural networks with multiple layers. These networks discover complex patterns automatically.

Natural language processing combines ML with linguistics. It helps computers understand human language effectively.
"""

# Apply chunking with smaller size to demonstrate splitting
demo_splitter = RecursiveCharacterTextSplitter(
chunk_size=150, # Small size to force splitting
chunk_overlap=30,
separators=["\n\n", "\n", ". ", " ", ""], # Split hierarchy
)

sample_chunks = demo_splitter.split_text(sample_text.strip())

print(f"Original: {len(sample_text.strip())} chars → {len(sample_chunks)} chunks")

# Show chunks
for i, chunk in enumerate(sample_chunks):
print(f"Chunk {i+1}: {chunk}")

Output:
Original: 336 chars → 3 chunks
Chunk 1: Machine learning transforms data processing. It enables pattern recognition without explicit programming.
Chunk 2: Deep learning uses neural networks with multiple layers. These networks discover complex patterns automatically.
Chunk 3: Natural language processing combines ML with linguistics. It helps computers understand human language effectively.

Notice how the text splitter:

Split the 336-character text into 3 chunks, each under the 150-character limit
Applied 30-character overlap between adjacent chunks
Separators prioritize semantic boundaries: paragraphs (\n\n) → sentences (.) → words () → characters

Processing Multiple Documents at Scale
Now let’s a text splitter with larger chunks and apply it to all our converted documents:
# Configure the text splitter with Q&A-optimized settings
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=600, # Optimal chunk size for Q&A scenarios
chunk_overlap=120, # 20% overlap to preserve context
separators=["\n\n", "\n", ". ", " ", ""] # Split hierarchy
)

Next, use the text splitter to process all our documents:
def process_document(doc, text_splitter):
"""Process a single document into chunks."""
doc_chunks = text_splitter.split_text(doc["content"])
return [{"content": chunk, "source": doc["source"]} for chunk in doc_chunks]

# Process all documents and create chunks
all_chunks = []
for doc in documents:
doc_chunks = process_document(doc, text_splitter)
all_chunks.extend(doc_chunks)

Examine how the chunking process distributed content across our documents:
from collections import Counter

source_counts = Counter(chunk["source"] for chunk in all_chunks)
chunk_lengths = [len(chunk["content"]) for chunk in all_chunks]

print(f"Total chunks created: {len(all_chunks)}")
print(f"Chunk length: {min(chunk_lengths)}-{max(chunk_lengths)} characters")
print(f"Source document: {Path(documents[0]['source']).name}")

Output:
Total chunks created: 1007
Chunk length: 68-598 characters
Source document: think_python_guide.pdf

Our text chunks are ready. Next, we’ll transform them into a format that enables intelligent similarity search.
Creating Searchable Embeddings with SentenceTransformers
RAG systems need to understand text meaning, not just match keywords. SentenceTransformers converts your text into numerical vectors that capture semantic relationships, allowing the system to find truly relevant information even when exact words don’t match.
Generate Embeddings
Let’s generate embeddings for our text chunks:
from sentence_transformers import SentenceTransformer

# Load Q&A-optimized embedding model (downloads automatically on first use)
model = SentenceTransformer('multi-qa-mpnet-base-dot-v1')

# Extract documents and create embeddings
documents = [chunk["content"] for chunk in all_chunks]
embeddings = model.encode(documents)

print(f"Embedding generation results:")
print(f" – Embeddings shape: {embeddings.shape}")
print(f" – Vector dimensions: {embeddings.shape[1]}")

In this code:

SentenceTransformer() loads the Q&A-optimized model that converts text to 768-dimensional vectors
multi-qa-mpnet-base-dot-v1 is specifically trained on 215M question-answer pairs for superior Q&A performance
model.encode() transforms all text chunks into numerical embeddings in a single batch operation

The output shows 1007 chunks converted to 768-dimensional vectors:
Embedding generation results:
– Embeddings shape: (1007, 768)
– Vector dimensions: 768

Test Semantic Similarity
Let’s test semantic similarity by querying for Python programming concepts:
# Test how one query finds relevant Python programming content
from sentence_transformers import util

query = "How do you define functions in Python?"
document_chunks = [
"Variables store data values that can be used later in your program.",
"A function is a block of code that performs a specific task when called.",
"Loops allow you to repeat code multiple times efficiently.",
"Functions can accept parameters and return values to the calling code."
]

# Encode query and documents
query_embedding = model.encode(query)
doc_embeddings = model.encode(document_chunks)

Now we’ll calculate similarity scores and rank the results. The util.cos_sim() function computes cosine similarity between vectors, returning values from 0 (no similarity) to 1 (identical meaning):
# Calculate similarities using SentenceTransformers util
similarities = util.cos_sim(query_embedding, doc_embeddings)[0]

# Create ranked results
ranked_results = sorted(
zip(document_chunks, similarities),
key=lambda x: x[1],
reverse=True
)

print(f"Query: '{query}'")
print("Document chunks ranked by relevance:")
for i, (chunk, score) in enumerate(ranked_results, 1):
print(f"{i}. ({score:.3f}): '{chunk}'")

Output:
Query: 'How do you define functions in Python?'
Document chunks ranked by relevance:
1. (0.674): 'A function is a block of code that performs a specific task when called.'
2. (0.607): 'Functions can accept parameters and return values to the calling code.'
3. (0.461): 'Loops allow you to repeat code multiple times efficiently.'
4. (0.448): 'Variables store data values that can be used later in your program.'

The similarity scores demonstrate semantic understanding: function-related chunks achieve high scores (0.7+) while unrelated programming concepts score much lower (0.2-).
Building Your Knowledge Database with ChromaDB
These embeddings demonstrate semantic search capability, but memory storage has scalability limitations. Large vector collections quickly exhaust system resources.
Vector databases provide essential production capabilities:

Persistent storage: Data survives system restarts and crashes
Optimized indexing: Fast similarity search using HNSW algorithms
Memory efficiency: Handles millions of vectors without RAM exhaustion
Concurrent access: Multiple users query simultaneously
Metadata filtering: Search by document properties and attributes

ChromaDB delivers these features with a Python-native API that integrates seamlessly into your existing data pipeline.
Initialize Vector Database
First, we’ll set up the ChromaDB client and create a collection to store our document vectors.
import chromadb

# Create persistent client for data storage
client = chromadb.PersistentClient(path="./chroma_db")

# Create collection for business documents (or get existing)
collection = client.get_or_create_collection(
name="python_guide",
metadata={"description": "Python programming guide"}
)

print(f"Created collection: {collection.name}")
print(f"Collection ID: {collection.id}")

Created collection: python_guide
Collection ID: 42d23900-6c2a-47b0-8253-0a9b6dad4f41

In this code:

PersistentClient(path="./chroma_db") creates a local vector database that persists data to disk
get_or_create_collection() creates a new collection or returns an existing one with the same name

Store Documents with Metadata
Now we’ll store our document chunks with basic metadata in ChromaDB with the add() method.
# Prepare metadata and add documents to collection
metadatas = [{"document": Path(chunk["source"]).name} for chunk in all_chunks]

collection.add(
documents=documents,
embeddings=embeddings.tolist(), # Convert numpy array to list
metadatas=metadatas, # Metadata for each document
ids=[f"doc_{i}" for i in range(len(documents))], # Unique identifiers for each document
)

print(f"Collection count: {collection.count()}")

Output:
Collection count: 1007

The database now contains 1007 searchable document chunks with their vector embeddings. ChromaDB persists this data to disk, enabling instant queries without reprocessing documents on restart.
Query the Knowledge Base
Let’s search the vector database using natural language questions and retrieve relevant document chunks.
def format_query_results(question, query_embedding, documents, metadatas):
"""Format and print the search results with similarity scores"""
from sentence_transformers import util

print(f"Question: {question}\n")

for i, doc in enumerate(documents):
# Calculate accurate similarity using sentence-transformers util
doc_embedding = model.encode([doc])
similarity = util.cos_sim(query_embedding, doc_embedding)[0][0].item()
source = metadatas[i].get("document", "Unknown")

print(f"Result {i+1} (similarity: {similarity:.3f}):")
print(f"Document: {source}")
print(f"Content: {doc[:300]}…")
print()

def query_knowledge_base(question, n_results=2):
"""Query the knowledge base with natural language"""
# Encode the query using our SentenceTransformer model
query_embedding = model.encode([question])

results = collection.query(
query_embeddings=query_embedding.tolist(),
n_results=n_results,
include=["documents", "metadatas", "distances"],
)

# Extract results and format them
documents = results["documents"][0]
metadatas = results["metadatas"][0]

format_query_results(question, query_embedding, documents, metadatas)

In this code:

collection.query() performs vector similarity search using the question text as input
query_texts accepts a list of natural language questions for batch processing
n_results limits the number of most similar documents returned
include specifies which data to return: document text, metadata, and similarity distances

Let’s test the query function with a question:
query_knowledge_base("How do if-else statements work in Python?")

Output:
Question: How do if-else statements work in Python?

Result 1 (similarity: 0.636):
Document: think_python_guide.pdf
Content: 5.6 Chained conditionals

Sometimes there are more than two possibilities and we need more than two branches.
One way to express a computation like that is a chained conditional:

if x < y:
print

’

elif x > y:
’

x is less than y

’

x is greater than y

’

else:

’

x and y are equa…

Result 2 (similarity: 0.605):
Document: think_python_guide.pdf
Content: 5. An unclosed opening operator ((, {, or [) makes Python continue with the next line
as part of the current statement. Generally, an error occurs almost immediately in the
next line.

6. Check for the classic = instead of == inside a conditional.

7. Check the indentation to make sure it lines up the…

The search finds relevant content with strong similarity scores (0.636 and 0.605).
Enhanced Answer Generation with Open-Source LLMs
Vector similarity search retrieves related content, but the results may be scattered across multiple chunks without forming a complete answer.
LLMs solve this by weaving retrieved context into unified responses that directly address user questions.
In this section, we’ll integrate Ollama‘s local LLMs with our vector search to generate coherent answers from retrieved chunks.
Answer Generation Implementation
First, set up the components for LLM-powered answer generation:
from langchain_ollama import OllamaLLM
from langchain.prompts import PromptTemplate

# Initialize the local LLM
llm = OllamaLLM(model="llama3.2:latest", temperature=0.1)

Next, create a focused prompt template for technical documentation queries:
prompt_template = PromptTemplate(
input_variables=["context", "question"],
template="""You are a Python programming expert. Based on the provided documentation, answer the question clearly and accurately.

Documentation:
{context}

Question: {question}

Answer (be specific about syntax, keywords, and provide examples when helpful):"""
)

# Create the processing chain
chain = prompt_template | llm

Create a function to retrieve relevant context given a question:
def retrieve_context(question, n_results=5):
"""Retrieve relevant context using embeddings"""
query_embedding = model.encode([question])
results = collection.query(
query_embeddings=query_embedding.tolist(),
n_results=n_results,
include=["documents", "metadatas", "distances"],
)

documents = results["documents"][0]
context = "\n\n—SECTION—\n\n".join(documents)
return context, documents

def get_llm_answer(question, context):
"""Generate answer using retrieved context"""
answer = chain.invoke(
{
"context": context[:2000],
"question": question,
}
)
return answer

def format_response(question, answer, source_chunks):
"""Format the final response with sources"""
response = f"**Question:** {question}\n\n"
response += f"**Answer:** {answer}\n\n"
response += "**Sources:**\n"

for i, chunk in enumerate(source_chunks[:3], 1):
preview = chunk[:100].replace("\n", " ") + "…"
response += f"{i}. {preview}\n"

return response

def enhanced_query_with_llm(question, n_results=5):
"""Query function combining retrieval with LLM generation"""
context, documents = retrieve_context(question, n_results)
answer = get_llm_answer(question, context)
return format_response(question, answer, documents)

Testing Enhanced Answer Generation
Let’s test the enhanced system with our challenging question:
# Test the enhanced query system
enhanced_response = enhanced_query_with_llm("How do if-else statements work in Python?")
print(enhanced_response)

Output:
**Question:** How do if-else statements work in Python?

**Answer:** If-else statements in Python are used for conditional execution of code. Here's a breakdown of how they work:

**Syntax**

The basic syntax of an if-else statement is as follows:
“`text
if condition:
# code to execute if condition is true
elif condition2:
# code to execute if condition1 is false and condition2 is true
else:
# code to execute if both conditions are false
“`text
**Keywords**

The keywords used in an if-else statement are:

* `if`: used to check a condition
* `elif` (short for "else if"): used to check another condition if the first one is false
* `else`: used to specify code to execute if all conditions are false

**How it works**

Here's how an if-else statement works:

1. The interpreter evaluates the condition inside the `if` block.
2. If the condition is true, the code inside the `if` block is executed.
3. If the condition is false, the interpreter moves on to the next line and checks the condition in the `elif` block.
4. If the condition in the `elif` block is true, the code inside that block is executed.
5. If both conditions are false, the interpreter executes the code inside the `else` block.

**Sources:**
1. 5.6 Chained conditionals Sometimes there are more than two possibilities and we need more than two …
2. 5. An unclosed opening operator ((, {, or [) makes Python continue with the next line as part of the c…
3. if x == y: print else: ’ x and y are equal ’ if x < y: 44 Chapter 5. Conditionals and recur…

Notice how the LLM organizes multiple chunks into logical sections with syntax examples and step-by-step explanations. This transformation turns raw retrieval into actionable programming guidance.
Streaming Interface Implementation
Users now expect the real-time streaming experience from ChatGPT and Claude. Static responses that appear all at once feel outdated and create an impression of poor performance.
Token-by-token streaming bridges this gap by creating the familiar typing effect that signals active processing.
To implement a streaming interface, we’ll use the chain.stream() method to generate tokens one at a time.
def stream_llm_answer(question, context):
"""Stream LLM answer generation token by token"""
for chunk in chain.stream({
"context": context[:2000],
"question": question,
}):
yield getattr(chunk, "content", str(chunk))

Let’s see how streaming works by combining our modular functions:
import time

# Test the streaming functionality
question = "What are Python loops?"
context, documents = retrieve_context(question, n_results=3)

print("Question:", question)
print("Answer: ", end="", flush=True)

# Stream the answer token by token
for token in stream_llm_answer(question, context):
print(token, end="", flush=True)
time.sleep(0.05) # Simulate real-time typing effect

Output:
Question: What are Python loops?
Answer: Python → loops → are → structures → that → repeat → code…

[Each token appears with typing animation]
Final: "Python loops are structures that repeat code blocks."

This creates the familiar ChatGPT-style typing animation where tokens appear progressively.
Building a Simple Application with Gradio
Now that we have a complete RAG system with enhanced answer generation, let’s make it accessible through a web interface.
Your RAG system needs an intuitive interface that non-technical users can access easily. Gradio provides this solution with:

Zero web development required: Create interfaces directly from Python functions
Automatic UI generation: Input fields and buttons generated automatically
Instant deployment: Launch web apps with a single line of code

Interface Function
Let’s create the complete Gradio interface that combines the functions we’ve built into a streaming RAG system:
import gradio as gr

def rag_interface(question):
"""Gradio interface reusing existing format_response function"""
if not question.strip():
yield "Please enter a question."
return

# Use modular retrieval and streaming
context, documents = retrieve_context(question, n_results=5)

response_start = f"**Question:** {question}\n\n**Answer:** "
answer = ""

# Stream the answer progressively
for token in stream_llm_answer(question, context):
answer += token
yield response_start + answer

# Use existing formatting function for final response
yield format_response(question, answer, documents)

Application Setup and Launch
Now, we’ll configure the Gradio web interface with sample questions and launch the application for user access.
# Create Gradio interface with streaming support
demo = gr.Interface(
fn=rag_interface,
inputs=gr.Textbox(
label="Ask a question about Python programming",
placeholder="How do if-else statements work in Python?",
lines=2,
),
outputs=gr.Markdown(label="Answer"),
title="Intelligent Document Q&A System",
description="Ask questions about Python programming concepts and get instant answers with source citations.",
examples=[
"How do if-else statements work in Python?",
"What are the different types of loops in Python?",
"How do you handle errors in Python?",
],
allow_flagging="never",
)

# Launch the interface with queue enabled for streaming
if __name__ == "__main__":
demo.queue().launch(share=True)

In this code:

gr.Interface() creates a clean web application with automatic UI generation
fn specifies the function called when users submit questions (includes streaming output)
inputs/outputs define UI components (textbox for questions, markdown for formatted answers)
examples provides clickable sample questions that demonstrate system capabilities
demo.queue().launch(share=True) enables streaming output and creates both local and public URLs

Running the application produces the following output:
* Running on local URL: http://127.0.0.1:7861
* Running on public URL: https://bb9a9fc06531d49927.gradio.live

Test the interface locally or share the public URL to demonstrate your RAG system’s capabilities.

The public URL expires in 72 hours. For persistent access, deploy to Hugging Face Spaces:
gradio deploy

You now have a complete, streaming-enabled RAG system ready for production use with real-time token generation and source citations.
Conclusion
In this article, we’ve built a complete RAG pipeline that turns your documents into an AI-powered question-answering system.
We’ve used the following tools:

MarkItDown for document conversion
LangChain for text chunking and embedding generation
ChromaDB for vector storage
Ollama for local LLM inference
Gradio for web interface

Since all of these tools are open-source, you can easily deploy this system in your own infrastructure.

📚 For comprehensive production deployment practices including configuration management, logging, and data validation, check out Production-Ready Data Science.

The best way to learn is to build, so go ahead and try it out!
Related Tutorials

Alternative Vector Database: Implement Semantic Search in Postgres Using pgvector and Ollama for PostgreSQL-based vector storage
Advanced Document Processing: Transform Any PDF into Searchable AI Data with Docling for specialized PDF parsing and RAG optimization
LangChain Fundamentals: Run Private AI Workflows with LangChain and Ollama for comprehensive LangChain and Ollama integration guide

📚 Want to go deeper? Learning new techniques is the easy part. Knowing how to structure, test, and deploy them is what separates side projects from real work. My book shows you how to build data science projects that actually make it to production. Get the book →

Stay Current with CodeCut
Actionable Python tips, curated for busy data pros. Skim in under 2 minutes, three times a week.

.codecut-subscribe-form {
max-width: 650px;
display: flex;
flex-direction: column;
gap: 8px;
}

input[type=”email”].codecut-input {
border-radius: 8px !important;
}

.codecut-input::placeholder {
color: #666666;
}

.codecut-email-row {
display: flex;
align-items: stretch;
height: 36px;
gap: 8px;
}

.codecut-email-row .codecut-input {
flex: 1;
}

.codecut-subscribe-btn:hover {
background: #5aa8e8;
}

.codecut-subscribe-btn:disabled {
background: #999;
cursor: not-allowed;
}

.codecut-message {
font-family: ‘Comfortaa’, sans-serif;
font-size: 12px;
padding: 8px;
border-radius: 6px;
display: none;
}

.codecut-message.success {
background: #d4edda;
color: #155724;
display: block;
}

/* Mobile responsive */
@media (max-width: 480px) {
.codecut-email-row {
flex-direction: column;
height: auto;
gap: 8px;
}

.codecut-input {
border-radius: 8px;
height: 36px;
}

.codecut-subscribe-btn {
width: 100%;
text-align: center;
border-radius: 8px;
height: 36px;
}
}

Build a Complete RAG System with 5 Open-Source Tools Read More »

Newsletter #210: MarkItDown: Convert PDFs to Clean Markdown in 3 Lines

Leave a Comment / Newsletter Archive / Khuyen Tran

🤝 COLLABORATION

This covers practical patterns for building production-ready workflows that handle failures gracefully and scale with your data infrastructure needs.
Speakers:

Kenten Danas – Senior Manager, Developer Relations at Astronomer
Tamara Fingerlin – Developer Advocate at Astronomer

→ Register here

📅 Today’s Picks

Feature Engineering Without Complex Nested Loops

Problem
Nested loops for sequence permutations create exponential complexity that becomes unmanageable as data grows.
Solution
The itertools.permutations() function automatically generates all ordered arrangements of items from your sequences.
Perfect for generating interaction features that preserve temporal or logical ordering in your feature set.

📖 View Full Article

MarkItDown: Convert PDFs to Clean Markdown in 3 Lines

Problem
Have you ever wanted to convert PDFs to text for analysis and search but find it hard to do so?
While there are many tools to convert PDFs to text, they often lose structure and readability.
Solution
Microsoft MarkItDown preserves document structure while converting PDFs to clean markdown format.
The library handles multiple file types and maintains formatting hierarchy:

Clean markdown output with preserved headers and structure
Support for PDFs, Word docs, PowerPoint, and Excel files
Simple three-line implementation for any document type
Seamless integration with existing RAG pipelines

📖 View Full Article

⭐ View GitHub

☕️ Weekly Finds

scalene
[Performance & Profiling]
– A high-performance, high-precision CPU, GPU, and memory profiler for Python with AI-powered optimization proposals

bandit
[Security & Code Quality]
– A tool designed to find common security issues in Python code through static code analysis

river
[Machine Learning]
– Online machine learning in Python – enabling incremental learning algorithms for streaming data

Looking for a specific tool? Explore 70+ Python tools →

Stay Current with CodeCut

Actionable Python tips, curated for busy data pros. Skim in under 2 minutes, three times a week.

Newsletter #210: MarkItDown: Convert PDFs to Clean Markdown in 3 Lines Read More »

Newsletter #209: Transform PDFs to Pandas with Docling’s Complete Pipeline

Leave a Comment / Newsletter Archive / Khuyen Tran

🤝 COLLABORATION

Learn ML Engineering for Free on ML Zoomcamp
Learn ML engineering for free on ML Zoomcamp and receive a certificate! Join online for practical, hands-on experience with the tech stack and workflows used in production ML. The next cohort of the course starts on September 15, 2025. Here’s what you’ll learn:
Core foundations:

Python ecosystem: Jupyter, NumPy, Pandas, Matplotlib, Seaborn
ML frameworks: Scikit-learn, TensorFlow, Keras

Applied projects:

Supervised learning with CRISP-DM framework
Classification/regression with evaluation metrics
Advanced models: decision trees, ensembles, neural nets, CNNs

Production deployment:

APIs and containers: Flask, Docker, Kubernetes
Cloud solutions: AWS Lambda, TensorFlow Serving/Lite

→ Register here

📅 Today’s Picks

Transform PDFs to Pandas with Docling’s Complete Pipeline

Problem
Most PDF processing tools force you to stitch together multiple solutions – one for extraction, another for parsing, and yet another for chunking.
Each step introduces potential data loss and format incompatibilities, making document processing complex and error-prone.
Solution
Docling handles the entire workflow from raw PDFs to structured, searchable content in a single solution.
Key features:

Universal format support for PDF, DOCX, PPTX, HTML, and images
AI-powered extraction with TableFormer and Vision models
Direct export to pandas DataFrames, JSON, and Markdown
RAG-ready output maintains context and structure

📖 View Full Article

☕️ Weekly Finds

semantic-kernel
[AI Orchestration]
– Model-agnostic SDK that empowers developers to build, orchestrate, and deploy AI agents and multi-agent systems with enterprise-grade reliability.

transformers
[Machine Learning]
– The model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.

whisper
[Speech Recognition]
– Robust Speech Recognition via Large-Scale Weak Supervision. A multitasking model for multilingual speech recognition, translation, and language identification.

Looking for a specific tool? Explore 70+ Python tools →

Stay Current with CodeCut

Actionable Python tips, curated for busy data pros. Skim in under 2 minutes, three times a week.

Newsletter #209: Transform PDFs to Pandas with Docling’s Complete Pipeline Read More »

Newsletter #208: Stop Loading Full Datasets: Use itertools.islice() for Smart Sampling

Leave a Comment / Newsletter Archive / Khuyen Tran

🤝 COLLABORATION

Python ecosystem: Jupyter, NumPy, Pandas, Matplotlib, Seaborn
ML frameworks: Scikit-learn, TensorFlow, Keras

Applied projects:

Supervised learning with CRISP-DM framework
Classification/regression with evaluation metrics
Advanced models: decision trees, ensembles, neural nets, CNNs

Production deployment:

APIs and containers: Flask, Docker, Kubernetes
Cloud solutions: AWS Lambda, TensorFlow Serving/Lite

→ Register here

📅 Today’s Picks

Stop Loading Full Datasets: Use itertools.islice() for Smart Sampling

Problem
Data prototyping typically requires loading entire datasets into memory first before sampling.
A 1-million-row dataset consumes 7.6 MB of memory even when you only need 10 rows for initial feature exploration, creating unnecessary resource overhead.
Solution
Use itertools.islice() to extract slices from iterators without loading full datasets into memory first.
Key benefits:

Memory-efficient data sampling
Faster prototyping workflows
Less computational load on laptops

📖 View Full Article

From pandas Full Reloads to Delta Lake Incremental Updates

Problem
Processing entire datasets when you only need to add a few new records wastes time and memory.
Pandas lacks incremental append capabilities, requiring full dataset reload for data updates.
Solution
Delta Lake’s append mode processes only new data without touching existing records.
Key advantages:

Append new records without full dataset reload
Memory usage scales with new data size, not total dataset size
Automatic data protection prevents corruption during updates
Time travel enables rollback to previous dataset versions

Perfect for production data pipelines that need reliable incremental updates.

📖 View Full Article

⭐ View GitHub

☕️ Weekly Finds

Semantic Kernel
[AI Framework]
– Model-agnostic SDK that empowers developers to build, orchestrate, and deploy AI agents and multi-agent systems with enterprise-grade reliability

Ray
[Distributed Computing]
– AI compute engine with core distributed runtime and AI Libraries for accelerating ML workloads from laptop to cluster

Apache Airflow
[Workflow Orchestration]
– Platform for developing, scheduling, and monitoring workflows with powerful data pipeline orchestration capabilities

Looking for a specific tool? Explore 70+ Python tools →

Stay Current with CodeCut

Actionable Python tips, curated for busy data pros. Skim in under 2 minutes, three times a week.

Newsletter #208: Stop Loading Full Datasets: Use itertools.islice() for Smart Sampling Read More »

Delta Lake: Transform pandas Prototypes into Production

Leave a Comment / Blog, Data Engineer / Khuyen Tran

Table of Contents

Introduction
Introduction to Delta-rs
Setup and Data Preparation
Creating Your First Delta Table
Incremental Updates and CRUD Operations
Time Travel and Data Versioning
Schema Evolution in Action
Selective Updates with Merge Operations
Multi-Engine Integration
Automatic File Cleanup
Conclusion

Introduction
Data scientists face a familiar challenge: pandas works perfectly for prototyping, but production requires enterprise features that traditional file formats can’t provide.
Delta-rs solves this by bringing Delta Lake’s ACID transactions, time travel, and schema evolution to Python without Spark dependencies. It transforms your pandas workflow into production-ready pipelines with minimal code changes.
This tutorial shows you how to build scalable data systems using Delta-rs while maintaining the simplicity that makes pandas so effective.

💻 Get the Code: The complete source code and Jupyter notebook for this tutorial are available on GitHub. Clone it to follow along!

Stay Current with CodeCut
Actionable Python tips, curated for busy data pros. Skim in under 2 minutes, three times a week.

.codecut-subscribe-form {
max-width: 650px;
display: flex;
flex-direction: column;
gap: 8px;
}

input[type=”email”].codecut-input {
border-radius: 8px !important;
}

.codecut-input::placeholder {
color: #666666;
}

.codecut-email-row {
display: flex;
align-items: stretch;
height: 36px;
gap: 8px;
}

.codecut-email-row .codecut-input {
flex: 1;
}

.codecut-subscribe-btn:hover {
background: #5aa8e8;
}

.codecut-subscribe-btn:disabled {
background: #999;
cursor: not-allowed;
}

.codecut-message {
font-family: ‘Comfortaa’, sans-serif;
font-size: 12px;
padding: 8px;
border-radius: 6px;
display: none;
}

.codecut-message.success {
background: #d4edda;
color: #155724;
display: block;
}

/* Mobile responsive */
@media (max-width: 480px) {
.codecut-email-row {
flex-direction: column;
height: auto;
gap: 8px;
}

.codecut-input {
border-radius: 8px;
height: 36px;
}

.codecut-subscribe-btn {
width: 100%;
text-align: center;
border-radius: 8px;
height: 36px;
}
}

Introduction to Delta-rs
Delta-rs is a native Rust implementation of Delta Lake for Python. It provides enterprise-grade data lake capabilities without requiring Spark clusters or JVM setup.
Key advantages over traditional file formats:

ACID transactions ensure data consistency during concurrent operations
Time travel enables access to historical data versions
Schema evolution handles data structure changes automatically
Multi-engine support works with pandas, DuckDB, Polars, and more
Efficient updates support upserts and incremental changes without full rewrites

Setup and Data Preparation
Install Delta-rs and supporting libraries:
pip install deltalake pandas duckdb polars

We’ll use actual NYC Yellow Taxi data to demonstrate real-world scenarios. The NYC Taxi & Limousine Commission provides monthly trip records in Parquet format:
import pandas as pd
from deltalake import DeltaTable, write_deltalake
import duckdb
import polars as pl

# Download NYC Yellow Taxi data (June 2024 as example)
# Full dataset available at: https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
taxi_url = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-06.parquet"

# Read a sample of the data for demonstration
sample_data = pd.read_parquet(taxi_url).head(10000)

print(f"Loaded {len(sample_data)} taxi trips from NYC TLC")
print(f"Data shape: {sample_data.shape}")
print(f"Date range: {sample_data['tpep_pickup_datetime'].min()} to {sample_data['tpep_pickup_datetime'].max()}")

sample_data.head()

Output:
Loaded 10000 taxi trips from NYC TLC
Data shape: (10000, 19)
Date range: 2024-05-31 15:33:34 to 2024-06-01 02:59:54
VendorID tpep_pickup_datetime … congestion_surcharge Airport_fee
0 1 2024-06-01 00:03:46 … 0.0 1.75
1 2 2024-06-01 00:55:22 … 0.0 1.75
2 1 2024-06-01 00:23:53 … 0.0 0.00
3 1 2024-06-01 00:32:24 … 2.5 0.00
4 1 2024-06-01 00:51:38 … 2.5 0.00

[5 rows x 19 columns]

Creating Your First Delta Table
Create your first Delta table in the data directory:
write_deltalake("data/taxi_delta_table", sample_data, mode="overwrite")
print("Created Delta table")

# Read back from Delta table
dt = DeltaTable("data/taxi_delta_table")
df_from_delta = dt.to_pandas()

print(f"Delta table contains {len(df_from_delta)} records")

Output:
Created Delta table
Delta table contains 10000 records

View the Delta table structure:
# Inspect Delta table metadata
print("Delta table schema:")
print(dt.schema().to_arrow())

Output:
Delta table schema:
arro3.core.Schema
————
VendorID: Int32
tpep_pickup_datetime: Timestamp(Microsecond, None)
tpep_dropoff_datetime: Timestamp(Microsecond, None)
passenger_count: Float64
trip_distance: Float64
…
total_amount: Float64
congestion_surcharge: Float64
Airport_fee: Float64

View the current version of the Delta table:
print(f"Current version: {dt.version()}")

Output:
Current version: 0
“`text
## Incremental Updates and CRUD Operations {#incremental-updates-and-crud-operations}

Instead of rewriting entire datasets when adding new records, incremental updates append only what changed. Delta-rs handles these efficient operations natively.

To demonstrate this, we'll simulate late-arriving data:

“`python
# Simulate late-arriving data
late_data = pd.read_parquet(taxi_url).iloc[10000:10050]
print(f"New data to add: {len(late_data)} records")

Output:
New data to add: 50 records

Traditional Approach: Process Everything
The pandas workflow requires loading both existing and new data, combining them, and rewriting the entire output file:
# Pandas approach – reload existing data and merge
existing_df = pd.read_parquet(taxi_url).head(10000)
complete_df = pd.concat([existing_df, late_data])
complete_df.to_parquet("data/taxi_complete.parquet")
print(f"Processed {len(complete_df)} total records")

Output:
Processed 10050 total records

Pandas processed all 10,050 records to add just 50 new ones, demonstrating the inefficiency of full-dataset operations.
Delta-rs Approach: Process Only New Data
Delta-rs appends only the new records without touching existing data:
# Delta-rs – append only what's new
write_deltalake("data/taxi_delta_table", late_data, mode="append")

dt = DeltaTable("data/taxi_delta_table")
print(f"Added {len(late_data)} new records")
print(f"Table version: {dt.version()}")

Output:
Added 50 new records
Table version: 1

Delta-rs processed only the 50 new records while automatically incrementing to version 1, enabling efficient operations and data lineage.
Time Travel and Data Versioning
Time travel and data versioning let you access any previous state of your data. This is essential for auditing changes, recovering from errors, and understanding how data evolved over time without maintaining separate backup files.
Traditional Approach: Manual Backup Strategy
Traditional file-based workflows rely on timestamped copies and manual versioning:
# Traditional pproach – manual timestamped backups
import datetime
timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
df.to_parquet(f"data/taxi_backup_{timestamp}.parquet") # Create manual backup
df_modified.to_parquet("data/taxi_data.parquet") # Overwrite original
# To recover: manually identify and reload backup file

Delta-rs Approach: Built-in Time Travel
Delta-rs automatically tracks every change with instant access to any version:
# Access any historical version instantly
dt_v0 = DeltaTable("data/taxi_delta_table", version=0)
current_dt = DeltaTable("data/taxi_delta_table")

print(f"Version 0: {len(dt_v0.to_pandas())} records")
print(f"Current version: {len(current_dt.to_pandas())} records")
print(f"Available versions: {current_dt.version() + 1}")

Output:
Version 0: 10000 records
Current version: 10050 records
Available versions: 2

Delta-rs maintains 2 complete versions while traditional backups would require separate 57MB files for each timestamp.

📚 For comprehensive production data workflows and version control best practices, check out Production-Ready Data Science.

Schema Evolution in Action
As requirements evolve, you often need to add new columns or change data types. Schema evolution handles these changes automatically, letting you update your data structure without breaking existing queries or reprocessing historical records.
To demonstrate this, imagine NYC’s taxi authority introduces weather tracking and surge pricing features, requiring your pipeline to handle new weather_condition and surge_multiplier columns alongside existing fare data.
# Copy the existing data
enhanced_data = pd.read_parquet(taxi_url).iloc[20000:20100].copy()

# Simulate new data with additional business columns
weather_options = ['clear', 'rain', 'snow', 'cloudy']
surge_options = [1.0, 1.2, 1.5, 2.0]
enhanced_data['weather_condition'] = [weather_options[i % 4] for i in range(len(enhanced_data))]
enhanced_data['surge_multiplier'] = [surge_options[i % 4] for i in range(len(enhanced_data))]

print(f"Enhanced data: {len(enhanced_data)} records with {len(enhanced_data.columns)} columns")
print(f"New columns: {[col for col in enhanced_data.columns if col not in sample_data.columns]}")

Output:
Enhanced data: 100 records with 21 columns
New columns: ['weather_condition', 'surge_multiplier']

Traditional Approach: No Schema History
Traditional formats provide no tracking of schema changes or evolution history:
# Traditional approach – no schema versioning or history
df_v1 = pd.read_parquet("taxi_v1.parquet") # Original schema
df_v2 = pd.read_parquet("taxi_v2.parquet") # Enhanced schema

Delta-rs Approach: Schema Versioning and History
Delta-rs automatically merges schemas while tracking every change:
# Schema evolution with automatic versioning
write_deltalake(
"data/taxi_delta_table",
enhanced_data,
mode="append",
schema_mode="merge"
)

dt = DeltaTable("data/taxi_delta_table")
print(f"Schema evolved: {len(dt.to_pandas().columns)} columns | Version: {dt.version()}")

Output:
Schema evolved: 21 columns | Version: 2

Explore the complete schema evolution history and access any previous version:
# View schema change history
history = dt.history()
for entry in history[:2]:
print(f"Version {entry['version']}: {entry['operation']} at {entry['timestamp']}")

# Access different schema versions
original_schema = DeltaTable("data/taxi_delta_table", version=0)
print(f"\nOriginal schema (v0): {len(original_schema.to_pandas().columns)} columns")
print(f"Current schema (v{dt.version()}): {len(dt.to_pandas().columns)} columns")

Output:
Version 2: WRITE at 1755180763083
Version 1: WRITE at 1755180762968

Original schema (v0): 19 columns
Current schema (v2): 21 columns

Delta-rs expanded from 19 to 21 columns across 10,150 records without schema migration scripts or pipeline failures.
Selective Updates with Merge Operations
Merge operations combine updates and inserts in a single transaction based on matching conditions. This eliminates the need to process entire datasets when you only need to modify specific records, dramatically improving efficiency at scale.
To demonstrate this, let’s create a simple taxi trips table:
# Create initial Delta table with 5 trips
trips = pd.DataFrame({
'trip_id': [1, 2, 3, 4, 5],
'fare_amount': [15.5, 20.0, 18.3, 12.5, 25.0],
'payment_type': [1, 1, 2, 1, 2]
})
write_deltalake("data/trips_merge_demo", trips, mode="overwrite")
print("Initial trips:")
print(trips)

Output:
Initial trips:
trip_id fare_amount payment_type
0 1 15.5 1
1 2 20.0 1
2 3 18.3 2
3 4 12.5 1
4 5 25.0 2

Here are the updates we want to make:

Update trip 2: change fare from $20.00 to $22.00
Update trip 4: change fare from $12.50 to $13.80
Insert trip 6: new trip with fare $30.00
Insert trip 7: new trip with fare $16.50

Traditional Approach: Full Dataset Processing
Traditional workflows require loading complete datasets, identifying matches, and rewriting all records. This process becomes increasingly expensive as data grows:
# Traditional approach – load, modify, and rewrite everything
existing_df = trips.copy()

# Updates: manually locate and modify rows
existing_df.loc[existing_df['trip_id'] == 2, 'fare_amount'] = 22.0
existing_df.loc[existing_df['trip_id'] == 4, 'fare_amount'] = 13.8

# Inserts: create new rows and append
new_trips = pd.DataFrame({
'trip_id': [6, 7],
'fare_amount': [30.0, 16.5],
'payment_type': [1, 1]
})
updated_df = pd.concat([existing_df, new_trips], ignore_index=True)

# Rewrite entire dataset
updated_df.to_parquet("data/trips_traditional.parquet")
print(updated_df)

Output:
trip_id fare_amount payment_type
0 1 15.5 1
1 2 22.0 1 # Updated
2 3 18.3 2
3 4 13.8 1 # Updated
4 5 25.0 2
5 6 30.0 1 # Inserted
6 7 16.5 1 # Inserted

Delta-rs Approach: Upsert with Merge Operations
Delta-rs merge operations handle both updates and inserts in a single atomic operation, processing only affected records:
# Prepare changes: 2 updates + 2 inserts
changes = pd.DataFrame({
'trip_id': [2, 4, 6, 7],
'fare_amount': [22.0, 13.8, 30.0, 16.5],
'payment_type': [2, 2, 1, 1]
})

# Load Delta table
dt = DeltaTable("data/trips_merge_demo")

# Upsert operation: update existing, insert new
(
dt.merge(
source=changes,
predicate="target.trip_id = source.trip_id",
source_alias="source",
target_alias="target",
)
.when_matched_update(
updates={
"fare_amount": "source.fare_amount",
"payment_type": "source.payment_type",
}
)
.when_not_matched_insert(
updates={
"trip_id": "source.trip_id",
"fare_amount": "source.fare_amount",
"payment_type": "source.payment_type",
}
)
.execute()
)

# Verify results
result = dt.to_pandas().sort_values('trip_id').reset_index(drop=True)
print(result)

Output:
trip_id fare_amount payment_type
0 1 15.5 1
1 2 22.0 2 # Updated
2 3 18.3 2
3 4 13.8 2 # Updated
4 5 25.0 2
5 6 30.0 1 # Inserted
6 7 16.5 1 # Inserted

Delta-rs processed exactly 4 records (2 updates + 2 inserts) while pandas processed all 7 records. This efficiency compounds dramatically with larger datasets.
Multi-Engine Integration
Different teams often use different tools: pandas for exploration, DuckDB for SQL queries, Polars for performance. Multi-engine support lets all these tools access the same data directly without creating duplicates or writing conversion scripts.
Traditional Approach: Engine-Specific Optimization Requirements
Each engine needs different file optimizations that don’t transfer between tools:
Start with the original dataset:
# Traditional approach – Each engine needs different optimizations
data = {"payment_type": [1, 1, 2, 1, 2], "fare_amount": [15.5, 20.0, 18.3, 12.5, 25.0]}
df = pd.DataFrame(data)

The Pandas team optimizes for indexed lookups:
# Pandas team needs indexed Parquet for fast lookups
df.to_parquet("data/pandas_optimized.parquet", index=True)
pandas_result = pd.read_parquet("data/pandas_optimized.parquet")
print(f"Pandas: {len(pandas_result)} trips, avg ${pandas_result['fare_amount'].mean():.2f}")

Output:
Pandas: 5 trips, avg $17.66

The Polars team needs sorted data for predicate pushdown optimization:
# Polars team needs sorted columns for predicate pushdown
df.sort_values('payment_type').to_parquet("data/polars_optimized.parquet")
polars_result = pl.read_parquet("data/polars_optimized.parquet").select([
pl.len().alias("trips"), pl.col("fare_amount").mean().alias("avg_fare")
])
print(f"Polars: {polars_result}")

Polars: shape: (1, 2)
┌───────┬──────────┐
│ trips ┆ avg_fare │
│ — ┆ — │
│ u32 ┆ f64 │
╞═══════╪══════════╡
│ 5 ┆ 18.26 │
└───────┴──────────┘

The DuckDB team requires specific compression for query performance:
# DuckDB needs specific compression/statistics for query planning
df.to_parquet("data/duckdb_optimized.parquet", compression='zstd')
duckdb_result = duckdb.execute("""
SELECT COUNT(*) as trips, ROUND(AVG(fare_amount), 2) as avg_fare
FROM 'data/duckdb_optimized.parquet'
""").fetchone()
print(f"DuckDB: {duckdb_result[0]} trips, ${duckdb_result[1]} avg")

Output:
DuckDB: 5 trips, $18.26 avg

Delta-rs Approach: Universal Optimizations
Delta-rs provides built-in optimizations that benefit all engines simultaneously:
Create one optimized Delta table that serves all engines:
# Delta-rs approach – Universal optimizations for all engines
from deltalake import write_deltalake, DeltaTable
import polars as pl
import duckdb

# Create Delta table with built-in optimizations:
data = {"payment_type": [1, 1, 2, 1, 2], "fare_amount": [15.5, 20.0, 18.3, 12.5, 25.0]}
write_deltalake("data/universal_demo", pd.DataFrame(data))

Pandas benefits from Delta’s statistics for efficient filtering:
# Pandas gets automatic optimization benefits
dt = DeltaTable("data/universal_demo")
pandas_result = dt.to_pandas()
print(f"Pandas: {len(pandas_result)} trips, avg ${pandas_result['fare_amount'].mean():.2f}")

Output:
Pandas: 5 trips, avg $17.66

Polars leverages Delta’s column statistics for predicate pushdown:
# Polars gets predicate pushdown optimization automatically
polars_result = pl.read_delta("data/universal_demo").select([
pl.len().alias("trips"),
pl.col("fare_amount").mean().alias("avg_fare")
])
print(f"Polars: {polars_result}")

Output:
Polars: shape: (1, 2)
┌───────┬──────────┐
│ trips ┆ avg_fare │
│ — ┆ — │
│ u32 ┆ f64 │
╞═══════╪══════════╡
│ 5 ┆ 18.26 │
└───────┴──────────┘

DuckDB uses Delta’s statistics for query planning optimization:
# DuckDB gets optimized query plans from Delta statistics
duckdb_result = duckdb.execute("""
SELECT COUNT(*) as trips, ROUND(AVG(fare_amount), 2) as avg_fare
FROM delta_scan('data/universal_demo')
""").fetchone()
print(f"DuckDB: {duckdb_result[0]} trips, ${duckdb_result[1]} avg")

Output:
DuckDB: 5 trips, $17.66

One Delta table with universal optimizations benefiting all engines.
Automatic File Cleanup
Every data update creates new files while keeping old versions for time travel. Vacuum identifies files older than your retention period and safely deletes them, freeing storage space without affecting active data or recent history.
Traditional Approach: Manual Cleanup Scripts
Traditional workflows require custom scripts to manage file cleanup:
# Traditional approach – manual file management
import os
import glob
from datetime import datetime, timedelta

# Find old backup files manually
old_files = []
cutoff_date = datetime.now() – timedelta(days=7)
for file in glob.glob("data/taxi_backup_*.parquet"):
file_time = datetime.fromtimestamp(os.path.getmtime(file))
if file_time < cutoff_date:
old_files.append(file)
os.remove(file) # Manual cleanup with risk

Delta-rs Approach: Built-in Vacuum Operation
Delta-rs provides safe, automated cleanup through its vacuum() operation, which removes unused transaction files while preserving data integrity. Files become unused when:
• UPDATE operations create new versions, leaving old data files unreferenced
• DELETE operations remove data, making those files obsolete
• Failed transactions leave temporary files that were never committed
• Table optimization consolidates small files, making originals unnecessary
# Delta-rs vacuum removes unused files safely with ACID protection
from deltalake import DeltaTable
import os

def get_size(path):
"""Calculate total directory size in MB"""
total_size = 0
for dirpath, dirnames, filenames in os.walk(path):
for filename in filenames:
total_size += os.path.getsize(os.path.join(dirpath, filename))
return total_size / (1024 * 1024)

With our size calculation helper in place, let’s measure storage before and after vacuum:
dt = DeltaTable("data/taxi_delta_table")

# Measure storage before cleanup
before_size = get_size("data/taxi_delta_table")

# Safe cleanup – files only deleted if no active readers/writers
dt.vacuum(retention_hours=168) # Built-in safety: won't delete files in use

# Measure storage after cleanup
after_size = get_size("data/taxi_delta_table")

print(f"Delta vacuum completed safely")
print(f"Storage before: {before_size:.1f} MB")
print(f"Storage after: {after_size:.1f} MB")
print(f"Space reclaimed: {before_size – after_size:.1f} MB")

Output:
Delta vacuum completed safely
Storage before: 8.2 MB
Storage after: 5.7 MB
Space reclaimed: 2.5 MB

Delta vacuum removed 2.5 MB of obsolete file versions, reducing storage footprint by 30% while maintaining ACID transaction guarantees and time travel capabilities.
Conclusion
Delta-rs transforms the traditional pandas workflow by providing:

Incremental updates append only changed records without full rewrites
Time travel and versioning enable recovery and auditing without manual backups
Schema evolution handles column changes without breaking queries
Merge operations combine updates and inserts in single transactions
Multi-engine support lets pandas, DuckDB, and Polars access the same data
Automatic vacuum reclaims storage by removing obsolete file versions

The bridge from pandas prototyping to production data pipelines no longer requires complex infrastructure. Delta-rs provides the reliability and performance you need while maintaining the simplicity you want.
Related Tutorials

Alternative Scaling: Scaling Pandas Workflows with PySpark’s Pandas API for Spark-based approaches
Data Versioning: Version Control for Data and Models Using DVC for broader versioning strategies
DataFrame Performance: Polars vs. Pandas: A Fast, Multi-Core Alternative for DataFrame optimization techniques

Stay Current with CodeCut
Actionable Python tips, curated for busy data pros. Skim in under 2 minutes, three times a week.

.codecut-subscribe-form {
max-width: 650px;
display: flex;
flex-direction: column;
gap: 8px;
}

input[type=”email”].codecut-input {
border-radius: 8px !important;
}

.codecut-input::placeholder {
color: #666666;
}

.codecut-email-row {
display: flex;
align-items: stretch;
height: 36px;
gap: 8px;
}

.codecut-email-row .codecut-input {
flex: 1;
}

.codecut-subscribe-btn:hover {
background: #5aa8e8;
}

.codecut-subscribe-btn:disabled {
background: #999;
cursor: not-allowed;
}

.codecut-message {
font-family: ‘Comfortaa’, sans-serif;
font-size: 12px;
padding: 8px;
border-radius: 6px;
display: none;
}

.codecut-message.success {
background: #d4edda;
color: #155724;
display: block;
}

/* Mobile responsive */
@media (max-width: 480px) {
.codecut-email-row {
flex-direction: column;
height: auto;
gap: 8px;
}

.codecut-input {
border-radius: 8px;
height: 36px;
}

.codecut-subscribe-btn {
width: 100%;
text-align: center;
border-radius: 8px;
height: 36px;
}
}

Delta Lake: Transform pandas Prototypes into Production Read More »

Newsletter #207: Build Automated Chart Analysis with Hugging Face SmolVLM

Leave a Comment / Newsletter Archive / Khuyen Tran

📅 Today’s Picks

Build Automated Chart Analysis with Hugging Face SmolVLM

Problem
Data teams spend hours manually analyzing charts and extracting insights from complex visualizations.
Manual chart analysis creates bottlenecks in decision-making workflows and reduces time available for strategic insights.
Solution
Hugging Face’s SmolVLM transforms this workflow by instantly generating insights, allowing analysts to focus on validation, strategic context, and decision-making rather than basic pattern recognition.
The complete workflow could look like this:

Automated chart interpretation using vision language models
Expert review and validation of AI findings
Strategic context addition by domain specialists

📖 View Full Article

⭐ View GitHub

Hydra Multi-run: Test All Parameters in One Command

Problem
When you run a Python script with different preprocessing strategies and hyperparameter combinations, waiting for each variation to complete before testing the next is time-consuming.
Solution
Hydra multi-run executes all parameter combinations in a single command, saving you time and effort.
Plus, Hydra offers:

YAML-based configuration management
Override parameters from the command line
Compose configs from multiple files
Environment-specific configuration switching

📖 View Full Article

⭐ View GitHub

☕️ Weekly Finds

Scrapegraph-ai
[Data Extraction]
– Python scraper based on AI

Marker
[Document Processing]
– Convert PDF to markdown quickly with high accuracy

EdgeDB
[Database]
– A graph-relational database with declarative schema, built-in migration system, and a next-generation query language

Looking for a specific tool? Explore 70+ Python tools →

Stay Current with CodeCut

Actionable Python tips, curated for busy data pros. Skim in under 2 minutes, three times a week.

Newsletter #207: Build Automated Chart Analysis with Hugging Face SmolVLM Read More »

Newsletter #206: Handle Messy Data with RapidFuzz Fuzzy Matching

Leave a Comment / Newsletter Archive / Khuyen Tran

📅 Today’s Picks

Handle Messy Data with RapidFuzz Fuzzy Matching

Problem
Traditional regex approaches require hours of preprocessing but still break with common data variations like missing spaces, typos, or inconsistent formatting.
Solution
RapidFuzz eliminates data cleaning overhead with intelligent fuzzy matching.
Key benefits:

Automatic handling of typos, spacing, and case variations
Production-ready C++ performance for large datasets
Full spectrum of fuzzy algorithms in one library

📖 View Full Article

⭐ View GitHub

Stay Current with CodeCut

Actionable Python tips, curated for busy data pros. Skim in under 2 minutes, three times a week.

Newsletter #206: Handle Messy Data with RapidFuzz Fuzzy Matching Read More »

Pytest for Data Scientists

2 Comments / Blog, MLOps / Khuyen Tran

As a data scientist, one way to test your Python code is by using an interactive notebook to verify the accuracy of the outputs. However, this approach does not guarantee that your code works as intended in all cases. A better approach is to identify the expected behavior of the code in various scenarios, and then verify if the code executes accordingly.

Pytest for Data Scientists Read More »

Newsletter #205: Build Debuggable Tests: One Assertion Per Function

Leave a Comment / Newsletter Archive / Khuyen Tran

🤝 COLLABORATION

Python ecosystem: Jupyter, NumPy, Pandas, Matplotlib, Seaborn
ML frameworks: Scikit-learn, TensorFlow, Keras

Applied projects:

Supervised learning with CRISP-DM framework
Classification/regression with evaluation metrics
Advanced models: decision trees, ensembles, neural nets, CNNs

Production deployment:

APIs and containers: Flask, Docker, Kubernetes
Cloud solutions: AWS Lambda, TensorFlow Serving/Lite

→ Register here

📅 Today’s Picks

Ruff: Stop AI Code Complexity Before It Hits Production

Problem
AI agents often create overengineered code with multiple nested if/else and try/except blocks, increasing technical debt and making functions difficult to test.
However, it is time-consuming to check each function manually.
Solution
Ruff’s C901 complexity check automatically flags overly complex functions before they enter your codebase.
This tool counts decision points (if/else, loops) that create multiple execution paths in your code.
Key benefits:

Automatic detection of complex functions during development
Configurable complexity thresholds for your team standards
Integration with pre-commit hooks for automated validation
Clear error messages showing exact complexity scores

No more manual code reviews to catch overengineered functions.

📖 View Full Article

Build Debuggable Tests: One Assertion Per Function

Problem
Tests with multiple assertions make debugging harder.
When a test fails, you can’t tell which assertion broke without examining the code.
Solution
Create multiple specific test functions for different scenarios of the same function.
Follow these practices for focused test functions:

One assertion per test function for clear failure points
Use descriptive test names that explain the expected behavior
Maintain consistent naming patterns across your test suite

This approach makes your test suite more maintainable and failures easier to diagnose.

📖 View Full Article

Stay Current with CodeCut

Actionable Python tips, curated for busy data pros. Skim in under 2 minutes, three times a week.

Newsletter #205: Build Debuggable Tests: One Assertion Per Function Read More »

python

Build a Complete RAG System with 5 Open-Source Tools

Delta Lake: Transform pandas Prototypes into Production

Pytest for Data Scientists

Drop a line

Get in touch

Follow Us on Social Media

python

Work with Khuyen Tran

Work with Khuyen Tran