Newsletter #209: Transform PDFs to Pandas with Docling’s Complete Pipeline

🤝 COLLABORATION

Learn ML Engineering for Free on ML Zoomcamp
Learn ML engineering for free on ML Zoomcamp and receive a certificate! Join online for practical, hands-on experience with the tech stack and workflows used in production ML. The next cohort of the course starts on September 15, 2025. Here’s what you’ll learn:
Core foundations:

Python ecosystem: Jupyter, NumPy, Pandas, Matplotlib, Seaborn
ML frameworks: Scikit-learn, TensorFlow, Keras

Applied projects:

Supervised learning with CRISP-DM framework
Classification/regression with evaluation metrics
Advanced models: decision trees, ensembles, neural nets, CNNs

Production deployment:

APIs and containers: Flask, Docker, Kubernetes
Cloud solutions: AWS Lambda, TensorFlow Serving/Lite

Register here

📅 Today’s Picks

Transform PDFs to Pandas with Docling’s Complete Pipeline

Problem
Most PDF processing tools force you to stitch together multiple solutions – one for extraction, another for parsing, and yet another for chunking.
Each step introduces potential data loss and format incompatibilities, making document processing complex and error-prone.
Solution
Docling handles the entire workflow from raw PDFs to structured, searchable content in a single solution.
Key features:

Universal format support for PDF, DOCX, PPTX, HTML, and images
AI-powered extraction with TableFormer and Vision models
Direct export to pandas DataFrames, JSON, and Markdown
RAG-ready output maintains context and structure

📖 View Full Article

☕️ Weekly Finds

semantic-kernel
[AI Orchestration]
– Model-agnostic SDK that empowers developers to build, orchestrate, and deploy AI agents and multi-agent systems with enterprise-grade reliability.

transformers
[Machine Learning]
– The model-definition framework for state-of-the-art machine learning models across text, vision, audio, and multimodal tasks, for both inference and training.

whisper
[Speech Recognition]
– Robust Speech Recognition via Large-Scale Weak Supervision. A multitasking model for multilingual speech recognition, translation, and language identification.

Looking for a specific tool? Explore 70+ Python tools →

Stay Current with CodeCut

Actionable Python tips, curated for busy data pros. Skim in under 2 minutes, three times a week.



Newsletter #208: Stop Loading Full Datasets: Use itertools.islice() for Smart Sampling


📅 Today’s Picks

Stop Loading Full Datasets: Use itertools.islice() for Smart Sampling

Problem
Data prototyping typically requires loading entire datasets into memory first before sampling.
A 1-million-row dataset consumes 7.6 MB of memory even when you only need 10 rows for initial feature exploration, creating unnecessary resource overhead.
Solution
Use itertools.islice() to extract slices from iterators without loading full datasets into memory first.
Key benefits:

Memory-efficient data sampling
Faster prototyping workflows
Less computational load on laptops
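The pattern can be sketched with a plain generator standing in for a streamed file (the record layout below is made up for illustration):

```python
from itertools import islice

def record_stream(n=1_000_000):
    """Lazily yield rows as if read line-by-line from a large file."""
    for i in range(n):
        yield {"id": f"user_{i}", "score": i % 100}

# Pull the first 10 records; the other 999,990 are never produced
sample = list(islice(record_stream(), 10))
print(len(sample))       # 10
print(sample[0]["id"])   # user_0
```

Because `islice` consumes the iterator lazily, memory use is bounded by the sample size, not the dataset size.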

📖 View Full Article

From pandas Full Reloads to Delta Lake Incremental Updates

Problem
Processing entire datasets when you only need to add a few new records wastes time and memory.
Pandas lacks incremental append capabilities, requiring full dataset reload for data updates.
Solution
Delta Lake’s append mode processes only new data without touching existing records.
Key advantages:

Append new records without full dataset reload
Memory usage scales with new data size, not total dataset size
Automatic data protection prevents corruption during updates
Time travel enables rollback to previous dataset versions

Perfect for production data pipelines that need reliable incremental updates.

📖 View Full Article

⭐ View GitHub

☕️ Weekly Finds

Semantic Kernel
[AI Framework]
– Model-agnostic SDK that empowers developers to build, orchestrate, and deploy AI agents and multi-agent systems with enterprise-grade reliability

Ray
[Distributed Computing]
– AI compute engine with core distributed runtime and AI Libraries for accelerating ML workloads from laptop to cluster

Apache Airflow
[Workflow Orchestration]
– Platform for developing, scheduling, and monitoring workflows with powerful data pipeline orchestration capabilities

Looking for a specific tool? Explore 70+ Python tools →



Delta Lake: Transform pandas Prototypes into Production

Table of Contents

Introduction
Introduction to Delta-rs
Setup and Data Preparation
Creating Your First Delta Table
Incremental Updates and CRUD Operations
Time Travel and Data Versioning
Schema Evolution in Action
Selective Updates with Merge Operations
Multi-Engine Integration
Automatic File Cleanup
Conclusion

Introduction
Data scientists face a familiar challenge: pandas works perfectly for prototyping, but production requires enterprise features that traditional file formats can’t provide.
Delta-rs solves this by bringing Delta Lake’s ACID transactions, time travel, and schema evolution to Python without Spark dependencies. It transforms your pandas workflow into production-ready pipelines with minimal code changes.
This tutorial shows you how to build scalable data systems using Delta-rs while maintaining the simplicity that makes pandas so effective.

💻 Get the Code: The complete source code and Jupyter notebook for this tutorial are available on GitHub. Clone it to follow along!


Introduction to Delta-rs
Delta-rs is a native Rust implementation of Delta Lake for Python. It provides enterprise-grade data lake capabilities without requiring Spark clusters or JVM setup.
Key advantages over traditional file formats:

ACID transactions ensure data consistency during concurrent operations
Time travel enables access to historical data versions
Schema evolution handles data structure changes automatically
Multi-engine support works with pandas, DuckDB, Polars, and more
Efficient updates support upserts and incremental changes without full rewrites

Setup and Data Preparation
Install Delta-rs and supporting libraries:
pip install deltalake pandas duckdb polars

We’ll use actual NYC Yellow Taxi data to demonstrate real-world scenarios. The NYC Taxi & Limousine Commission provides monthly trip records in Parquet format:
import pandas as pd
from deltalake import DeltaTable, write_deltalake
import duckdb
import polars as pl

# Download NYC Yellow Taxi data (June 2024 as example)
# Full dataset available at: https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
taxi_url = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-06.parquet"

# Read a sample of the data for demonstration
sample_data = pd.read_parquet(taxi_url).head(10000)

print(f"Loaded {len(sample_data)} taxi trips from NYC TLC")
print(f"Data shape: {sample_data.shape}")
print(f"Date range: {sample_data['tpep_pickup_datetime'].min()} to {sample_data['tpep_pickup_datetime'].max()}")

sample_data.head()

Output:
Loaded 10000 taxi trips from NYC TLC
Data shape: (10000, 19)
Date range: 2024-05-31 15:33:34 to 2024-06-01 02:59:54
VendorID tpep_pickup_datetime … congestion_surcharge Airport_fee
0 1 2024-06-01 00:03:46 … 0.0 1.75
1 2 2024-06-01 00:55:22 … 0.0 1.75
2 1 2024-06-01 00:23:53 … 0.0 0.00
3 1 2024-06-01 00:32:24 … 2.5 0.00
4 1 2024-06-01 00:51:38 … 2.5 0.00

[5 rows x 19 columns]

Creating Your First Delta Table
Create your first Delta table in the data directory:
write_deltalake("data/taxi_delta_table", sample_data, mode="overwrite")
print("Created Delta table")

# Read back from Delta table
dt = DeltaTable("data/taxi_delta_table")
df_from_delta = dt.to_pandas()

print(f"Delta table contains {len(df_from_delta)} records")

Output:
Created Delta table
Delta table contains 10000 records

View the Delta table structure:
# Inspect Delta table metadata
print("Delta table schema:")
print(dt.schema().to_arrow())

Output:
Delta table schema:
arro3.core.Schema
------------
VendorID: Int32
tpep_pickup_datetime: Timestamp(Microsecond, None)
tpep_dropoff_datetime: Timestamp(Microsecond, None)
passenger_count: Float64
trip_distance: Float64
...
total_amount: Float64
congestion_surcharge: Float64
Airport_fee: Float64

View the current version of the Delta table:
print(f"Current version: {dt.version()}")

Output:
Current version: 0
Incremental Updates and CRUD Operations
Instead of rewriting entire datasets when adding new records, incremental updates append only what changed. Delta-rs handles these efficient operations natively.
To demonstrate this, we'll simulate late-arriving data:
# Simulate late-arriving data
late_data = pd.read_parquet(taxi_url).iloc[10000:10050]
print(f"New data to add: {len(late_data)} records")

Output:
New data to add: 50 records

Traditional Approach: Process Everything
The pandas workflow requires loading both existing and new data, combining them, and rewriting the entire output file:
# Pandas approach – reload existing data and merge
existing_df = pd.read_parquet(taxi_url).head(10000)
complete_df = pd.concat([existing_df, late_data])
complete_df.to_parquet("data/taxi_complete.parquet")
print(f"Processed {len(complete_df)} total records")

Output:
Processed 10050 total records

Pandas processed all 10,050 records to add just 50 new ones, demonstrating the inefficiency of full-dataset operations.
Delta-rs Approach: Process Only New Data
Delta-rs appends only the new records without touching existing data:
# Delta-rs – append only what's new
write_deltalake("data/taxi_delta_table", late_data, mode="append")

dt = DeltaTable("data/taxi_delta_table")
print(f"Added {len(late_data)} new records")
print(f"Table version: {dt.version()}")

Output:
Added 50 new records
Table version: 1

Delta-rs processed only the 50 new records while automatically incrementing to version 1, enabling efficient operations and data lineage.
Time Travel and Data Versioning
Time travel and data versioning let you access any previous state of your data. This is essential for auditing changes, recovering from errors, and understanding how data evolved over time without maintaining separate backup files.
Traditional Approach: Manual Backup Strategy
Traditional file-based workflows rely on timestamped copies and manual versioning:
# Traditional approach – manual timestamped backups
import datetime
timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
df.to_parquet(f"data/taxi_backup_{timestamp}.parquet") # Create manual backup
df_modified.to_parquet("data/taxi_data.parquet") # Overwrite original
# To recover: manually identify and reload backup file

Delta-rs Approach: Built-in Time Travel
Delta-rs automatically tracks every change with instant access to any version:
# Access any historical version instantly
dt_v0 = DeltaTable("data/taxi_delta_table", version=0)
current_dt = DeltaTable("data/taxi_delta_table")

print(f"Version 0: {len(dt_v0.to_pandas())} records")
print(f"Current version: {len(current_dt.to_pandas())} records")
print(f"Available versions: {current_dt.version() + 1}")

Output:
Version 0: 10000 records
Current version: 10050 records
Available versions: 2

Delta-rs maintains 2 complete versions while traditional backups would require separate 57MB files for each timestamp.

📚 For comprehensive production data workflows and version control best practices, check out Production-Ready Data Science.

Schema Evolution in Action
As requirements evolve, you often need to add new columns or change data types. Schema evolution handles these changes automatically, letting you update your data structure without breaking existing queries or reprocessing historical records.
To demonstrate this, imagine NYC’s taxi authority introduces weather tracking and surge pricing features, requiring your pipeline to handle new weather_condition and surge_multiplier columns alongside existing fare data.
# Copy the existing data
enhanced_data = pd.read_parquet(taxi_url).iloc[20000:20100].copy()

# Simulate new data with additional business columns
weather_options = ['clear', 'rain', 'snow', 'cloudy']
surge_options = [1.0, 1.2, 1.5, 2.0]
enhanced_data['weather_condition'] = [weather_options[i % 4] for i in range(len(enhanced_data))]
enhanced_data['surge_multiplier'] = [surge_options[i % 4] for i in range(len(enhanced_data))]

print(f"Enhanced data: {len(enhanced_data)} records with {len(enhanced_data.columns)} columns")
print(f"New columns: {[col for col in enhanced_data.columns if col not in sample_data.columns]}")

Output:
Enhanced data: 100 records with 21 columns
New columns: ['weather_condition', 'surge_multiplier']

Traditional Approach: No Schema History
Traditional formats provide no tracking of schema changes or evolution history:
# Traditional approach – no schema versioning or history
df_v1 = pd.read_parquet("taxi_v1.parquet") # Original schema
df_v2 = pd.read_parquet("taxi_v2.parquet") # Enhanced schema

Delta-rs Approach: Schema Versioning and History
Delta-rs automatically merges schemas while tracking every change:
# Schema evolution with automatic versioning
write_deltalake(
    "data/taxi_delta_table",
    enhanced_data,
    mode="append",
    schema_mode="merge",
)

dt = DeltaTable("data/taxi_delta_table")
print(f"Schema evolved: {len(dt.to_pandas().columns)} columns | Version: {dt.version()}")

Output:
Schema evolved: 21 columns | Version: 2

Explore the complete schema evolution history and access any previous version:
# View schema change history
history = dt.history()
for entry in history[:2]:
    print(f"Version {entry['version']}: {entry['operation']} at {entry['timestamp']}")

# Access different schema versions
original_schema = DeltaTable("data/taxi_delta_table", version=0)
print(f"\nOriginal schema (v0): {len(original_schema.to_pandas().columns)} columns")
print(f"Current schema (v{dt.version()}): {len(dt.to_pandas().columns)} columns")

Output:
Version 2: WRITE at 1755180763083
Version 1: WRITE at 1755180762968

Original schema (v0): 19 columns
Current schema (v2): 21 columns

Delta-rs expanded from 19 to 21 columns across 10,150 records without schema migration scripts or pipeline failures.
Selective Updates with Merge Operations
Merge operations combine updates and inserts in a single transaction based on matching conditions. This eliminates the need to process entire datasets when you only need to modify specific records, dramatically improving efficiency at scale.
To demonstrate this, let’s create a simple taxi trips table:
# Create initial Delta table with 5 trips
trips = pd.DataFrame({
    'trip_id': [1, 2, 3, 4, 5],
    'fare_amount': [15.5, 20.0, 18.3, 12.5, 25.0],
    'payment_type': [1, 1, 2, 1, 2]
})
write_deltalake("data/trips_merge_demo", trips, mode="overwrite")
print("Initial trips:")
print(trips)

Output:
Initial trips:
trip_id fare_amount payment_type
0 1 15.5 1
1 2 20.0 1
2 3 18.3 2
3 4 12.5 1
4 5 25.0 2

Here are the updates we want to make:

Update trip 2: change fare from $20.00 to $22.00
Update trip 4: change fare from $12.50 to $13.80
Insert trip 6: new trip with fare $30.00
Insert trip 7: new trip with fare $16.50

Traditional Approach: Full Dataset Processing
Traditional workflows require loading complete datasets, identifying matches, and rewriting all records. This process becomes increasingly expensive as data grows:
# Traditional approach – load, modify, and rewrite everything
existing_df = trips.copy()

# Updates: manually locate and modify rows
existing_df.loc[existing_df['trip_id'] == 2, 'fare_amount'] = 22.0
existing_df.loc[existing_df['trip_id'] == 4, 'fare_amount'] = 13.8

# Inserts: create new rows and append
new_trips = pd.DataFrame({
    'trip_id': [6, 7],
    'fare_amount': [30.0, 16.5],
    'payment_type': [1, 1]
})
updated_df = pd.concat([existing_df, new_trips], ignore_index=True)

# Rewrite entire dataset
updated_df.to_parquet("data/trips_traditional.parquet")
print(updated_df)

Output:
trip_id fare_amount payment_type
0 1 15.5 1
1 2 22.0 1 # Updated
2 3 18.3 2
3 4 13.8 1 # Updated
4 5 25.0 2
5 6 30.0 1 # Inserted
6 7 16.5 1 # Inserted

Delta-rs Approach: Upsert with Merge Operations
Delta-rs merge operations handle both updates and inserts in a single atomic operation, processing only affected records:
# Prepare changes: 2 updates + 2 inserts
changes = pd.DataFrame({
    'trip_id': [2, 4, 6, 7],
    'fare_amount': [22.0, 13.8, 30.0, 16.5],
    'payment_type': [2, 2, 1, 1]
})

# Load Delta table
dt = DeltaTable("data/trips_merge_demo")

# Upsert operation: update existing, insert new
(
    dt.merge(
        source=changes,
        predicate="target.trip_id = source.trip_id",
        source_alias="source",
        target_alias="target",
    )
    .when_matched_update(
        updates={
            "fare_amount": "source.fare_amount",
            "payment_type": "source.payment_type",
        }
    )
    .when_not_matched_insert(
        updates={
            "trip_id": "source.trip_id",
            "fare_amount": "source.fare_amount",
            "payment_type": "source.payment_type",
        }
    )
    .execute()
)

# Verify results
result = dt.to_pandas().sort_values('trip_id').reset_index(drop=True)
print(result)

Output:
trip_id fare_amount payment_type
0 1 15.5 1
1 2 22.0 2 # Updated
2 3 18.3 2
3 4 13.8 2 # Updated
4 5 25.0 2
5 6 30.0 1 # Inserted
6 7 16.5 1 # Inserted

Delta-rs processed exactly 4 records (2 updates + 2 inserts) while pandas processed all 7 records. This efficiency compounds dramatically with larger datasets.
Multi-Engine Integration
Different teams often use different tools: pandas for exploration, DuckDB for SQL queries, Polars for performance. Multi-engine support lets all these tools access the same data directly without creating duplicates or writing conversion scripts.
Traditional Approach: Engine-Specific Optimization Requirements
Each engine needs different file optimizations that don’t transfer between tools:
Start with the original dataset:
# Traditional approach – Each engine needs different optimizations
data = {"payment_type": [1, 1, 2, 1, 2], "fare_amount": [15.5, 20.0, 18.3, 12.5, 25.0]}
df = pd.DataFrame(data)

The Pandas team optimizes for indexed lookups:
# Pandas team needs indexed Parquet for fast lookups
df.to_parquet("data/pandas_optimized.parquet", index=True)
pandas_result = pd.read_parquet("data/pandas_optimized.parquet")
print(f"Pandas: {len(pandas_result)} trips, avg ${pandas_result['fare_amount'].mean():.2f}")

Output:
Pandas: 5 trips, avg $18.26

The Polars team needs sorted data for predicate pushdown optimization:
# Polars team needs sorted columns for predicate pushdown
df.sort_values('payment_type').to_parquet("data/polars_optimized.parquet")
polars_result = pl.read_parquet("data/polars_optimized.parquet").select([
    pl.len().alias("trips"), pl.col("fare_amount").mean().alias("avg_fare")
])
print(f"Polars: {polars_result}")

Output:
Polars: shape: (1, 2)
┌───────┬──────────┐
│ trips ┆ avg_fare │
│ ---   ┆ ---      │
│ u32   ┆ f64      │
╞═══════╪══════════╡
│ 5     ┆ 18.26    │
└───────┴──────────┘

The DuckDB team requires specific compression for query performance:
# DuckDB needs specific compression/statistics for query planning
df.to_parquet("data/duckdb_optimized.parquet", compression='zstd')
duckdb_result = duckdb.execute("""
    SELECT COUNT(*) as trips, ROUND(AVG(fare_amount), 2) as avg_fare
    FROM 'data/duckdb_optimized.parquet'
""").fetchone()
print(f"DuckDB: {duckdb_result[0]} trips, ${duckdb_result[1]} avg")

Output:
DuckDB: 5 trips, $18.26 avg

Delta-rs Approach: Universal Optimizations
Delta-rs provides built-in optimizations that benefit all engines simultaneously:
Create one optimized Delta table that serves all engines:
# Delta-rs approach – Universal optimizations for all engines
from deltalake import write_deltalake, DeltaTable
import polars as pl
import duckdb

# Create Delta table with built-in optimizations:
data = {"payment_type": [1, 1, 2, 1, 2], "fare_amount": [15.5, 20.0, 18.3, 12.5, 25.0]}
write_deltalake("data/universal_demo", pd.DataFrame(data))

Pandas benefits from Delta’s statistics for efficient filtering:
# Pandas gets automatic optimization benefits
dt = DeltaTable("data/universal_demo")
pandas_result = dt.to_pandas()
print(f"Pandas: {len(pandas_result)} trips, avg ${pandas_result['fare_amount'].mean():.2f}")

Output:
Pandas: 5 trips, avg $18.26

Polars leverages Delta’s column statistics for predicate pushdown:
# Polars gets predicate pushdown optimization automatically
polars_result = pl.read_delta("data/universal_demo").select([
    pl.len().alias("trips"),
    pl.col("fare_amount").mean().alias("avg_fare")
])
print(f"Polars: {polars_result}")

Output:
Polars: shape: (1, 2)
┌───────┬──────────┐
│ trips ┆ avg_fare │
│ ---   ┆ ---      │
│ u32   ┆ f64      │
╞═══════╪══════════╡
│ 5     ┆ 18.26    │
└───────┴──────────┘

DuckDB uses Delta’s statistics for query planning optimization:
# DuckDB gets optimized query plans from Delta statistics
duckdb_result = duckdb.execute("""
    SELECT COUNT(*) as trips, ROUND(AVG(fare_amount), 2) as avg_fare
    FROM delta_scan('data/universal_demo')
""").fetchone()
print(f"DuckDB: {duckdb_result[0]} trips, ${duckdb_result[1]} avg")

Output:
DuckDB: 5 trips, $18.26 avg

One Delta table with universal optimizations benefiting all engines.
Automatic File Cleanup
Every data update creates new files while keeping old versions for time travel. Vacuum identifies files older than your retention period and safely deletes them, freeing storage space without affecting active data or recent history.
Traditional Approach: Manual Cleanup Scripts
Traditional workflows require custom scripts to manage file cleanup:
# Traditional approach – manual file management
import os
import glob
from datetime import datetime, timedelta

# Find old backup files manually
old_files = []
cutoff_date = datetime.now() - timedelta(days=7)
for file in glob.glob("data/taxi_backup_*.parquet"):
    file_time = datetime.fromtimestamp(os.path.getmtime(file))
    if file_time < cutoff_date:
        old_files.append(file)
        os.remove(file)  # Manual cleanup with risk

Delta-rs Approach: Built-in Vacuum Operation
Delta-rs provides safe, automated cleanup through its vacuum() operation, which removes unused transaction files while preserving data integrity. Files become unused when:
• UPDATE operations create new versions, leaving old data files unreferenced
• DELETE operations remove data, making those files obsolete
• Failed transactions leave temporary files that were never committed
• Table optimization consolidates small files, making originals unnecessary
# Delta-rs vacuum removes unused files safely with ACID protection
from deltalake import DeltaTable
import os

def get_size(path):
    """Calculate total directory size in MB"""
    total_size = 0
    for dirpath, dirnames, filenames in os.walk(path):
        for filename in filenames:
            total_size += os.path.getsize(os.path.join(dirpath, filename))
    return total_size / (1024 * 1024)

With our size calculation helper in place, let’s measure storage before and after vacuum:
dt = DeltaTable("data/taxi_delta_table")

# Measure storage before cleanup
before_size = get_size("data/taxi_delta_table")

# Safe cleanup – files only deleted if no active readers/writers
dt.vacuum(retention_hours=168) # Built-in safety: won't delete files in use

# Measure storage after cleanup
after_size = get_size("data/taxi_delta_table")

print(f"Delta vacuum completed safely")
print(f"Storage before: {before_size:.1f} MB")
print(f"Storage after: {after_size:.1f} MB")
print(f"Space reclaimed: {before_size - after_size:.1f} MB")

Output:
Delta vacuum completed safely
Storage before: 8.2 MB
Storage after: 5.7 MB
Space reclaimed: 2.5 MB

Delta vacuum removed 2.5 MB of obsolete file versions, reducing storage footprint by 30% while maintaining ACID transaction guarantees and time travel capabilities.
Conclusion
Delta-rs transforms the traditional pandas workflow by providing:

Incremental updates append only changed records without full rewrites
Time travel and versioning enable recovery and auditing without manual backups
Schema evolution handles column changes without breaking queries
Merge operations combine updates and inserts in single transactions
Multi-engine support lets pandas, DuckDB, and Polars access the same data
Automatic vacuum reclaims storage by removing obsolete file versions

The bridge from pandas prototyping to production data pipelines no longer requires complex infrastructure. Delta-rs provides the reliability and performance you need while maintaining the simplicity you want.
Related Tutorials

Alternative Scaling: Scaling Pandas Workflows with PySpark’s Pandas API for Spark-based approaches
Data Versioning: Version Control for Data and Models Using DVC for broader versioning strategies
DataFrame Performance: Polars vs. Pandas: A Fast, Multi-Core Alternative for DataFrame optimization techniques

📚 Want to go deeper? Learning new techniques is the easy part. Knowing how to structure, test, and deploy them is what separates side projects from real work. My book shows you how to build data science projects that actually make it to production. Get the book →


Delta Lake: Transform pandas Prototypes into Production

Newsletter #207: Build Automated Chart Analysis with Hugging Face SmolVLM

📅 Today’s Picks

Build Automated Chart Analysis with Hugging Face SmolVLM

Problem
Data teams spend hours manually analyzing charts and extracting insights from complex visualizations.
Manual chart analysis creates bottlenecks in decision-making workflows and reduces time available for strategic insights.
Solution
Hugging Face’s SmolVLM transforms this workflow by instantly generating insights, allowing analysts to focus on validation, strategic context, and decision-making rather than basic pattern recognition.
The complete workflow could look like this:

Automated chart interpretation using vision language models
Expert review and validation of AI findings
Strategic context addition by domain specialists

📖 View Full Article

⭐ View GitHub

Hydra Multi-run: Test All Parameters in One Command

Problem
When you run a Python script with different preprocessing strategies and hyperparameter combinations, waiting for each variation to complete before testing the next is time-consuming.
Solution
Hydra multi-run executes all parameter combinations in a single command, saving you time and effort.
Plus, Hydra offers:

YAML-based configuration management
Override parameters from the command line
Compose configs from multiple files
Environment-specific configuration switching

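As a sketch of how this might look (the file names and parameters here are hypothetical, not from the article), a Hydra app reads defaults from a YAML config:

```yaml
# conf/config.yaml (hypothetical defaults read by a @hydra.main-decorated script)
model:
  lr: 0.01
preprocessing: standard
```

Launching `python train.py -m model.lr=0.01,0.001 preprocessing=standard,minmax` would then sweep all four combinations in one command, with each run logged to its own output directory.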
📖 View Full Article

⭐ View GitHub

☕️ Weekly Finds

Scrapegraph-ai
[Data Extraction]
– Python scraper based on AI

Marker
[Document Processing]
– Convert PDF to markdown quickly with high accuracy

EdgeDB
[Database]
– A graph-relational database with declarative schema, built-in migration system, and a next-generation query language

Looking for a specific tool? Explore 70+ Python tools →


Newsletter #206: Handle Messy Data with RapidFuzz Fuzzy Matching

📅 Today’s Picks

Handle Messy Data with RapidFuzz Fuzzy Matching

Problem
Traditional regex approaches require hours of preprocessing but still break with common data variations like missing spaces, typos, or inconsistent formatting.
Solution
RapidFuzz eliminates data cleaning overhead with intelligent fuzzy matching.
Key benefits:

Automatic handling of typos, spacing, and case variations
Production-ready C++ performance for large datasets
Full spectrum of fuzzy algorithms in one library

📖 View Full Article

⭐ View GitHub


Newsletter #205: Build Debuggable Tests: One Assertion Per Function

🤝 COLLABORATION

Learn ML Engineering for Free on ML Zoomcamp
Learn ML engineering for free on ML Zoomcamp and receive a certificate! Join online for practical, hands-on experience with the tech stack and workflows used in production ML. The next cohort of the course starts on September 15, 2025. Here’s what you’ll learn:
Core foundations:

Python ecosystem: Jupyter, NumPy, Pandas, Matplotlib, Seaborn
ML frameworks: Scikit-learn, TensorFlow, Keras

Applied projects:

Supervised learning with CRISP-DM framework
Classification/regression with evaluation metrics
Advanced models: decision trees, ensembles, neural nets, CNNs

Production deployment:

APIs and containers: Flask, Docker, Kubernetes
Cloud solutions: AWS Lambda, TensorFlow Serving/Lite

Register here

📅 Today’s Picks

Ruff: Stop AI Code Complexity Before It Hits Production

Problem
AI agents often create overengineered code with multiple nested if/else and try/except blocks, increasing technical debt and making functions difficult to test.
However, it is time-consuming to check each function manually.
Solution
Ruff’s C901 complexity check automatically flags overly complex functions before they enter your codebase.
This tool counts decision points (if/else, loops) that create multiple execution paths in your code.
Key benefits:

Automatic detection of complex functions during development
Configurable complexity thresholds for your team standards
Integration with pre-commit hooks for automated validation
Clear error messages showing exact complexity scores

No more manual code reviews to catch overengineered functions.
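A minimal configuration sketch for enabling the check, assuming a pyproject.toml-based setup (the threshold of 5 is an arbitrary example; tune it to your team's standards):

```toml
# pyproject.toml
[tool.ruff.lint]
select = ["C901"]

[tool.ruff.lint.mccabe]
# Functions whose McCabe complexity exceeds this value trigger C901
max-complexity = 5
```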

📖 View Full Article

Build Debuggable Tests: One Assertion Per Function

Problem
Tests with multiple assertions make debugging harder.
When a test fails, you can’t tell which assertion broke without examining the code.
Solution
Create multiple specific test functions for different scenarios of the same function.
Follow these practices for focused test functions:

One assertion per test function for clear failure points
Use descriptive test names that explain the expected behavior
Maintain consistent naming patterns across your test suite

This approach makes your test suite more maintainable and failures easier to diagnose.
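A minimal sketch of the pattern, using a hypothetical discount helper as the function under test:

```python
def discount(price: float, rate: float) -> float:
    """Apply a percentage discount, clamping the result at zero."""
    return max(price * (1 - rate), 0.0)

# One assertion per test: each failure pinpoints exactly one behavior
def test_discount_applies_rate():
    assert discount(100.0, 0.25) == 75.0

def test_discount_clamps_at_zero():
    assert discount(100.0, 1.5) == 0.0

def test_discount_zero_rate_is_identity():
    assert discount(100.0, 0.0) == 100.0
```

When `test_discount_clamps_at_zero` fails, the name alone tells you which behavior broke, with no need to read the test body.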

📖 View Full Article


Newsletter #204: Build Fuzzy Text Matching with difflib Over regex

📅 Today’s Picks

Build Fuzzy Text Matching with difflib Over regex

Problem
Have you ever spent hours cleaning text data with regex, only to find that “iPhone 14 Pro Max” still doesn’t match “iPhone 14 Prro Max”?
Regex preprocessing achieves only exact matching after cleaning, failing completely with typos and character variations that exact matching cannot handle.
Solution
difflib provides similarity scoring that tolerates typos and character variations, enabling approximate matching where regex fails.
The library calculates similarity ratios between strings:

Handles typos like “Prro” vs “Pro” automatically
Returns similarity scores from 0.0 to 1.0 for ranking matches
Works with character-level variations without preprocessing
Enables fuzzy matching for real-world messy data

Perfect for product matching, name deduplication, and any scenario where exact matches aren’t realistic.
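For instance, the standard library's `get_close_matches` gives typo-tolerant lookup in a few lines (the catalog entries here are illustrative):

```python
from difflib import get_close_matches

catalog = ["iPhone 14 Pro Max", "Galaxy S23 Ultra", "Pixel 8 Pro"]

# "Prro" is a typo, yet similarity scoring still finds the right product
matches = get_close_matches("iPhone 14 Prro Max", catalog, n=1, cutoff=0.8)
print(matches)  # → ['iPhone 14 Pro Max']
```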

📖 View Full Article

Build Portable Python Scripts with uv PEP 723

Problem
Python scripts break when moved between environments because dependencies are scattered across requirements.txt files, virtual environments, or undocumented assumptions.
Solution
uv enables PEP 723 inline script dependencies, embedding all requirements directly in the script header for true portability.
Use uv add --script script.py dependency to automatically add metadata to any Python file.
Key benefits:

Self-contained scripts with zero external files
Easy command-line dependency management
Perfect for sharing data analysis code across teams

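A sketch of what the PEP 723 inline metadata header looks like (the dependency shown is a hypothetical example):

```python
# /// script
# requires-python = ">=3.9"
# dependencies = ["requests"]  # hypothetical dependency for illustration
# ///
# `uv run script.py` reads this header, builds an isolated environment with
# the declared dependencies, and then executes the script.

def main() -> str:
    return "script ready"

if __name__ == "__main__":
    print(main())
```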
📖 View Full Article


4 Text Similarity Tools: When Regex Isn’t Enough

Table of Contents

Introduction
Text Preprocessing with regex
difflib: Python’s Built-in Sequence Matching
RapidFuzz: High-Performance Fuzzy String Matching
Sentence Transformers: AI-Powered Semantic Similarity
When to Use Each Tool
Final Thoughts

Introduction
Text similarity is a fundamental challenge in data science. Whether you’re detecting duplicates, clustering content, or building search systems, the core question remains: how do you determine when different text strings represent the same concept?
Traditional exact matching fails with real-world data. Consider these common text similarity challenges:

Formatting variations: “iPhone® 14 Pro Max” vs “IPHONE 14 pro max” – identical products with different capitalization and symbols.
Missing spaces: “iPhone14ProMax” vs “iPhone 14 Pro Max” – same product name, completely different character sequences.
Extra information: “Apple iPhone 14 Pro Max 256GB” vs “iPhone 14 Pro Max” – additional details that obscure the core product.
Semantic equivalence: “wireless headphones” vs “bluetooth earbuds” – different words describing similar concepts.

These challenges require different approaches:

Regex preprocessing cleans formatting inconsistencies
difflib provides character-level similarity scoring
RapidFuzz handles fuzzy matching at scale
Sentence Transformers understands semantic relationships


Key Takeaways
Here’s what you’ll learn:

Handle 90% of text variations with regex preprocessing and RapidFuzz matching
Achieve 5× faster fuzzy matching compared to difflib with production-grade algorithms
Unlock semantic understanding with Sentence Transformers for conceptual similarity
Navigate decision trees from simple string matching to AI-powered text analysis
Implement scalable text similarity pipelines for real-world data challenges

Text Preprocessing with regex
Raw text data contains special characters, inconsistent capitalization, and formatting variations. Regular expressions provide the first line of defense by normalizing text.
These pattern-matching tools, accessed through Python’s re module, excel at finding and replacing text patterns like symbols, whitespace, and formatting inconsistencies.
Let’s start with a realistic dataset that demonstrates common text similarity challenges:
import re

# Sample messy text data
messy_products = [
    "iPhone® 14 Pro Max",
    "IPHONE 14 pro max",
    "Apple iPhone 14 Pro Max 256GB",
    "iPhone14ProMax",
    "i-Phone 14 Pro Max",
    "Samsung Galaxy S23 Ultra",
    "SAMSUNG Galaxy S23 Ultra 5G",
    "Galaxy S23 Ultra (512GB)",
    "Samsung S23 Ultra",
    "wireless headphones",
    "bluetooth earbuds",
    "Sony WH-1000XM4 Headphones",
    "WH-1000XM4 Wireless Headphones",
]

With our test data established, we can build a comprehensive preprocessing function to handle these variations:
def preprocess_product_name(text):
    """Clean product names for better similarity matching."""
    # Convert to lowercase
    text = text.lower()

    # Remove size/capacity info in parentheses
    # (must run before the parentheses themselves are stripped below)
    text = re.sub(r"\([^)]*\)", "", text)

    # Remove special characters and symbols
    text = re.sub(r"[®™©]", "", text)
    text = re.sub(r"[^\w\s-]", " ", text)

    # Normalize spaces and hyphens
    text = re.sub(r"[-_]+", " ", text)
    text = re.sub(r"\s+", " ", text)

    return text.strip()

> 📖 **Related**: These regex patterns use traditional syntax for maximum compatibility. For more readable pattern construction, explore [PRegEx for human-friendly regex syntax](https://codecut.ai/pregex-write-human-readable-regular-expressions-in-python-2/).

# Apply preprocessing to sample data
print("Before and after preprocessing:")
print("-" * 50)
for product in messy_products[:8]:
    cleaned = preprocess_product_name(product)
    print(f"Original: {product}")
    print(f"Cleaned: {cleaned}")
    print()

Output:
Before and after preprocessing:
--------------------------------------------------
Original: iPhone® 14 Pro Max
Cleaned: iphone 14 pro max

Original: IPHONE 14 pro max
Cleaned: iphone 14 pro max

Original: Apple iPhone 14 Pro Max 256GB
Cleaned: apple iphone 14 pro max 256gb

Original: iPhone14ProMax
Cleaned: iphone14promax

Original: i-Phone 14 Pro Max
Cleaned: i phone 14 pro max

Original: Samsung Galaxy S23 Ultra
Cleaned: samsung galaxy s23 ultra

Original: SAMSUNG Galaxy S23 Ultra 5G
Cleaned: samsung galaxy s23 ultra 5g

Original: Galaxy S23 Ultra (512GB)
Cleaned: galaxy s23 ultra

Perfect matches emerge after cleaning formatting inconsistencies. Products 1 and 2 now match exactly, demonstrating regex’s power for standardization.
However, regex preprocessing fails with critical variations. Let’s test exact matching after preprocessing:
# Test exact matching after regex preprocessing
test_cases = [
    ("iPhone® 14 Pro Max", "IPHONE 14 pro max", "Case + symbols"),
    ("iPhone® 14 Pro Max", "Apple iPhone 14 Pro Max 256GB", "Extra words"),
    ("iPhone® 14 Pro Max", "iPhone14ProMax", "Missing spaces"),
    ("Apple iPhone 14 Pro Max", "iPhone 14 Pro Max Apple", "Word order"),
    ("wireless headphones", "bluetooth earbuds", "Semantic gap"),
]

# Test each case
for product1, product2, issue_type in test_cases:
    cleaned1 = preprocess_product_name(product1)
    cleaned2 = preprocess_product_name(product2)
    is_match = cleaned1 == cleaned2
    result = "✓" if is_match else "✗"
    print(f"{result} {issue_type}: {is_match}")

Output:
✓ Case + symbols: True
✗ Extra words: False
✗ Missing spaces: False
✗ Word order: False
✗ Semantic gap: False

Regex achieves only 1/5 exact matches despite preprocessing. Success: case and symbol standardization. Failures:

Extra words: “apple iphone” vs “iphone” remain different
Missing spaces: “iphone14promax” vs “iphone 14 pro max” fail matching
Word reordering: Different arrangements of identical words don’t match
Semantic gaps: No shared text patterns between conceptually similar products

These limitations require character-level similarity measurement instead of exact matching. Python’s built-in difflib module provides the solution by analyzing character sequences and calculating similarity ratios.
difflib: Python’s Built-in Sequence Matching
difflib is a Python built-in module that provides similarity ratios. It analyzes character sequences to calculate similarity scores between text strings.
from difflib import SequenceMatcher

def calculate_similarity(text1, text2):
    """Calculate similarity ratio between two strings."""
    return SequenceMatcher(None, text1, text2).ratio()

# Test difflib on key similarity challenges
test_cases = [
    ("iphone 14 pro max", "iphone 14 pro max", "Exact match"),
    ("iphone 14 pro max", "i phone 14 pro max", "Spacing variation"),
    ("iphone 14 pro max", "apple iphone 14 pro max 256gb", "Extra words"),
    ("iphone 14 pro max", "iphone14promax", "Missing spaces"),
    ("iphone 14 pro max", "iphone 14 prro max", "Typo"),
    ("apple iphone 14 pro max", "iphone 14 pro max apple", "Word order"),
    ("wireless headphones", "bluetooth earbuds", "Semantic gap"),
]

for text1, text2, test_type in test_cases:
    score = calculate_similarity(text1, text2)
    result = "✓" if score >= 0.85 else "✗"
    print(f"{result} {test_type}: {score:.3f}")

Output:
✓ Exact match: 1.000
✓ Spacing variation: 0.971
✗ Extra words: 0.739
✓ Missing spaces: 0.903
✓ Typo: 0.971
✗ Word order: 0.739
✗ Semantic gap: 0.333

difflib achieves 4/7 successful matches (≥0.85 threshold). Successes: exact matches, spacing variations, typos, and missing spaces. Failures:

Word reordering: “Apple iPhone” vs “iPhone Apple” drops to 0.739
Extra content: Additional words reduce scores to 0.739
Semantic gaps: Different words for same concept score only 0.333

These results highlight difflib’s core limitation: sensitivity to word order and poor handling of extra content. RapidFuzz tackles word reordering and extra content issues with sophisticated matching algorithms that understand token relationships beyond simple character comparison.
RapidFuzz: High-Performance Fuzzy String Matching
RapidFuzz is a high-performance fuzzy string matching library with C++ optimization. It addresses word reordering and complex text variations that difflib cannot handle effectively.
To install RapidFuzz, run:
pip install rapidfuzz

Let’s test RapidFuzz on the same test cases:
from rapidfuzz import fuzz

# Test RapidFuzz using WRatio algorithm
test_cases = [
    ("iphone 14 pro max", "iphone 14 pro max", "Exact match"),
    ("iphone 14 pro max", "i phone 14 pro max", "Spacing variation"),
    ("iphone 14 pro max", "apple iphone 14 pro max 256gb", "Extra words"),
    ("iphone 14 pro max", "iphone14promax", "Missing spaces"),
    ("iphone 14 pro max", "iphone 14 prro max", "Typo"),
    ("apple iphone 14 pro max", "iphone 14 pro max apple", "Word order"),
    ("wireless headphones", "bluetooth earbuds", "Semantic gap"),
    ("macbook pro", "laptop computer", "Conceptual gap"),
]

for text1, text2, test_type in test_cases:
    score = fuzz.WRatio(text1, text2) / 100  # Convert to 0-1 scale
    result = "✓" if score >= 0.85 else "✗"
    print(f"{result} {test_type}: {score:.3f}")

Output:
✓ Exact match: 1.000
✓ Spacing variation: 0.971
✓ Extra words: 0.900
✓ Missing spaces: 0.903
✓ Typo: 0.971
✓ Word order: 0.950
✗ Semantic gap: 0.389
✗ Conceptual gap: 0.385

RapidFuzz achieves 6/8 successful matches (≥0.85 threshold). Successes: exact matches, spacing, extra words, missing spaces, typos, and word order. Failures:

Semantic gaps: “wireless headphones” vs “bluetooth earbuds” scores only 0.389
Conceptual relationships: “macbook pro” vs “laptop computer” achieves just 0.385
Pattern-only matching: Cannot understand that different words describe same products

These failures reveal RapidFuzz’s fundamental limitation: it excels at text-level variations but cannot understand meaning. When products serve identical purposes using different terminology, we need semantic understanding rather than pattern matching.
Sentence Transformers addresses this gap through neural language models that comprehend conceptual relationships.
Sentence Transformers: AI-Powered Semantic Similarity
Surface-level text matching misses semantic relationships. Sentence Transformers, a library built on transformer neural networks, can understand that “wireless headphones” and “bluetooth earbuds” serve identical purposes by analyzing meaning rather than just character patterns.
To install Sentence Transformers, run:
pip install sentence-transformers

Let’s test Sentence Transformers on the same test cases:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Test semantic understanding capabilities
model = SentenceTransformer('all-MiniLM-L6-v2')

test_cases = [
    ("iphone 14 pro max", "iphone 14 pro max", "Exact match"),
    ("iphone 14 pro max", "i phone 14 pro max", "Spacing variation"),
    ("iphone 14 pro max", "apple iphone 14 pro max 256gb", "Extra words"),
    ("apple iphone 14 pro max", "iphone 14 pro max apple", "Word order"),
    ("wireless headphones", "bluetooth earbuds", "Semantic match"),
    ("macbook pro", "laptop computer", "Conceptual match"),
    ("gaming console", "video game system", "Synonym match"),
    ("smartphone", "feature phone", "Related concepts"),
]

for text1, text2, test_type in test_cases:
    embeddings = model.encode([text1, text2])
    score = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
    result = "✓" if score >= 0.65 else "✗"
    print(f"{result} {test_type}: {score:.3f}")

Output:
✓ Exact match: 1.000
✓ Spacing variation: 0.867
✓ Extra words: 0.818
✓ Word order: 0.988
✗ Semantic match: 0.618
✓ Conceptual match: 0.652
✓ Synonym match: 0.651
✗ Related concepts: 0.600

Sentence Transformers achieves 6/8 successful matches (≥0.65 threshold). Successes: all text variations plus most semantic relationships. Failures and caveats:

Near-threshold semantics: “wireless headphones” vs “bluetooth earbuds” scores 0.618, just below the cutoff
Edge case semantics: “smartphone” vs “feature phone” scores only 0.600
Processing overhead: Neural inference requires significantly more computation than string algorithms
Memory requirements: Models need substantial RAM (100MB+ for basic models, GBs for advanced ones)
Resource scaling: Large datasets may require GPU acceleration for reasonable performance

Sentence Transformers unlocks semantic understanding at computational cost. The decision depends on whether conceptual relationships provide sufficient business value to justify resource overhead.
For implementing semantic search at production scale, see our pgvector and Ollama integration guide.
When to Use Each Tool
Data Preprocessing (Always Start Here)
Use regex for:

Removing special characters and symbols
Standardizing case and formatting
Cleaning messy product names
Preparing text for similarity analysis

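Such a preprocessing step can be sketched as follows (the function name and sample strings are illustrative, not from the article):

```python
import re

def clean_product_name(name: str) -> str:
    """Normalize a raw product name before similarity matching."""
    name = name.lower()                       # standardize case
    name = re.sub(r"[^a-z0-9\s]", " ", name)  # strip special characters and symbols
    return re.sub(r"\s+", " ", name).strip()  # collapse extra whitespace

print(clean_product_name("  Apple® iPhone-14 Pro/Max!! "))  # apple iphone 14 pro max
```

Running every record through one normalizer like this before any similarity tool keeps the downstream scores comparable.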
Character-Level Similarity
Use difflib when:

Learning text similarity concepts
Working with small datasets (<1000 records)
External dependencies not allowed
Simple typo detection is sufficient

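A minimal stdlib-only sketch of this approach (sample strings are illustrative):

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Character-level similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

# Handles typos and spacing variations well, but is blind to meaning
print(f"{similarity('iphone 14 pro max', 'i phone 14 pro max'):.3f}")  # high
print(f"{similarity('wireless headphones', 'bluetooth earbuds'):.3f}")  # low
```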
Production Fuzzy Matching
Use RapidFuzz when:

Processing thousands of records
Need fast approximate matching
Handling abbreviations and variations
Text-level similarity is sufficient

Semantic Understanding
Use Sentence Transformers when:

Conceptual relationships matter
“wireless headphones” should match “bluetooth earbuds”
Building recommendation systems
Multilingual content similarity
Compute resources are available

Performance vs Accuracy Tradeoff

Requirement        | Recommended Tool
Speed > Accuracy   | RapidFuzz
Accuracy > Speed   | Sentence Transformers
No Dependencies    | difflib
Preprocessing Only | regex

Decision Tree
When facing a new text similarity project, use this visual guide to navigate from problem requirements to the optimal tool selection:
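The same decision logic can also be sketched in code (the function and its inputs are illustrative, not from the article):

```python
def choose_tool(n_records: int, need_semantics: bool,
                allow_dependencies: bool = True) -> str:
    """Map simple project requirements to a text similarity tool."""
    if need_semantics:
        return "sentence-transformers"  # conceptual relationships matter
    if not allow_dependencies:
        return "difflib"                # stdlib only, no external packages
    if n_records >= 1000:
        return "rapidfuzz"              # fast approximate matching at scale
    return "difflib"                    # small data, simple typo detection

print(choose_tool(50_000, need_semantics=False))  # rapidfuzz
```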

Final Thoughts
When facing complex challenges, start with the most basic solution first, identify where it fails through testing, then strategically upgrade the failing component. This article demonstrates exactly that progression: from simple regex preprocessing to sophisticated semantic understanding.
Build complexity incrementally based on real limitations, not anticipated ones.

📚 For comprehensive production-ready data science practices, check out Production-Ready Data Science.

📚 Want to go deeper? Learning new techniques is the easy part. Knowing how to structure, test, and deploy them is what separates side projects from real work. My book shows you how to build data science projects that actually make it to production. Get the book →

Stay Current with CodeCut
Actionable Python tips, curated for busy data pros. Skim in under 2 minutes, three times a week.


4 Text Similarity Tools: When Regex Isn’t Enough

Newsletter #203: Semantic Search Without Complex Setup Headaches

📅 Today’s Picks

Semantic Search Without Complex Setup Headaches

Problem
Have you ever found yourself looking up SQL syntax when you just want to query your database?
Complex joins and subqueries create friction between you and your data insights.
Solution
The semantic search workflow connects natural language questions to your existing PostgreSQL tables.
The complete workflow includes:

Database setup with PostgreSQL and pgvector extension
Content preprocessing for optimal embeddings
Embedding pipeline using Ollama models
Vector storage with SQLAlchemy integration
Query interface for natural language searches
Response generation combining retrieval and LLMs

Query your database with plain English instead of SQL syntax.
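A rough sketch of the embedding and query steps (the endpoint, model name, and table layout are assumptions, not from the article; Ollama exposes a local /api/embeddings REST endpoint, and pgvector's <=> operator computes cosine distance):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embeddings"  # assumed local Ollama server

def embed(text: str, model: str = "nomic-embed-text") -> list:
    """Request an embedding vector from Ollama (server must be running)."""
    payload = json.dumps({"model": model, "prompt": text}).encode()
    req = urllib.request.Request(OLLAMA_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embedding"]

def nearest_neighbors_sql(table: str, k: int = 5) -> str:
    """Build a pgvector similarity query; '<=>' is cosine distance."""
    return (f"SELECT content FROM {table} "
            f"ORDER BY embedding <=> :query_vec LIMIT {k}")

print(nearest_neighbors_sql("documents"))
```

The query string would be executed through SQLAlchemy with the embedded question bound to the :query_vec parameter.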

📖 View Full Article




Work with Khuyen Tran