Pandas

Delta Lake: Transform pandas Prototypes into Production

Table of Contents

Introduction
Introduction to Delta-rs
Setup and Data Preparation
Creating Your First Delta Table
Incremental Updates and CRUD Operations
Time Travel and Data Versioning
Schema Evolution in Action
Selective Updates with Merge Operations
Multi-Engine Integration
Automatic File Cleanup
Conclusion

Introduction
Data scientists face a familiar challenge: pandas works perfectly for prototyping, but production requires enterprise features that traditional file formats can’t provide.
Delta-rs solves this by bringing Delta Lake’s ACID transactions, time travel, and schema evolution to Python without Spark dependencies. It transforms your pandas workflow into production-ready pipelines with minimal code changes.
This tutorial shows you how to build scalable data systems using Delta-rs while maintaining the simplicity that makes pandas so effective.

💻 Get the Code: The complete source code and Jupyter notebook for this tutorial are available on GitHub. Clone it to follow along!

Introduction to Delta-rs
Delta-rs is a native Rust implementation of Delta Lake for Python. It provides enterprise-grade data lake capabilities without requiring Spark clusters or JVM setup.
Key advantages over traditional file formats:

ACID transactions ensure data consistency during concurrent operations
Time travel enables access to historical data versions
Schema evolution handles data structure changes automatically
Multi-engine support works with pandas, DuckDB, Polars, and more
Efficient updates support upserts and incremental changes without full rewrites

Setup and Data Preparation
Install Delta-rs and supporting libraries:
pip install deltalake pandas duckdb polars

We’ll use actual NYC Yellow Taxi data to demonstrate real-world scenarios. The NYC Taxi & Limousine Commission provides monthly trip records in Parquet format:
import pandas as pd
from deltalake import DeltaTable, write_deltalake
import duckdb
import polars as pl

# Download NYC Yellow Taxi data (June 2024 as example)
# Full dataset available at: https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
taxi_url = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-06.parquet"

# Read a sample of the data for demonstration
sample_data = pd.read_parquet(taxi_url).head(10000)

print(f"Loaded {len(sample_data)} taxi trips from NYC TLC")
print(f"Data shape: {sample_data.shape}")
print(f"Date range: {sample_data['tpep_pickup_datetime'].min()} to {sample_data['tpep_pickup_datetime'].max()}")

sample_data.head()

Output:
Loaded 10000 taxi trips from NYC TLC
Data shape: (10000, 19)
Date range: 2024-05-31 15:33:34 to 2024-06-01 02:59:54
VendorID tpep_pickup_datetime … congestion_surcharge Airport_fee
0 1 2024-06-01 00:03:46 … 0.0 1.75
1 2 2024-06-01 00:55:22 … 0.0 1.75
2 1 2024-06-01 00:23:53 … 0.0 0.00
3 1 2024-06-01 00:32:24 … 2.5 0.00
4 1 2024-06-01 00:51:38 … 2.5 0.00

[5 rows x 19 columns]

Creating Your First Delta Table
Create your first Delta table in the data directory:
write_deltalake("data/taxi_delta_table", sample_data, mode="overwrite")
print("Created Delta table")

# Read back from Delta table
dt = DeltaTable("data/taxi_delta_table")
df_from_delta = dt.to_pandas()

print(f"Delta table contains {len(df_from_delta)} records")

Output:
Created Delta table
Delta table contains 10000 records

View the Delta table structure:
# Inspect Delta table metadata
print("Delta table schema:")
print(dt.schema().to_arrow())

Output:
Delta table schema:
arro3.core.Schema
————
VendorID: Int32
tpep_pickup_datetime: Timestamp(Microsecond, None)
tpep_dropoff_datetime: Timestamp(Microsecond, None)
passenger_count: Float64
trip_distance: Float64

total_amount: Float64
congestion_surcharge: Float64
Airport_fee: Float64

View the current version of the Delta table:
print(f"Current version: {dt.version()}")

Output:
Current version: 0
Incremental Updates and CRUD Operations
Instead of rewriting entire datasets when adding new records, incremental updates append only what changed. Delta-rs handles these efficient operations natively.
To demonstrate this, we'll simulate late-arriving data:
# Simulate late-arriving data
late_data = pd.read_parquet(taxi_url).iloc[10000:10050]
print(f"New data to add: {len(late_data)} records")

Output:
New data to add: 50 records

Traditional Approach: Process Everything
The pandas workflow requires loading both existing and new data, combining them, and rewriting the entire output file:
# Pandas approach – reload existing data and merge
existing_df = pd.read_parquet(taxi_url).head(10000)
complete_df = pd.concat([existing_df, late_data])
complete_df.to_parquet("data/taxi_complete.parquet")
print(f"Processed {len(complete_df)} total records")

Output:
Processed 10050 total records

Pandas processed all 10,050 records to add just 50 new ones, demonstrating the inefficiency of full-dataset operations.
Delta-rs Approach: Process Only New Data
Delta-rs appends only the new records without touching existing data:
# Delta-rs – append only what's new
write_deltalake("data/taxi_delta_table", late_data, mode="append")

dt = DeltaTable("data/taxi_delta_table")
print(f"Added {len(late_data)} new records")
print(f"Table version: {dt.version()}")

Output:
Added 50 new records
Table version: 1

Delta-rs processed only the 50 new records while automatically incrementing to version 1, enabling efficient operations and data lineage.
Time Travel and Data Versioning
Time travel and data versioning let you access any previous state of your data. This is essential for auditing changes, recovering from errors, and understanding how data evolved over time without maintaining separate backup files.
Traditional Approach: Manual Backup Strategy
Traditional file-based workflows rely on timestamped copies and manual versioning:
# Traditional approach – manual timestamped backups
import datetime
timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
df.to_parquet(f"data/taxi_backup_{timestamp}.parquet") # Create manual backup
df_modified.to_parquet("data/taxi_data.parquet") # Overwrite original
# To recover: manually identify and reload backup file

Delta-rs Approach: Built-in Time Travel
Delta-rs automatically tracks every change with instant access to any version:
# Access any historical version instantly
dt_v0 = DeltaTable("data/taxi_delta_table", version=0)
current_dt = DeltaTable("data/taxi_delta_table")

print(f"Version 0: {len(dt_v0.to_pandas())} records")
print(f"Current version: {len(current_dt.to_pandas())} records")
print(f"Available versions: {current_dt.version() + 1}")

Output:
Version 0: 10000 records
Current version: 10050 records
Available versions: 2

Delta-rs maintains 2 complete versions while traditional backups would require separate 57MB files for each timestamp.

📚 For comprehensive production data workflows and version control best practices, check out Production-Ready Data Science.

Schema Evolution in Action
As requirements evolve, you often need to add new columns or change data types. Schema evolution handles these changes automatically, letting you update your data structure without breaking existing queries or reprocessing historical records.
To demonstrate this, imagine NYC’s taxi authority introduces weather tracking and surge pricing features, requiring your pipeline to handle new weather_condition and surge_multiplier columns alongside existing fare data.
# Copy the existing data
enhanced_data = pd.read_parquet(taxi_url).iloc[20000:20100].copy()

# Simulate new data with additional business columns
weather_options = ['clear', 'rain', 'snow', 'cloudy']
surge_options = [1.0, 1.2, 1.5, 2.0]
enhanced_data['weather_condition'] = [weather_options[i % 4] for i in range(len(enhanced_data))]
enhanced_data['surge_multiplier'] = [surge_options[i % 4] for i in range(len(enhanced_data))]

print(f"Enhanced data: {len(enhanced_data)} records with {len(enhanced_data.columns)} columns")
print(f"New columns: {[col for col in enhanced_data.columns if col not in sample_data.columns]}")

Output:
Enhanced data: 100 records with 21 columns
New columns: ['weather_condition', 'surge_multiplier']

Traditional Approach: No Schema History
Traditional formats provide no tracking of schema changes or evolution history:
# Traditional approach – no schema versioning or history
df_v1 = pd.read_parquet("taxi_v1.parquet") # Original schema
df_v2 = pd.read_parquet("taxi_v2.parquet") # Enhanced schema

Delta-rs Approach: Schema Versioning and History
Delta-rs automatically merges schemas while tracking every change:
# Schema evolution with automatic versioning
write_deltalake(
    "data/taxi_delta_table",
    enhanced_data,
    mode="append",
    schema_mode="merge"
)

dt = DeltaTable("data/taxi_delta_table")
print(f"Schema evolved: {len(dt.to_pandas().columns)} columns | Version: {dt.version()}")

Output:
Schema evolved: 21 columns | Version: 2

Explore the complete schema evolution history and access any previous version:
# View schema change history
history = dt.history()
for entry in history[:2]:
    print(f"Version {entry['version']}: {entry['operation']} at {entry['timestamp']}")

# Access different schema versions
original_schema = DeltaTable("data/taxi_delta_table", version=0)
print(f"\nOriginal schema (v0): {len(original_schema.to_pandas().columns)} columns")
print(f"Current schema (v{dt.version()}): {len(dt.to_pandas().columns)} columns")

Output:
Version 2: WRITE at 1755180763083
Version 1: WRITE at 1755180762968

Original schema (v0): 19 columns
Current schema (v2): 21 columns

Delta-rs expanded from 19 to 21 columns across 10,150 records without schema migration scripts or pipeline failures.
Selective Updates with Merge Operations
Merge operations combine updates and inserts in a single transaction based on matching conditions. This eliminates the need to process entire datasets when you only need to modify specific records, dramatically improving efficiency at scale.
To demonstrate this, let’s create a simple taxi trips table:
# Create initial Delta table with 5 trips
trips = pd.DataFrame({
    'trip_id': [1, 2, 3, 4, 5],
    'fare_amount': [15.5, 20.0, 18.3, 12.5, 25.0],
    'payment_type': [1, 1, 2, 1, 2]
})
write_deltalake("data/trips_merge_demo", trips, mode="overwrite")
print("Initial trips:")
print(trips)

Output:
Initial trips:
trip_id fare_amount payment_type
0 1 15.5 1
1 2 20.0 1
2 3 18.3 2
3 4 12.5 1
4 5 25.0 2

Here are the updates we want to make:

Update trip 2: change fare from $20.00 to $22.00
Update trip 4: change fare from $12.50 to $13.80
Insert trip 6: new trip with fare $30.00
Insert trip 7: new trip with fare $16.50

Traditional Approach: Full Dataset Processing
Traditional workflows require loading complete datasets, identifying matches, and rewriting all records. This process becomes increasingly expensive as data grows:
# Traditional approach – load, modify, and rewrite everything
existing_df = trips.copy()

# Updates: manually locate and modify rows
existing_df.loc[existing_df['trip_id'] == 2, 'fare_amount'] = 22.0
existing_df.loc[existing_df['trip_id'] == 4, 'fare_amount'] = 13.8

# Inserts: create new rows and append
new_trips = pd.DataFrame({
    'trip_id': [6, 7],
    'fare_amount': [30.0, 16.5],
    'payment_type': [1, 1]
})
updated_df = pd.concat([existing_df, new_trips], ignore_index=True)

# Rewrite entire dataset
updated_df.to_parquet("data/trips_traditional.parquet")
print(updated_df)

Output:
trip_id fare_amount payment_type
0 1 15.5 1
1 2 22.0 1 # Updated
2 3 18.3 2
3 4 13.8 1 # Updated
4 5 25.0 2
5 6 30.0 1 # Inserted
6 7 16.5 1 # Inserted

Delta-rs Approach: Upsert with Merge Operations
Delta-rs merge operations handle both updates and inserts in a single atomic operation, processing only affected records:
# Prepare changes: 2 updates + 2 inserts
changes = pd.DataFrame({
    'trip_id': [2, 4, 6, 7],
    'fare_amount': [22.0, 13.8, 30.0, 16.5],
    'payment_type': [2, 2, 1, 1]
})

# Load Delta table
dt = DeltaTable("data/trips_merge_demo")

# Upsert operation: update existing, insert new
(
    dt.merge(
        source=changes,
        predicate="target.trip_id = source.trip_id",
        source_alias="source",
        target_alias="target",
    )
    .when_matched_update(
        updates={
            "fare_amount": "source.fare_amount",
            "payment_type": "source.payment_type",
        }
    )
    .when_not_matched_insert(
        updates={
            "trip_id": "source.trip_id",
            "fare_amount": "source.fare_amount",
            "payment_type": "source.payment_type",
        }
    )
    .execute()
)

# Verify results
result = dt.to_pandas().sort_values('trip_id').reset_index(drop=True)
print(result)

Output:
trip_id fare_amount payment_type
0 1 15.5 1
1 2 22.0 2 # Updated
2 3 18.3 2
3 4 13.8 2 # Updated
4 5 25.0 2
5 6 30.0 1 # Inserted
6 7 16.5 1 # Inserted

Delta-rs processed exactly 4 records (2 updates + 2 inserts) while pandas processed all 7 records. This efficiency compounds dramatically with larger datasets.
Multi-Engine Integration
Different teams often use different tools: pandas for exploration, DuckDB for SQL queries, Polars for performance. Multi-engine support lets all these tools access the same data directly without creating duplicates or writing conversion scripts.
Traditional Approach: Engine-Specific Optimization Requirements
Each engine needs different file optimizations that don’t transfer between tools:
Start with the original dataset:
# Traditional approach – Each engine needs different optimizations
data = {"payment_type": [1, 1, 2, 1, 2], "fare_amount": [15.5, 20.0, 18.3, 12.5, 25.0]}
df = pd.DataFrame(data)

The Pandas team optimizes for indexed lookups:
# Pandas team needs indexed Parquet for fast lookups
df.to_parquet("data/pandas_optimized.parquet", index=True)
pandas_result = pd.read_parquet("data/pandas_optimized.parquet")
print(f"Pandas: {len(pandas_result)} trips, avg ${pandas_result['fare_amount'].mean():.2f}")

Output:
Pandas: 5 trips, avg $17.66

The Polars team needs sorted data for predicate pushdown optimization:
# Polars team needs sorted columns for predicate pushdown
df.sort_values('payment_type').to_parquet("data/polars_optimized.parquet")
polars_result = pl.read_parquet("data/polars_optimized.parquet").select([
    pl.len().alias("trips"), pl.col("fare_amount").mean().alias("avg_fare")
])
print(f"Polars: {polars_result}")

Output:
Polars: shape: (1, 2)
┌───────┬──────────┐
│ trips ┆ avg_fare │
│ --- ┆ --- │
│ u32 ┆ f64 │
╞═══════╪══════════╡
│ 5 ┆ 18.26 │
└───────┴──────────┘

The DuckDB team requires specific compression for query performance:
# DuckDB needs specific compression/statistics for query planning
df.to_parquet("data/duckdb_optimized.parquet", compression='zstd')
duckdb_result = duckdb.execute("""
SELECT COUNT(*) as trips, ROUND(AVG(fare_amount), 2) as avg_fare
FROM 'data/duckdb_optimized.parquet'
""").fetchone()
print(f"DuckDB: {duckdb_result[0]} trips, ${duckdb_result[1]} avg")

Output:
DuckDB: 5 trips, $18.26 avg

Delta-rs Approach: Universal Optimizations
Delta-rs provides built-in optimizations that benefit all engines simultaneously:
Create one optimized Delta table that serves all engines:
# Delta-rs approach – Universal optimizations for all engines
from deltalake import write_deltalake, DeltaTable
import polars as pl
import duckdb

# Create Delta table with built-in optimizations:
data = {"payment_type": [1, 1, 2, 1, 2], "fare_amount": [15.5, 20.0, 18.3, 12.5, 25.0]}
write_deltalake("data/universal_demo", pd.DataFrame(data))

Pandas benefits from Delta’s statistics for efficient filtering:
# Pandas gets automatic optimization benefits
dt = DeltaTable("data/universal_demo")
pandas_result = dt.to_pandas()
print(f"Pandas: {len(pandas_result)} trips, avg ${pandas_result['fare_amount'].mean():.2f}")

Output:
Pandas: 5 trips, avg $17.66

Polars leverages Delta’s column statistics for predicate pushdown:
# Polars gets predicate pushdown optimization automatically
polars_result = pl.read_delta("data/universal_demo").select([
    pl.len().alias("trips"),
    pl.col("fare_amount").mean().alias("avg_fare")
])
print(f"Polars: {polars_result}")

Output:
Polars: shape: (1, 2)
┌───────┬──────────┐
│ trips ┆ avg_fare │
│ --- ┆ --- │
│ u32 ┆ f64 │
╞═══════╪══════════╡
│ 5 ┆ 18.26 │
└───────┴──────────┘

DuckDB uses Delta’s statistics for query planning optimization:
# DuckDB gets optimized query plans from Delta statistics
duckdb_result = duckdb.execute("""
SELECT COUNT(*) as trips, ROUND(AVG(fare_amount), 2) as avg_fare
FROM delta_scan('data/universal_demo')
""").fetchone()
print(f"DuckDB: {duckdb_result[0]} trips, ${duckdb_result[1]} avg")

Output:
DuckDB: 5 trips, $17.66

One Delta table with universal optimizations benefiting all engines.
Automatic File Cleanup
Every data update creates new files while keeping old versions for time travel. Vacuum identifies files older than your retention period and safely deletes them, freeing storage space without affecting active data or recent history.
Traditional Approach: Manual Cleanup Scripts
Traditional workflows require custom scripts to manage file cleanup:
# Traditional approach – manual file management
import os
import glob
from datetime import datetime, timedelta

# Find old backup files manually
old_files = []
cutoff_date = datetime.now() - timedelta(days=7)
for file in glob.glob("data/taxi_backup_*.parquet"):
    file_time = datetime.fromtimestamp(os.path.getmtime(file))
    if file_time < cutoff_date:
        old_files.append(file)
        os.remove(file)  # Manual cleanup with risk

Delta-rs Approach: Built-in Vacuum Operation
Delta-rs provides safe, automated cleanup through its vacuum() operation, which removes unused transaction files while preserving data integrity. Files become unused when:
• UPDATE operations create new versions, leaving old data files unreferenced
• DELETE operations remove data, making those files obsolete
• Failed transactions leave temporary files that were never committed
• Table optimization consolidates small files, making originals unnecessary
# Delta-rs vacuum removes unused files safely with ACID protection
from deltalake import DeltaTable
import os

def get_size(path):
    """Calculate total directory size in MB"""
    total_size = 0
    for dirpath, dirnames, filenames in os.walk(path):
        for filename in filenames:
            total_size += os.path.getsize(os.path.join(dirpath, filename))
    return total_size / (1024 * 1024)

With our size calculation helper in place, let’s measure storage before and after vacuum:
dt = DeltaTable("data/taxi_delta_table")

# Measure storage before cleanup
before_size = get_size("data/taxi_delta_table")

# Safe cleanup – files only deleted if no active readers/writers
dt.vacuum(retention_hours=168) # Built-in safety: won't delete files in use

# Measure storage after cleanup
after_size = get_size("data/taxi_delta_table")

print(f"Delta vacuum completed safely")
print(f"Storage before: {before_size:.1f} MB")
print(f"Storage after: {after_size:.1f} MB")
print(f"Space reclaimed: {before_size - after_size:.1f} MB")

Output:
Delta vacuum completed safely
Storage before: 8.2 MB
Storage after: 5.7 MB
Space reclaimed: 2.5 MB

Delta vacuum removed 2.5 MB of obsolete file versions, reducing storage footprint by 30% while maintaining ACID transaction guarantees and time travel capabilities.
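Vacuum only reclaims space from files that the current table version no longer references. If frequent appends have left many small files, compacting the table first rewrites them into fewer large files and leaves the originals unreferenced, ready for cleanup. A minimal sketch, assuming the optimize API available in recent deltalake releases:
# Optional: compact small files before vacuuming (sketch, not part of the tutorial above)
from deltalake import DeltaTable

dt = DeltaTable("data/taxi_delta_table")

# Rewrite many small data files into fewer large ones;
# the old files stay on disk for time travel but become unreferenced
metrics = dt.optimize.compact()
print(metrics)

# Vacuum can then reclaim them once the retention window has passed.
# Recent releases default to a dry run; pass dry_run=False to actually delete.
dt.vacuum(retention_hours=168, dry_run=False)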
Conclusion
Delta-rs transforms the traditional pandas workflow by providing:

Incremental updates append only changed records without full rewrites
Time travel and versioning enable recovery and auditing without manual backups
Schema evolution handles column changes without breaking queries
Merge operations combine updates and inserts in single transactions
Multi-engine support lets pandas, DuckDB, and Polars access the same data
Automatic vacuum reclaims storage by removing obsolete file versions

The bridge from pandas prototyping to production data pipelines no longer requires complex infrastructure. Delta-rs provides the reliability and performance you need while maintaining the simplicity you want.
Related Tutorials

Alternative Scaling: Scaling Pandas Workflows with PySpark’s Pandas API for Spark-based approaches
Data Versioning: Version Control for Data and Models Using DVC for broader versioning strategies
DataFrame Performance: Polars vs. Pandas: A Fast, Multi-Core Alternative for DataFrame optimization techniques


Delta Lake: Transform pandas Prototypes into Production Read More »

Polars vs. Pandas: A Fast, Multi-Core Alternative for DataFrames

Table of Contents

Introduction
Why Consider Polars?
Setup
Multi-Core Performance

Timing Decorator
Plotting Setup
Reading CSV Files
Groupby Mean
Filter Rows
Sort by Column
Observations

Lazy Evaluation (Only in Polars)
Syntax Comparison

Filtering rows
Selecting columns
Chained operations

Memory Efficiency
Missing Features (Where Pandas Wins)
Summary

When to Use Polars vs. Pandas

Final Thoughts

Introduction
Pandas has been the standard tool for working with tabular data in Python for many years. But as datasets grow larger and performance becomes more important, Polars has emerged as an increasingly popular alternative: a faster DataFrame library that uses multiple CPU cores.
This guide compares Polars and Pandas, highlighting where Polars shines and showing how to get started with it.

💻 Get the Code: The complete source code and Jupyter notebook for this tutorial are available on GitHub. Clone it to follow along!

Key Takeaways
Here’s what you’ll learn:

Achieve 9× faster CSV reading with Polars’ multi-threaded architecture
Reduce memory usage by 30-50% using Apache Arrow backend optimization
Enable lazy evaluation for automatic query optimization and predicate pushdown
Implement seamless migration from Pandas with similar API patterns
Scale data processing without changing code using built-in parallel execution

Why Consider Polars?
Polars is a DataFrame library written in Rust, designed for high performance. Key advantages:

Multi-threaded by default
Lazy execution for optimization
Low memory usage
Strong support for Arrow and Parquet

Let’s explore how Polars improves upon Pandas with practical examples.

Setup
Install both libraries and generate a sample dataset for testing:
pip install pandas polars

To follow along with the examples, you can generate a large CSV file like this:
import pandas as pd
import numpy as np

# Create a large dataset
n_rows = 10_000_000
data = {
    "category": np.random.choice(["A", "B", "C", "D"], size=n_rows),
    "value": np.random.rand(n_rows) * 1000
}
df = pd.DataFrame(data)
df.to_csv("large_file.csv", index=False)  # write the benchmark file

This will create a large_file.csv with 10 million rows for performance testing.
Multi-Core Performance
Polars uses multiple CPU cores for data processing, while Pandas relies on a single core. Thus, Polars manages larger datasets more efficiently and often performs operations faster than Pandas.

Let’s benchmark a few common operations using Pandas and Polars: reading CSVs, groupby, filter, and sort.
Timing Decorator
We define a reusable timing decorator to simplify benchmarking in this section:
import time
from functools import wraps

def timeit(operation_name):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start_time = time.time()
            result = func(*args, **kwargs)
            end_time = time.time()
            execution_time = end_time - start_time
            print(f"{operation_name} took {execution_time:.2f} seconds")
            return result, execution_time
        return wrapper
    return decorator

Plotting Setup
Let’s define a helper function to simplify chart creation for each operation.
import matplotlib.pyplot as plt
import seaborn as sns

def create_comparison_plot(pandas_time, polars_time, title):
    sns.barplot(
        hue=["Pandas", "Polars"],
        y=[pandas_time, polars_time],
        palette=["#E583B6", "#72BEFA"],
    )
    plt.title(f"{title} (seconds)")
    plt.ylabel("Time (s)")
    plt.show()

Reading CSV Files
Pandas:
@timeit("Pandas read_csv")
def read_pandas():
    return pd.read_csv("large_file.csv")

pandas_df, pandas_read_time = read_pandas()

Output:
Pandas read_csv took 1.38 seconds

Polars:
import polars as pl

@timeit("Polars read_csv")
def read_polars():
    return pl.read_csv("large_file.csv")

polars_df, polars_read_time = read_polars()

Output:
Polars read_csv took 0.15 seconds

create_comparison_plot(pandas_read_time, polars_read_time, "CSV Read Time")

Groupby Mean
Pandas:
@timeit("Pandas groupby")
def pandas_groupby(df):
    return df.groupby("category")["value"].mean()

pandas_result, pandas_groupby_time = pandas_groupby(pandas_df)

Output:
Pandas groupby took 0.53 seconds

Polars:
@timeit("Polars groupby")
def polars_groupby(df):
    return df.group_by("category").agg(pl.col("value").mean())

polars_result, polars_groupby_time = polars_groupby(polars_df)

Output:
Polars groupby took 0.20 seconds

create_comparison_plot(pandas_groupby_time, polars_groupby_time, "Groupby Mean Time")

Filter Rows
Pandas:
@timeit("Pandas filter")
def pandas_filter(df):
    return df[df["value"] > 500]

pandas_filtered, pandas_filter_time = pandas_filter(pandas_df)

Output:
Pandas filter took 0.15 seconds

Polars:
@timeit("Polars filter")
def polars_filter(df):
    return df.filter(pl.col("value") > 500)

polars_filtered, polars_filter_time = polars_filter(polars_df)

Output:
Polars filter took 0.03 seconds

create_comparison_plot(pandas_filter_time, polars_filter_time, "Filter Rows Time")

Sort by Column
Pandas:
@timeit("Pandas sort")
def pandas_sort(df):
    return df.sort_values("value")

pandas_sorted, pandas_sort_time = pandas_sort(pandas_df)

Output:
Pandas sort took 2.74 seconds

Polars:
@timeit("Polars sort")
def polars_sort(df):
    return df.sort("value")

polars_sorted, polars_sort_time = polars_sort(polars_df)

Output:
Polars sort took 0.49 seconds

create_comparison_plot(pandas_sort_time, polars_sort_time, "Sort Time")

Observations
Polars consistently outperformed Pandas across all operations:

CSV Reading: 9× faster
Groupby: Over 2.5× faster
Filtering: 5× faster
Sorting: 5.5× faster

This performance advantage is largely due to Polars’ native Rust engine and multi-threaded execution. It’s especially valuable for large-scale data processing pipelines.
For SQL-based analytics that complement Polars’ performance advantages, see our DuckDB deep dive guide.
Lazy Evaluation (Only in Polars)
Pandas executes operations eagerly, which can be inefficient for large pipelines. For example:
(
    pd.read_csv("large_file.csv")
    .query('value > 100')
    .groupby('category', as_index=False)['value']
    .mean()
    .rename(columns={"value": "avg_value"})
)

This code is inefficient because:

Full CSV Load: Pandas loads the entire file into memory, even rows or columns we won’t use.
No Predicate Pushdown: The .query() filter is applied after reading all rows, not during the read.
No Projection Pushdown: All columns are read from disk, even if we only use value and category.

Polars supports lazy evaluation, which plans and optimizes the entire computation graph before execution:
query = (
    pl.read_csv("large_file.csv").lazy()
    .filter(pl.col("value") > 100)
    .group_by("category")
    .agg(pl.col("value").mean().alias("avg_value"))
)

print(query.explain())

Output:
FILTER [(col("avg_value")) > (100.0)]
FROM
AGGREGATE
[col("value").mean().alias("avg_value")] BY [col("category")]
FROM
DF ["category", "value"]; PROJECT["value", "category"] 2/2 COLUMNS

Optimizations used by Polars:

Lazy Execution: The computation is deferred and optimized before execution.
Predicate Pushdown: Filters like value > 100 are applied during the scan, skipping irrelevant rows.
Projection Pushdown: Only necessary columns (category and value) are read.
Single-Pass Execution: Operations can be executed in a streaming manner, improving performance.

The pipeline is only executed when .collect() is called:
result = query.collect()
result.head()

Output:
shape: (4, 2)
┌──────────┬────────────┐
│ category ┆ avg_value │
│ --- ┆ --- │
│ str ┆ f64 │
╞══════════╪════════════╡
│ A ┆ 550.245739 │
│ C ┆ 549.985808 │
│ D ┆ 549.649933 │
│ B ┆ 550.055504 │
└──────────┴────────────┘
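One caveat: pl.read_csv("large_file.csv").lazy() still loads the whole CSV eagerly before the lazy plan is built, so the pushdown optimizations cannot skip any disk reads. To let them act at scan time, Polars provides pl.scan_csv, which defers reading entirely until .collect(). A minimal sketch of the same pipeline using scan_csv, assuming the same file as above:
# Build the plan against a lazy CSV scan instead of an eager read
lazy_query = (
    pl.scan_csv("large_file.csv")
    .filter(pl.col("value") > 100)
    .group_by("category")
    .agg(pl.col("value").mean().alias("avg_value"))
)

# Nothing is read from disk until collect() runs the optimized plan
result = lazy_query.collect()
print(result.head())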

Syntax Comparison
Both libraries offer expressive APIs, but Polars’ syntax is more functional and chainable, which makes it easier to write pipelines with multiple operations in a single expression.
Filtering rows
Pandas:
df_pd.query("value > 1000")

Polars:
df_pl.filter(pl.col("value") > 1000)

Selecting columns
Pandas:
df_pd[["category", "value"]]

Polars:
df_pl.select(["category", "value"])

Polars feels more like writing SQL pipelines, especially in lazy mode.
Chained operations
Pandas:
(df_pd.query("value > 1000")
    .groupby("category")["value"]
    .mean()
    .reset_index())

Polars:
(df_pl
    .filter(pl.col("value") > 1000)
    .group_by("category")
    .agg(pl.col("value").mean().alias("avg_value")))

Polars lets you chain operations together, which often makes code more readable and efficient.
Memory Efficiency
Polars stores data in a compact, columnar memory layout built on Apache Arrow. This design avoids the Python object overhead that Pandas incurs.
df_pd.memory_usage(deep=True).sum() / 1e6 # MB
df_pl.estimated_size() / 1e6 # MB

Polars usually consumes less memory, especially with strings or categorical data.
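To see the gap for yourself, compare the same string-heavy table in both libraries. A minimal sketch (exact numbers will vary by library version and data):
# Compare memory footprints on a string-heavy dataset
import pandas as pd
import polars as pl

cities = ["New York", "Los Angeles", "Chicago", "Houston"] * 250_000
values = list(range(1_000_000))

df_pd = pd.DataFrame({"city": cities, "value": values})
df_pl = pl.DataFrame({"city": cities, "value": values})

print(f"Pandas: {df_pd.memory_usage(deep=True).sum() / 1e6:.1f} MB")
print(f"Polars: {df_pl.estimated_size() / 1e6:.1f} MB")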
Missing Features (Where Pandas Wins)
While Polars is fast, Pandas is still better at:

Interoperability with other Python libraries (e.g., matplotlib, scikit-learn)
Rich ecosystem (e.g., pandas-profiling, pandasql)
More mature and flexible time series tools

Summary
When to Use Polars vs. Pandas

| Scenario | Use Polars ✅ | Use Pandas ✅ |
| --- | --- | --- |
| Large datasets and file sizes | ✅ Yes | ⚠️ May struggle with memory |
| Fast CSV/Parquet I/O | ✅ Much faster | 🚫 Slower for large files |
| Multi-core performance | ✅ Built-in multi-threading | 🚫 Single-threaded by default |
| Lazy evaluation for optimized pipelines | ✅ Supported | 🚫 Not available |
| Low memory consumption | ✅ Apache Arrow backend | 🚫 Higher memory overhead |
| Integration with scikit-learn, statsmodels, etc. | 🚫 Limited | ✅ Excellent ecosystem integration |
| Flexible time series handling | 🚫 Basic support | ✅ Advanced and robust |
| Small-to-medium datasets | ⚠️ May be overkill | ✅ Simple and effective |

For scaling beyond single-machine processing, see our PySpark Pandas API guide.
Final Thoughts
Polars is not here to replace Pandas—but it’s an excellent choice when performance matters. If your workflows are slow or memory-bound, Polars can drastically reduce execution time without compromising expressiveness.
Tip: You can even mix both libraries. Load and process large files with Polars, then convert to Pandas when needed:
df_pd = df_pl.to_pandas()
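The conversion also works in the other direction: pl.from_pandas turns an existing Pandas DataFrame into a Polars one (it typically relies on PyArrow under the hood), so legacy data can feed a faster Polars pipeline:
df_pl = pl.from_pandas(df_pd)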

For modern Python development best practices that complement Polars’ performance focus, see our UV package manager guide.

Polars vs. Pandas: A Fast, Multi-Core Alternative for DataFrames Read More »

DuckDB: Simplify DataFrame Analysis with Serverless SQL

Using SQL with pandas empowers data scientists to leverage SQL’s powerful querying capabilities alongside the data manipulation functionalities of pandas.

However, traditional database systems often demand the management of a separate DBMS server, introducing additional complexity to the workflow.

With DuckDB, you can efficiently run SQL operations on pandas DataFrames without the need to manage a separate DBMS server.
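A minimal sketch of the idea (the DataFrame and query here are illustrative, not taken from the full article): DuckDB discovers pandas DataFrames in the local scope by name, so you can query them with plain SQL and get a DataFrame back.
import duckdb
import pandas as pd

# Illustrative DataFrame
trips = pd.DataFrame({
    "payment_type": [1, 1, 2, 1, 2],
    "fare_amount": [15.5, 20.0, 18.3, 12.5, 25.0],
})

# DuckDB's replacement scan finds the local `trips` DataFrame by name
result = duckdb.sql(
    "SELECT payment_type, AVG(fare_amount) AS avg_fare FROM trips GROUP BY payment_type"
).df()
print(result)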

DuckDB: Simplify DataFrame Analysis with Serverless SQL Read More »

Pandera: Data Validation Made Simple for Python DataFrames

Poor data quality can lead to incorrect conclusions and bad model performance. Thus, it is important to check data for consistency and reliability before using it.

pandera makes it easy to perform data validation on dataframe-like objects. If the dataframe does not pass validation checks, pandera provides useful error messages.
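A minimal sketch of what that looks like (the columns and checks are illustrative):
import pandas as pd
import pandera as pa

# Declare the expectations each column must satisfy
schema = pa.DataFrameSchema({
    "fare_amount": pa.Column(float, pa.Check.ge(0)),
    "passenger_count": pa.Column(int, pa.Check.in_range(1, 6)),
})

df = pd.DataFrame({"fare_amount": [12.5, 30.0], "passenger_count": [1, 2]})

# Raises a SchemaError with a descriptive message if any check fails
validated = schema.validate(df)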

Pandera: Data Validation Made Simple for Python DataFrames Read More »

DuckDB + PyArrow: 2900x Faster Than pandas for Large Dataset Processing

DuckDB optimizes query execution with multiple optimizations, while PyArrow efficiently manages in-memory data processing and storage. Combining DuckDB and PyArrow allows you to efficiently process datasets larger than memory on a single machine.

In the full tutorial's benchmark, we convert a Delta Lake table with over 6 million rows to a pandas DataFrame and a PyArrow dataset, which are then queried by DuckDB.

Running DuckDB on a PyArrow dataset is approximately 2906 times faster than running DuckDB on a pandas DataFrame.
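A minimal sketch of the pattern (paths and column names are illustrative): expose the files as a PyArrow dataset and let DuckDB scan it lazily instead of materializing a pandas DataFrame first.
import duckdb
import pyarrow.dataset as ds

# Scan the Parquet files lazily; nothing is loaded into memory yet
taxi_dataset = ds.dataset("data/yellow_tripdata/", format="parquet")

# DuckDB queries the dataset by name, reading only the columns
# and row groups the query actually needs
result = duckdb.sql(
    "SELECT COUNT(*) AS trips, AVG(fare_amount) AS avg_fare FROM taxi_dataset"
).df()
print(result)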

DuckDB + PyArrow: 2900x Faster Than pandas for Large Dataset Processing Read More »
