python Archives

The Hidden Cost of Python Dictionaries (And 3 Safer Alternatives)

4 Comments / Blog, Python Utilities / Khuyen Tran

Table of Contents

Introduction
What Are Typed Data Containers?
Using Dictionaries
Using NamedTuple
Using dataclass
Using Pydantic
Final Thoughts
Related Tutorials

Introduction
Imagine you’re processing customer records. The pipeline runs without errors, but customers never receive their welcome emails. After digging through the code, you discover the issue is a simple typo in a dictionary key.
def load_customer(row):
return {"customer_id": row[0], "name": row[1], "emial": row[2]} # Typo

def send_welcome_email(customer):
email = customer.get("email") # Returns None silently
if email:
print(f"Sending email to {email}")
# No email sent, no error raised

customer = load_customer(["C001", "Alice", "alice@example.com"])
send_welcome_email(customer) # Nothing happens

Since .get() returns None for a missing key, the bug stays hidden.
This is exactly the type of issue we want to catch earlier. In this article, we’ll look at how typed data containers like NamedTuple, dataclass, and Pydantic help surface these bugs at runtime.

Interactive Course: Master Python data containers with hands-on exercises in our interactive Python data containers course.

What Are Typed Data Containers?
Python offers several ways to structure data, each adding more safety than the last:

dict: No protection. Bugs surface only when you access a missing key.
NamedTuple: Basic safety. Catches typos at write time in your IDE and at runtime.
dataclass: Static analysis support. Tools like mypy catch errors before your code runs.
Pydantic: Full protection. Validates data the moment you create an instance.

Let’s see how each tool handles the same customer data:
Using Dictionaries
Dictionaries are quick to create but provide no safety:
customer = {
"customer_id": "C001",
"name": "Alice Smith",
"email": "alice@example.com",
"age": 28,
"is_premium": True,
}

print(customer["name"])

Alice Smith

Typo Bugs
A typo in the key name causes a KeyError at runtime:
customer["emial"] # Typo: should be "email"

KeyError: 'emial'

The error tells you what went wrong but not where. When dictionaries pass through multiple functions, finding the source of a typo can take significant debugging time:
def load_customer(row):
return {"customer_id": row[0], "name": row[1], "emial": row[2]} # Typo here

def validate_customer(customer):
return customer # Passes through unchanged

def send_email(customer):
return customer["email"] # KeyError raised here

customer = load_customer(["C001", "Alice", "alice@example.com"])
validated = validate_customer(customer)
send_email(validated) # Error points here, but bug is in load_customer

KeyError Traceback (most recent call last)
13 customer = load_customer(["C001", "Alice", "alice@example.com"])
14 validated = validate_customer(customer)
—> 15 send_email(validated) # Error points here, but bug is in load_customer

Cell In[6], line 10, in send_email(customer)
9 def send_email(customer):
—> 10 return customer["email"]

KeyError: 'email'

The stack trace shows where the KeyError was raised, not where "emial" was written. The bug and its symptom are 13 lines apart here, but in production code, they could be in different files entirely.
Using .get() makes it worse by returning None silently:
email = customer.get("email") # Returns None – key is "emial" not "email"
print(f"Sending email to: {email}")

Sending email to: None

This silent failure is dangerous: your notification system might skip thousands of customers, or worse, your code could write None to a database column, corrupting your data pipeline.
Type Confusion
Typos cause crashes, but wrong types can corrupt your data silently. Since dictionaries have no schema, nothing stops you from assigning the wrong type to a field:
customer = {
"customer_id": "C001",
"name": 123, # Should be a string
"age": "twenty-eight", # Should be an integer
}

total_age = customer["age"] + 5

TypeError: can only concatenate str (not "int") to str

The error message is misleading: it says “concatenate str” but the real problem is that age should never have been a string in the first place.
Using NamedTuple
NamedTuple is a lightweight way to define a fixed structure with named fields and type hints, like a dictionary with a schema:
from typing import NamedTuple

class Customer(NamedTuple):
customer_id: str
name: str
email: str
age: int
is_premium: bool

customer = Customer(
customer_id="C001",
name="Alice Smith",
email="alice@example.com",
age=28,
is_premium=True,
)

print(customer.name)

Alice Smith

IDE Autocomplete Catches Typos
Your IDE can’t autocomplete dictionary keys, so typing customer[" shows no suggestions. With NamedTuple, typing customer. displays all available fields: customer_id, name, email, age, is_premium.
Even if you skip autocomplete and type manually, typos are flagged instantly with squiggly lines:
customer.emial
~~~~~

Running the code will raise an error:
customer.emial

AttributeError: 'Customer' object has no attribute 'emial'

The error names the exact object and missing attribute, so you know immediately what to fix.
Immutability Prevents Accidental Changes
NamedTuples are immutable, meaning once created, their values cannot be changed:
customer.name = "Bob" # Raises an error

AttributeError: can't set attribute

This prevents bugs where data is accidentally modified during processing.
Limitations: No Runtime Type Validation
Type hints in NamedTuple are not enforced at runtime, so you can still pass in wrong types:
# Wrong types are accepted without error
customer = Customer(
customer_id="C001",
name=123, # Should be str, but int is accepted
email="alice@example.com",
age="twenty-eight", # Should be int, but str is accepted
is_premium=True,
)

print(f"Name: {customer.name}, Age: {customer.age}")

Name: 123, Age: twenty-eight

The code runs, but with incorrect data types. The bug surfaces later when you try to use the data.
Using dataclass
dataclass reduces the boilerplate of writing classes that mainly hold data. Instead of manually writing __init__ and other methods, you just declare your fields.
It provides the same IDE support as NamedTuple, plus three additional features:

Mutable objects: You can change field values after creation
Mutable defaults: Safe defaults for lists and dicts with field(default_factory=list)
Post-init logic: Run custom validation or compute derived fields with __post_init__

from dataclasses import dataclass

@dataclass
class Customer:
customer_id: str
name: str
email: str
age: int
is_premium: bool = False # Default value

customer = Customer(
customer_id="C001",
name="Alice Smith",
email="alice@example.com",
age=28,
)

print(f"{customer.name}, Premium: {customer.is_premium}")

Alice Smith, Premium: False

Mutability Allows Updates
Dataclass trades NamedTuple’s immutability protection for flexibility. You can modify fields after creation:
customer.name = "Alice Johnson" # Changed after marriage
customer.is_premium = True # Upgraded their account

print(f"{customer.name}, Premium: {customer.is_premium}")

Alice Johnson, Premium: True

For extra safety, use @dataclass(slots=True) to prevent accidentally adding new attributes:
@dataclass(slots=True)
class Customer:
customer_id: str
name: str
email: str
age: int
is_premium: bool = False

customer = Customer(
customer_id="C001",
name="Alice",
email="alice@example.com",
age=28,
)

customer.nmae = "Bob" # Typo

AttributeError: 'Customer' object has no attribute 'nmae'

Mutable Defaults with default_factory
Mutable defaults like lists don’t work as expected. You might think each instance gets its own empty list, but Python creates the default [] once and all instances share it:
from typing import NamedTuple

class Order(NamedTuple):
order_id: str
items: list = []

order1 = Order("001")
order2 = Order("002")

order1.items.append("apple")
print(f"Order 1: {order1.items}")
print(f"Order 2: {order2.items}") # Also has "apple"!

Order 1: ['apple']
Order 2: ['apple']

Order 2 has “apple” even though we only added it to Order 1. Modifying one order’s items affects every order.
Dataclass prevents this mistake by rejecting mutable defaults:
@dataclass
class Order:
items: list = []

ValueError: mutable default <class 'list'> for field items is not allowed: use default_factory

Dataclass offers field(default_factory=…) as the solution. The factory function runs at instance creation, not class definition, so each object gets its own list:
from dataclasses import dataclass, field

@dataclass
class Order:
order_id: str
items: list = field(default_factory=list) # Each instance gets its own list

order1 = Order("001")
order2 = Order("002")

order1.items.append("apple")
print(f"Order 1: {order1.items}")
print(f"Order 2: {order2.items}") # Not affected by order1

Order 1: ['apple']
Order 2: []

Unlike the NamedTuple example, Order 2 stays empty because it has its own list.
Post-Init Validation with __post_init__
Without validation, invalid data passes through silently:
@dataclass
class Customer:
customer_id: str
name: str
email: str
age: int
is_premium: bool = False

customer = Customer(
customer_id="C001",
name="", # Empty name
email="invalid",
age=-100,
)
print(f"Created: {customer}") # No error – bad data is in your system

Created: Customer(customer_id='C001', name='', email='invalid', age=-100, is_premium=False)

Dataclass provides __post_init__ to catch these issues at creation time so you can validate fields before the object is used:
@dataclass
class Customer:
customer_id: str
name: str
email: str
age: int
is_premium: bool = False

def __post_init__(self):
if self.age < 0:
raise ValueError(f"Age cannot be negative: {self.age}")
if "@" not in self.email:
raise ValueError(f"Invalid email: {self.email}")

customer = Customer(
customer_id="C001",
name="Alice",
email="invalid-email",
age=28,
)

ValueError: Invalid email: invalid-email

The error message tells you exactly what’s wrong, making the bug easy to fix.
Limitations: Manual Validation Only
__post_init__ requires you to write every validation rule yourself. If you forget to check a field, bad data can still slip through.
In this example, __post_init__ only validates email format, so wrong types for name and age pass undetected:
@dataclass
class Customer:
customer_id: str
name: str
email: str
age: int
is_premium: bool = False

def __post_init__(self):
if "@" not in self.email:
raise ValueError(f"Invalid email: {self.email}")

customer = Customer(
customer_id="C001",
name=123, # No validation for name type
email="alice@example.com",
age="twenty-eight", # No validation for age type
)

print(f"Name: {customer.name}, Age: {customer.age}")

Name: 123, Age: twenty-eight

Type hints alone don’t enforce types at runtime. For automatic validation, you need a library that actually checks types when objects are created.

📚 For comprehensive coverage of dataclasses and Pydantic in production workflows, check out Production-Ready Data Science.

Using Pydantic
Pydantic is a data validation library that enforces type hints at runtime. Unlike NamedTuple and dataclass, it actually checks that values match their declared types when objects are created. Install it with:
pip install pydantic

To create a Pydantic model, inherit from BaseModel and declare your fields with type hints:
from pydantic import BaseModel

class Customer(BaseModel):
customer_id: str
name: str
email: str
age: int
is_premium: bool = False

customer = Customer(
customer_id="C001",
name="Alice Smith",
email="alice@example.com",
age=28,
)

print(f"{customer.name}, Age: {customer.age}")

Alice Smith, Age: 28

For using Pydantic to enforce structured outputs from AI models, see our PydanticAI tutorial.
Runtime Validation
Remember how dataclass accepted name=123 without complaint? Pydantic catches this automatically with a ValidationError:
from pydantic import BaseModel, ValidationError

class Customer(BaseModel):
customer_id: str
name: str
email: str
age: int
is_premium: bool = False

try:
customer = Customer(
customer_id="C001",
name=123,
email="alice@example.com",
age="thirty",
)
except ValidationError as e:
print(e)

2 validation errors for Customer
name
Input should be a valid string [type=string_type, input_value=123, input_type=int]
age
Input should be a valid integer, unable to parse string as an integer [type=int_parsing, input_value='thirty', input_type=str]

The error message shows:

Which fields failed validation (name, age)
What was expected (valid string, valid integer)
What was received (123 as int, 'thirty' as str)

This tells you everything you need to fix the bug in one place, instead of digging through stack traces.
Type Coercion
Unlike dataclass which stores whatever you pass, Pydantic automatically converts compatible types to match your type hints:
customer = Customer(
customer_id="C001",
name="Alice Smith",
email="alice@example.com",
age="28", # String "28" is converted to int 28
is_premium="true", # String "true" is converted to bool True
)

print(f"Age: {customer.age} (type: {type(customer.age).__name__})")
print(f"Premium: {customer.is_premium} (type: {type(customer.is_premium).__name__})")

Age: 28 (type: int)
Premium: True (type: bool)

This is useful when reading data from CSV files or APIs where everything comes as strings.
Constraint Validation
Beyond types, you often need business rules: age must be positive, names can’t be empty, customer IDs must follow a pattern.
In dataclass, you define fields in one place and validate them in __post_init__. The validation logic grows with each constraint:
@dataclass
class Customer:
customer_id: str
name: str
email: str
age: int
is_premium: bool = False

def __post_init__(self):
if not self.customer_id:
raise ValueError("Customer ID cannot be empty")
if not self.name or len(self.name) < 1:
raise ValueError("Name cannot be empty")
if "@" not in self.email:
raise ValueError(f"Invalid email: {self.email}")
if self.age < 0 or self.age > 150:
raise ValueError(f"Age must be between 0 and 150: {self.age}")

Pydantic puts constraints directly in Field(), keeping rules next to the data they validate:
from pydantic import BaseModel, Field, ValidationError

class Customer(BaseModel):
customer_id: str
name: str = Field(min_length=1)
email: str
age: int = Field(ge=0, le=150) # Age must be between 0 and 150
is_premium: bool = False

try:
customer = Customer(
customer_id="C001",
name="", # Empty name
email="alice@example.com",
age=-5, # Negative age
)
except ValidationError as e:
print(e)

2 validation errors for Customer
name
String should have at least 1 character [type=string_too_short, input_value='', input_type=str]
age
Input should be greater than or equal to 0 [type=greater_than_equal, input_value=-5, input_type=int]

Nested Validation
Data structures are rarely flat. A customer has an address, an order contains items. When something is wrong inside a nested object, you need to know exactly where.
Pydantic validates each level and reports the full path to any error:
from pydantic import BaseModel, Field, ValidationError

class Address(BaseModel):
street: str
city: str
zip_code: str = Field(pattern=r"^\d{5}$") # Must be 5 digits

class Customer(BaseModel):
customer_id: str
name: str
address: Address

try:
customer = Customer(
customer_id="C001",
name="Alice Smith",
address={
"street": "123 Main St",
"city": "New York",
"zip_code": "invalid", # Invalid zip code
},
)
except ValidationError as e:
print(e)

1 validation error for Customer
address.zip_code
String should match pattern '^\d{5}$' [type=string_pattern_mismatch, input_value='invalid', input_type=str]

The error message shows address.zip_code, pinpointing the exact location in the nested structure.
For extracting structured data from documents using Pydantic, see our LlamaIndex data extraction guide.
Final Thoughts
To summarize what each tool provides:

dict: Quick to create. No structure or validation.
NamedTuple: Fixed structure with IDE autocomplete. Immutable.
dataclass: Mutable fields, safe defaults, custom logic via __post_init__.
Pydantic: Runtime type enforcement, automatic type coercion, built-in constraints.

Personally, I use dict for quick prototyping:
stats = {"rmse": 0.234, "mae": 0.189, "r2": 0.91}

Then Pydantic when the code moves to production. For example, a training config should reject invalid values like negative learning rates:
from pydantic import BaseModel, Field

class TrainingConfig(BaseModel):
epochs: int = Field(ge=1)
batch_size: int = Field(ge=1)
learning_rate: float = Field(gt=0)

config = TrainingConfig(epochs=10, batch_size=32, learning_rate=0.001)

Pick the level of protection that matches your needs. A notebook experiment doesn’t need Pydantic, but a production API does.
Related Tutorials

SQLModel vs psycopg2: Combine Pydantic-style validation with database integration
Pytest for Data Scientists: Test your data containers and processing pipelines
Hydra for Python Configuration: Manage validated configuration with YAML-based pipelines

📚 Want to go deeper? Learning new techniques is the easy part. Knowing how to structure, test, and deploy them is what separates side projects from real work. My book shows you how to build data science projects that actually make it to production. Get the book →
💻 Get the Code: The complete source code and Jupyter notebook for this tutorial are available on GitHub. Clone it to follow along!

Stay Current with CodeCut
Actionable Python tips, curated for busy data pros. Skim in under 2 minutes, three times a week.

.codecut-subscribe-form .codecut-input {
background: #2F2D2E !important;
border: 1px solid #72BEFA !important;
color: #FFFFFF !important;
}
.codecut-subscribe-form .codecut-input::placeholder {
color: #999999 !important;
}
.codecut-subscribe-form .codecut-subscribe-btn {
background: #72BEFA !important;
color: #2F2D2E !important;
}
.codecut-subscribe-form .codecut-subscribe-btn:hover {
background: #5aa8e8 !important;
}

.codecut-subscribe-form {
max-width: 650px;
display: flex;
flex-direction: column;
gap: 8px;
}

.codecut-input {
-webkit-appearance: none;
-moz-appearance: none;
appearance: none;
background: #FFFFFF;
border-radius: 8px !important;
padding: 8px 12px;
font-family: ‘Comfortaa’, sans-serif !important;
font-size: 14px !important;
color: #333333;
border: none !important;
outline: none;
width: 100%;
box-sizing: border-box;
}

input[type=”email”].codecut-input {
border-radius: 8px !important;
}

.codecut-input::placeholder {
color: #666666;
}

.codecut-email-row {
display: flex;
align-items: stretch;
height: 36px;
gap: 8px;
}

.codecut-email-row .codecut-input {
flex: 1;
}

.codecut-subscribe-btn {
background: #72BEFA;
color: #2F2D2E;
border: none;
border-radius: 8px;
padding: 8px 14px;
font-family: ‘Comfortaa’, sans-serif;
font-size: 14px;
font-weight: 500;
cursor: pointer;
text-decoration: none;
display: flex;
align-items: center;
justify-content: center;
transition: background 0.3s ease;
}

.codecut-subscribe-btn:hover {
background: #5aa8e8;
}

.codecut-subscribe-btn:disabled {
background: #999;
cursor: not-allowed;
}

.codecut-message {
font-family: ‘Comfortaa’, sans-serif;
font-size: 12px;
padding: 8px;
border-radius: 6px;
display: none;
}

.codecut-message.success {
background: #d4edda;
color: #155724;
display: block;
}

/* Mobile responsive */
@media (max-width: 480px) {
.codecut-email-row {
flex-direction: column;
height: auto;
gap: 8px;
}

.codecut-input {
border-radius: 8px;
height: 36px;
}

.codecut-subscribe-btn {
width: 100%;
text-align: center;
border-radius: 8px;
height: 36px;
}
}

The Hidden Cost of Python Dictionaries (And 3 Safer Alternatives) Read More »

Newsletter #256: Build Scalable Pipelines with DuckDB Memory Spilling

Leave a Comment / Newsletter Archive / Khuyen Tran

📅 Today’s Picks

Marimo: Keep All Notebook Cells in Sync Without Manual Reruns

Problem
In Jupyter notebooks, changing an input value doesn’t automatically update dependent cells.
Forget to rerun one cell, and you might make decisions based on outdated results without realizing anything is wrong.
Solution
Marimo automatically detects changes and re-executes all dependent cells.
When you change a variable like threshold from 50 to 30, every downstream cell that uses it updates immediately.

📖 Learn more

🧪 Run code

⭐ View GitHub

Build Scalable Pipelines with DuckDB Memory Spilling

Problem
When datasets exceed available RAM, most tools crash mid-operation.
This forces manual data chunking or expensive hardware upgrades just to complete basic queries.
Solution
DuckDB automatically spills intermediate results to temporary files when data exceeds configured memory limits.
Key benefits:

Process datasets larger than RAM without code changes
Configure memory limits to prevent system crashes
Automatic disk spillover when memory fills
No manual chunking or batching required

📖 View Full Article

🧪 Run code

⭐ View GitHub

📢 ANNOUNCEMENTS

Cyber Monday: 30% Off Production-Ready Data Science
My book Production-Ready Data Science is on sale for Cyber Monday.
Get 58% off the ebook or 10% off the paperback through December 8th.
The book covers everything I’ve learned about taking data science from prototype to production: dependency management, testing, CI/CD, and workflow automation.

→ Get 58% Off Now

☕️ Weekly Finds

Nano-PDF
[LLM]
– Natural language PDF editing using Gemini with multi-page parallel processing

Codon
[Python Utils]
– High-performance Python compiler that generates native machine code for 10-100x speedups

lm-evaluation-harness
[LLM]
– Unified framework for few-shot evaluation of language models across 200+ tasks

Looking for a specific tool? Explore 70+ Python tools →

Stay Current with CodeCut

Actionable Python tips, curated for busy data pros. Skim in under 2 minutes, three times a week.

.codecut-subscribe-form {
max-width: 650px;
display: flex;
flex-direction: column;
gap: 8px;
}
.codecut-input {
-webkit-appearance: none;
-moz-appearance: none;
appearance: none;
background: #FFFFFF;
border-radius: 8px !important;
padding: 8px 12px;
font-family: ‘Comfortaa’, sans-serif !important;
font-size: 14px !important;
color: #333333;
border: none !important;
outline: none;
width: 100%;
box-sizing: border-box;
}
input[type=”email”].codecut-input {
border-radius: 8px !important;
}
.codecut-input::placeholder {
color: #666666;
}
.codecut-email-row {
display: flex;
align-items: stretch;
height: 36px;
gap: 8px;
}
.codecut-email-row .codecut-input {
flex: 1;
}
.codecut-subscribe-btn {
background: #72BEFA;
color: #2F2D2E;
border: none;
border-radius: 8px;
padding: 8px 14px;
font-family: ‘Comfortaa’, sans-serif;
font-size: 14px;
font-weight: 500;
cursor: pointer;
text-decoration: none;
display: flex;
align-items: center;
justify-content: center;
transition: background 0.3s ease;
}
.codecut-subscribe-btn:hover {
background: #5aa8e8;
}
.codecut-subscribe-btn:disabled {
background: #999;
cursor: not-allowed;
}
.codecut-message {
font-family: ‘Comfortaa’, sans-serif;
font-size: 12px;
padding: 8px;
border-radius: 6px;
display: none;
}
.codecut-message.success {
background: #d4edda;
color: #155724;
display: block;
}
@media (max-width: 480px) {
.codecut-email-row {
flex-direction: column;
height: auto;
gap: 8px;
}
.codecut-input {
border-radius: 8px;
height: 36px;
}
.codecut-subscribe-btn {
width: 100%;
text-align: center;
border-radius: 8px;
height: 36px;
}
}

Newsletter #256: Build Scalable Pipelines with DuckDB Memory Spilling Read More »

pandas vs Polars vs DuckDB: A Data Scientist’s Guide to Choosing the Right Tool

4 Comments / Blog, DataFrame, Python Utilities / Khuyen Tran

Table of Contents

Introduction
Tool Strengths at a Glance
Setup
Syntax Comparison
Data Loading Performance
Query Optimization
GroupBy Performance
Memory Efficiency
Join Operations
Interoperability
Decision Matrix
Final Thoughts

Introduction
pandas has been the standard tool for working with tabular data in Python for over a decade. But as datasets grow larger and performance requirements increase, two modern alternatives have emerged: Polars, a DataFrame library written in Rust, and DuckDB, an embedded SQL database optimized for analytics.
Each tool excels in different scenarios:

Tool
Backend
Execution Model
Best For

pandas
C/Python
Eager, single-threaded
Small datasets, prototyping, ML integration

Polars
Rust
Lazy/Eager, multi-threaded
Large-scale analytics, data pipelines

DuckDB
C++
SQL-first, multi-threaded
SQL workflows, embedded analytics, file queries

This guide compares all three tools with practical examples, helping you choose the right one for your workflow.

💻 Get the Code: The complete source code and Jupyter notebook for this tutorial are available on GitHub. Clone it to follow along!

Stay Current with CodeCut
Actionable Python tips, curated for busy data pros. Skim in under 2 minutes, three times a week.

.codecut-subscribe-form {
max-width: 650px;
display: flex;
flex-direction: column;
gap: 8px;
}

input[type=”email”].codecut-input {
border-radius: 8px !important;
}

.codecut-input::placeholder {
color: #666666;
}

.codecut-email-row {
display: flex;
align-items: stretch;
height: 36px;
gap: 8px;
}

.codecut-email-row .codecut-input {
flex: 1;
}

.codecut-subscribe-btn:hover {
background: #5aa8e8;
}

.codecut-subscribe-btn:disabled {
background: #999;
cursor: not-allowed;
}

.codecut-message {
font-family: ‘Comfortaa’, sans-serif;
font-size: 12px;
padding: 8px;
border-radius: 6px;
display: none;
}

.codecut-message.success {
background: #d4edda;
color: #155724;
display: block;
}

/* Mobile responsive */
@media (max-width: 480px) {
.codecut-email-row {
flex-direction: column;
height: auto;
gap: 8px;
}

.codecut-input {
border-radius: 8px;
height: 36px;
}

.codecut-subscribe-btn {
width: 100%;
text-align: center;
border-radius: 8px;
height: 36px;
}
}

Tool Strengths at a Glance
pandas
pandas is the original DataFrame library for Python that excels at interactive data exploration and integrates seamlessly with the ML ecosystem. Key capabilities include:

Direct compatibility with scikit-learn, statsmodels, and visualization libraries
Rich ecosystem of extensions (pandas-profiling, pandasql, etc.)
Mature time series functionality
Familiar syntax that most data scientists already know

Polars
Polars is a Rust-powered DataFrame library designed for speed that brings multi-threaded execution and query optimization to Python. Key capabilities include:

Speeds up operations by using all available CPU cores by default
Builds a query plan first, then executes only what’s needed
Streaming mode for processing datasets larger than RAM
Expressive method chaining with a pandas-like API

DuckDB
DuckDB is an embedded SQL database optimized for analytics that brings database-level query optimization to local files. Key capabilities include:

Native SQL syntax with full analytical query support
Queries CSV, Parquet, and JSON files directly without loading
Uses disk storage automatically when data exceeds available memory
Zero-configuration embedded database requiring no server setup

Setup
Install all three libraries:
pip install pandas polars duckdb

Generate sample data for benchmarking:
import pandas as pd
import numpy as np

np.random.seed(42)
n_rows = 5_000_000

data = {
"category": np.random.choice(["Electronics", "Clothing", "Food", "Books"], size=n_rows),
"region": np.random.choice(["North", "South", "East", "West"], size=n_rows),
"amount": np.random.rand(n_rows) * 1000,
"quantity": np.random.randint(1, 100, size=n_rows),
}

df_pandas = pd.DataFrame(data)
df_pandas.to_csv("sales_data.csv", index=False)
print(f"Created sales_data.csv with {n_rows:,} rows")

Created sales_data.csv with 5,000,000 rows

Syntax Comparison
All three tools can perform the same operations with different syntax. Here’s a side-by-side comparison of common tasks.
Filtering Rows
pandas:
Uses bracket notation with boolean conditions, which requires repeating the DataFrame name for each condition:
import pandas as pd

df_pd = pd.read_csv("sales_data.csv")
result_bracket = df_pd[(df_pd["amount"] > 500) & (df_pd["category"] == "Electronics")]
result_bracket.head()

category
region
amount
quantity

7
Electronics
West
662.803066
80

15
Electronics
North
826.004963
25

30
Electronics
North
766.081832
7

31
Electronics
West
772.084261
36

37
Electronics
East
527.967145
35

Alternatively, you can use the query() method, which provides cleaner SQL-like syntax:
result_query = df_pd.query("amount > 500 and category == 'Electronics'")

However, since query() is string-based, there’s no IDE autocomplete. Complex operations like string methods still require brackets:
result_str = df_pd[df_pd["category"].str.startswith("Elec")]
result_str.head()

category
region
amount
quantity

2
Electronics
North
450.941022
93

6
Electronics
West
475.843957
61

7
Electronics
West
662.803066
80

15
Electronics
North
826.004963
25

21
Electronics
South
292.399383
13

Polars:
Unlike pandas, Polars uses one syntax for all filters. The pl.col() expressions are type-safe with IDE autocomplete, and handle both simple comparisons and complex operations like string methods:
import polars as pl

df_pl = pl.read_csv("sales_data.csv")
result_pl = df_pl.filter(
(pl.col("amount") > 500) & (pl.col("category") == "Electronics")
)
result_pl.head()

category
region
amount
quantity

str
str
f64
i64

“Electronics”
“West”
662.803066
80

“Electronics”
“North”
826.004963
25

“Electronics”
“North”
766.081832
7

“Electronics”
“West”
772.084261
36

“Electronics”
“East”
527.967145
35

DuckDB:
Uses standard SQL with a WHERE clause, which is more readable by those who know SQL.
import duckdb

result_duckdb = duckdb.sql("""
SELECT * FROM 'sales_data.csv'
WHERE amount > 500 AND category = 'Electronics'
""").df()
result_duckdb.head()

category
region
amount
quantity

0
Electronics
West
662.803066
80

1
Electronics
North
826.004963
25

2
Electronics
North
766.081832
7

3
Electronics
West
772.084261
36

4
Electronics
East
527.967145
35

Selecting Columns
pandas:
Double brackets return a DataFrame with selected columns.
result_pd = df_pd[["category", "amount"]]
result_pd.head()

category
amount

0
Food
516.653322

1
Books
937.337226

2
Electronics
450.941022

3
Food
674.488081

4
Food
188.847906

Polars:
The select() method clearly communicates column selection intent.
result_pl = df_pl.select(["category", "amount"])
result_pl.head()

category
amount

str
f64

“Food”
516.653322

“Books”
937.337226

“Electronics”
450.941022

“Food”
674.488081

“Food”
188.847906

DuckDB:
SQL’s SELECT clause makes column selection intuitive for SQL users.
result_duckdb = duckdb.sql("""
SELECT category, amount FROM 'sales_data.csv'
""").df()
result_duckdb.head()

category
amount

0
Food
516.653322

1
Books
937.337226

2
Electronics
450.941022

3
Food
674.488081

4
Food
188.847906

GroupBy Aggregation
pandas:
Uses a dictionary to specify aggregations, but returns multi-level column headers that often require flattening before further use.
result_pd = df_pd.groupby("category").agg({
"amount": ["sum", "mean"],
"quantity": "sum"
})
result_pd.head()

amount

quantity

sum
mean
sum

Books
6.247506e+08
499.998897
62463285

Clothing
6.253924e+08
500.139837
62505224

Electronics
6.244453e+08
499.938189
62484265

Food
6.254034e+08
499.916417
62577943

Polars:
Uses explicit alias() calls for each aggregation, producing flat column names directly without post-processing.
result_pl = df_pl.group_by("category").agg([
pl.col("amount").sum().alias("amount_sum"),
pl.col("amount").mean().alias("amount_mean"),
pl.col("quantity").sum().alias("quantity_sum"),
])
result_pl.head()

category
amount_sum
amount_mean
quantity_sum

str
f64
f64
i64

“Clothing”
6.2539e8
500.139837
62505224

“Books”
6.2475e8
499.998897
62463285

“Electronics”
6.2445e8
499.938189
62484265

“Food”
6.2540e8
499.916417
62577943

DuckDB:
Standard SQL aggregation with column aliases produces clean, flat output ready for downstream use.
result_duckdb = duckdb.sql("""
SELECT
category,
SUM(amount) as amount_sum,
AVG(amount) as amount_mean,
SUM(quantity) as quantity_sum
FROM 'sales_data.csv'
GROUP BY category
""").df()
result_duckdb.head()

category
amount_sum
amount_mean
quantity_sum

0
Food
6.254034e+08
499.916417
62577943.0

1
Electronics
6.244453e+08
499.938189
62484265.0

2
Clothing
6.253924e+08
500.139837
62505224.0

3
Books
6.247506e+08
499.998897
62463285.0

Adding Columns
pandas:
The assign() method creates new columns with repeated DataFrame references like df_pd["amount"].
result_pd = df_pd.assign(
amount_with_tax=df_pd["amount"] * 1.1,
high_value=df_pd["amount"] > 500
)
result_pd.head()

category
region
amount
quantity
amount_with_tax
high_value

0
Food
South
516.653322
40
568.318654
True

1
Books
East
937.337226
45
1031.070948
True

2
Electronics
North
450.941022
93
496.035124
False

3
Food
East
674.488081
46
741.936889
True

4
Food
East
188.847906
98
207.732697
False

Polars:
The with_columns() method uses composable expressions that chain naturally without repeating the DataFrame name.
result_pl = df_pl.with_columns([
(pl.col("amount") * 1.1).alias("amount_with_tax"),
(pl.col("amount") > 500).alias("high_value")
])
result_pl.head()

category
region
amount
quantity
amount_with_tax
high_value

str
str
f64
i64
f64
bool

“Food”
“South”
516.653322
40
568.318654
true

“Books”
“East”
937.337226
45
1031.070948
true

“Electronics”
“North”
450.941022
93
496.035124
false

“Food”
“East”
674.488081
46
741.936889
true

“Food”
“East”
188.847906
98
207.732697
false

DuckDB:
SQL’s SELECT clause defines new columns directly in the query, keeping transformations readable.
result_duckdb = duckdb.sql("""
SELECT *,
amount * 1.1 as amount_with_tax,
amount > 500 as high_value
FROM df_pd
""").df()
result_duckdb.head()

category
region
amount
quantity
amount_with_tax
high_value

0
Food
South
516.653322
40
568.318654
True

1
Books
East
937.337226
45
1031.070948
True

2
Electronics
North
450.941022
93
496.035124
False

3
Food
East
674.488081
46
741.936889
True

4
Food
East
188.847906
98
207.732697
False

Conditional Logic
pandas:
Each additional condition in np.where() adds another nesting level. With three tiers, the final value is buried two levels deep:
import numpy as np

# Hard to read: "low" is nested inside two np.where() calls
result_np = df_pd.assign(
value_tier=np.where(
df_pd["amount"] > 700, "high",
np.where(df_pd["amount"] > 300, "medium", "low")
)
)
result_np[["category", "amount", "value_tier"]].head()

category
amount
value_tier

0
Food
516.653322
medium

1
Books
937.337226
high

2
Electronics
450.941022
medium

3
Food
674.488081
medium

4
Food
188.847906
low

For numeric binning, pd.cut() is cleaner:
result_pd = df_pd.assign(
value_tier=pd.cut(
df_pd["amount"],
bins=[-np.inf, 300, 700, np.inf],
labels=["low", "medium", "high"]
)
)
result_pd[["category", "amount", "value_tier"]].head()

category
amount
value_tier

0
Food
516.653322
medium

1
Books
937.337226
high

2
Electronics
450.941022
medium

3
Food
674.488081
medium

4
Food
188.847906
low

However, pd.cut() has drawbacks:

Only works for numeric ranges
Requires thinking in boundaries ([-inf, 300, 700, inf]) instead of conditions (amount > 700)
Needs numpy for open-ended bins

For non-numeric or mixed conditions, you’re back to np.where():
# "premium" if Electronics AND amount > 500 – pd.cut() can't do this
result = df_pd.assign(
tier=np.where(
(df_pd["category"] == "Electronics") & (df_pd["amount"] > 500),
"premium", "standard"
)
)
result.head()

category
region
amount
quantity
tier

0
Food
South
516.653322
40
standard

1
Books
East
937.337226
45
standard

2
Electronics
North
450.941022
93
standard

3
Food
East
674.488081
46
standard

4
Food
East
188.847906
98
standard

Polars:
The when().then().otherwise() chain solves both pandas problems: no nesting like np.where(), and works for any condition (not just numeric ranges like pd.cut()). The same syntax handles simple binning and complex mixed conditions:
result_pl = df_pl.with_columns(
pl.when(pl.col("amount") > 700).then(pl.lit("high"))
.when(pl.col("amount") > 300).then(pl.lit("medium"))
.otherwise(pl.lit("low"))
.alias("value_tier")
)
result_pl.select(["category", "amount", "value_tier"]).head()

category
amount
value_tier

str
f64
str

“Food”
516.653322
“medium”

“Books”
937.337226
“high”

“Electronics”
450.941022
“medium”

“Food”
674.488081
“medium”

“Food”
188.847906
“low”

DuckDB:
Standard SQL CASE WHEN syntax is more readable by those who know SQL.
result_duckdb = duckdb.sql("""
SELECT category, amount,
CASE
WHEN amount > 700 THEN 'high'
WHEN amount > 300 THEN 'medium'
ELSE 'low'
END as value_tier
FROM df_pd
""").df()
result_duckdb.head()

category
amount
value_tier

0
Food
516.653322
medium

1
Books
937.337226
high

2
Electronics
450.941022
medium

3
Food
674.488081
medium

4
Food
188.847906
low

Window Functions
pandas:
Uses groupby().transform() which requires repeating the groupby clause for each calculation.
result_pd = df_pd.assign(
category_avg=df_pd.groupby("category")["amount"].transform("mean"),
category_rank=df_pd.groupby("category")["amount"].rank(ascending=False)
)
result_pd[["category", "amount", "category_avg", "category_rank"]].head()

category
amount
category_avg
category_rank

0
Food
516.653322
499.916417
604342.0

1
Books
937.337226
499.998897
78423.0

2
Electronics
450.941022
499.938189
685881.0

3
Food
674.488081
499.916417
407088.0

4
Food
188.847906
499.916417
1015211.0

Polars:
The over() expression appends the partition to any calculation, avoiding repeated group definitions.
result_pl = df_pl.with_columns([
pl.col("amount").mean().over("category").alias("category_avg"),
pl.col("amount").rank(descending=True).over("category").alias("category_rank")
])
result_pl.select(["category", "amount", "category_avg", "category_rank"]).head()

category
amount
category_avg
category_rank

str
f64
f64
f64

“Food”
516.653322
499.916417
604342.0

“Books”
937.337226
499.998897
78423.0

“Electronics”
450.941022
499.938189
685881.0

“Food”
674.488081
499.916417
407088.0

“Food”
188.847906
499.916417
1015211.0

DuckDB:
SQL window functions with OVER (PARTITION BY …) are the industry standard for this type of calculation.
result_duckdb = duckdb.sql("""
SELECT category, amount,
AVG(amount) OVER (PARTITION BY category) as category_avg,
RANK() OVER (PARTITION BY category ORDER BY amount DESC) as category_rank
FROM df_pd
""").df()
result_duckdb.head()

category
amount
category_avg
category_rank

0
Clothing
513.807166
500.139837
608257

1
Clothing
513.806596
500.139837
608258

2
Clothing
513.806515
500.139837
608259

3
Clothing
513.806063
500.139837
608260

4
Clothing
513.806056
500.139837
608261

Data Loading Performance
pandas reads CSV files on a single CPU core. Polars and DuckDB use multi-threaded execution, distributing the work across all available cores to read different parts of the file simultaneously.
pandas
Single-threaded CSV parsing loads data sequentially.
┌─────────────────────────────────────────────┐
│ CPU Core 1 │
│ ┌─────────────────────────────────────────┐ │
│ │ Chunk 1 → Chunk 2 → Chunk 3 → … → End │ │
│ └─────────────────────────────────────────┘ │
│ CPU Core 2 [idle] │
│ CPU Core 3 [idle] │
│ CPU Core 4 [idle] │
└─────────────────────────────────────────────┘

pandas_time = %timeit -o pd.read_csv("sales_data.csv")

1.05 s ± 26.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Polars
Multi-threaded parsing distributes file reading across all available cores.
┌─────────────────────────────────────────────┐
│ CPU Core 1 ┌────────────────┐ │
│ │ ████████████ │ │
│ CPU Core 2 ┌────────────────┐ │
│ │ ████████████ │ │
│ CPU Core 3 ┌────────────────┐ │
│ │ ████████████ │ │
│ CPU Core 4 ┌────────────────┐ │
│ │ ████████████ │ │
└─────────────────────────────────────────────┘

polars_time = %timeit -o pl.read_csv("sales_data.csv")

137 ms ± 34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

DuckDB
Similar to Polars, file reading is distributed across all available cores.
┌─────────────────────────────────────────────┐
│ CPU Core 1 ┌────────────────┐ │
│ │ ████████████ │ │
│ CPU Core 2 ┌────────────────┐ │
│ │ ████████████ │ │
│ CPU Core 3 ┌────────────────┐ │
│ │ ████████████ │ │
│ CPU Core 4 ┌────────────────┐ │
│ │ ████████████ │ │
└─────────────────────────────────────────────┘

duckdb_time = %timeit -o duckdb.sql("SELECT * FROM 'sales_data.csv'").df()

762 ms ± 77.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

print(f"Polars is {pandas_time.average / polars_time.average:.1f}× faster than pandas")
print(f"DuckDB is {pandas_time.average / duckdb_time.average:.1f}× faster than pandas")

Polars is 7.7× faster than pandas
DuckDB is 1.4× faster than pandas

While Polars leads with a 7.7× speedup in CSV reading, DuckDB’s 1.4× improvement shows parsing isn’t its focus. DuckDB shines when querying files directly or running complex analytical queries.
Query Optimization
pandas: No Optimization
pandas executes operations eagerly, creating intermediate DataFrames at each step. This wastes memory and prevents optimization.
┌─────────────────────────────────────────────────────────────┐
│ Step 1: Load ALL rows → 10M rows in memory │
│ Step 2: Filter (amount > 100) → 5M rows in memory │
│ Step 3: GroupBy → New DataFrame │
│ Step 4: Mean → Final result │
└─────────────────────────────────────────────────────────────┘
Memory: ████████████████████████████████ (high – stores all intermediates)

def pandas_query():
return (
pd.read_csv("sales_data.csv")
.query('amount > 100')
.groupby('category')['amount']
.mean()
)

pandas_opt_time = %timeit -o pandas_query()

1.46 s ± 88.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This approach has three problems:

Full CSV load: All rows are read before filtering
No predicate pushdown: Rows are filtered after loading the entire file into memory
No projection pushdown: All columns are loaded, even unused ones

You can manually add usecols to load fewer columns:
def pandas_query_optimized():
return (
pd.read_csv("sales_data.csv", usecols=["category", "amount"])
.query('amount > 100')
.groupby('category')['amount']
.mean()
)

pandas_usecols_time = %timeit -o pandas_query_optimized()

1.06 s ± 48.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This is faster, but has two drawbacks:

Manual tracking: You must specify columns yourself; change the query, update usecols
No row filtering: All rows still load before the filter applies

Polars and DuckDB handle both automatically by analyzing your query before execution.
Polars: Lazy Evaluation
Polars supports lazy evaluation, which builds a query plan and optimizes it before execution:
┌─────────────────────────────────────────────────────────────┐
│ Query Plan Built: │
│ scan_csv → filter → group_by → agg │
│ │
│ Optimizations Applied: │
│ • Predicate pushdown (filter during scan) │
│ • Projection pushdown (read only needed columns) │
│ • Multi-threaded execution (parallel across CPU cores) │
└─────────────────────────────────────────────────────────────┘
Memory: ████████ (low – no intermediate DataFrames)

query_pl = (
pl.scan_csv("sales_data.csv")
.filter(pl.col("amount") > 100)
.group_by("category")
.agg(pl.col("amount").mean().alias("avg_amount"))
)

# View the optimized query plan
print(query_pl.explain())

AGGREGATE[maintain_order: false]
[col("amount").mean().alias("avg_amount")] BY [col("category")]
FROM
Csv SCAN [sales_data.csv] [id: 4687118704]
PROJECT 2/4 COLUMNS
SELECTION: [(col("amount")) > (100.0)]

The query plan shows these optimizations:

Predicate pushdown: SELECTION filters during scan, not after loading
Projection pushdown: PROJECT 2/4 COLUMNS reads only what’s needed
Operation reordering: Aggregate runs on filtered data, not the full dataset

Execute the optimized query:
def polars_query():
return (
pl.scan_csv("sales_data.csv")
.filter(pl.col("amount") > 100)
.group_by("category")
.agg(pl.col("amount").mean().alias("avg_amount"))
.collect()
)

polars_opt_time = %timeit -o polars_query()

148 ms ± 32.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

DuckDB: SQL Optimizer
DuckDB’s SQL optimizer applies similar optimizations automatically:
┌─────────────────────────────────────────────────────────────┐
│ Query Plan Built: │
│ SQL → Parser → Optimizer → Execution Plan │
│ │
│ Optimizations Applied: │
│ • Predicate pushdown (WHERE during scan) │
│ • Projection pushdown (SELECT only needed columns) │
│ • Vectorized execution (process 1024 rows per batch) │
└─────────────────────────────────────────────────────────────┘
Memory: ████████ (low – streaming execution)

def duckdb_query():
return duckdb.sql("""
SELECT category, AVG(amount) as avg_amount
FROM 'sales_data.csv'
WHERE amount > 100
GROUP BY category
""").df()

duckdb_opt_time = %timeit -o duckdb_query()

245 ms ± 12.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Let’s compare the performance of the optimized queries:
print(f"Polars is {pandas_opt_time.average / polars_opt_time.average:.1f}× faster than pandas")
print(f"DuckDB is {pandas_opt_time.average / duckdb_opt_time.average:.1f}× faster than pandas")

Polars is 9.9× faster than pandas
DuckDB is 6.0× faster than pandas

Polars outperforms DuckDB (9.9× vs 6.0×) in this benchmark because its Rust-based engine handles the filter-then-aggregate pattern efficiently. DuckDB’s strength lies in complex SQL queries with joins and subqueries.
GroupBy Performance
Computing aggregates requires scanning every row, a workload that scales linearly with CPU cores. This makes groupby operations the clearest test of parallel execution.
Let’s load the data for the groupby benchmarks:
# Load data for fair comparison
df_pd = pd.read_csv("sales_data.csv")
df_pl = pl.read_csv("sales_data.csv")

pandas: Single-Threaded
pandas processes groupby operations on a single CPU core, which becomes a bottleneck on large datasets.
┌─────────────────────────────────────────────────────────────┐
│ CPU Core 1 │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Group A → Group B → Group C → Group D → … → Aggregate │ │
│ └─────────────────────────────────────────────────────────┘ │
│ CPU Core 2 [idle] │
│ CPU Core 3 [idle] │
│ CPU Core 4 [idle] │
└─────────────────────────────────────────────────────────────┘

def pandas_groupby():
return df_pd.groupby("category")["amount"].mean()

pandas_groupby_time = %timeit -o pandas_groupby()

271 ms ± 135 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Polars: Multi-Threaded
Polars splits data across cores, computes partial aggregates in parallel, then merges the results.
┌─────────────────────────────────────────────────────────────┐
│ CPU Core 1 ┌──────────────┐ │
│ │ ████████████ │ → Partial Aggregate │
│ CPU Core 2 ┌──────────────┐ │
│ │ ████████████ │ → Partial Aggregate │
│ CPU Core 3 ┌──────────────┐ │
│ │ ████████████ │ → Partial Aggregate │
│ CPU Core 4 ┌──────────────┐ │
│ │ ████████████ │ → Partial Aggregate │
│ ↓ │
│ Final Merge → Result │
└─────────────────────────────────────────────────────────────┘

def polars_groupby():
return df_pl.group_by("category").agg(pl.col("amount").mean())

polars_groupby_time = %timeit -o polars_groupby()

31.1 ms ± 3.65 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

DuckDB: Multi-Threaded
Similar to Polars, DuckDB splits data across cores, computes partial aggregates in parallel, then merges the results.
┌─────────────────────────────────────────────────────────────┐
│ CPU Core 1 ┌──────────────┐ │
│ │ ████████████ │ → Partial Aggregate │
│ CPU Core 2 ┌──────────────┐ │
│ │ ████████████ │ → Partial Aggregate │
│ CPU Core 3 ┌──────────────┐ │
│ │ ████████████ │ → Partial Aggregate │
│ CPU Core 4 ┌──────────────┐ │
│ │ ████████████ │ → Partial Aggregate │
│ ↓ │
│ Final Merge → Result │
└─────────────────────────────────────────────────────────────┘

def duckdb_groupby():
return duckdb.sql("""
SELECT category, AVG(amount)
FROM df_pd
GROUP BY category
""").df()

duckdb_groupby_time = %timeit -o duckdb_groupby()

29 ms ± 3.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

print(f"Polars is {pandas_groupby_time.average / polars_groupby_time.average:.1f}× faster than pandas")
print(f"DuckDB is {pandas_groupby_time.average / duckdb_groupby_time.average:.1f}× faster than pandas")

Polars is 8.7× faster than pandas
DuckDB is 9.4× faster than pandas

DuckDB and Polars perform similarly (9.4× vs 8.7×), both leveraging parallel execution. DuckDB’s slight edge comes from late materialization and vector-at-a-time pipelined execution, which avoids creating intermediate results that Polars may still materialize for some operations.
Memory Efficiency
pandas: Full Memory Load
pandas loads the entire dataset into RAM:
┌─────────────────────────────────────────────────────────────┐
│ RAM │
│ ┌────────────────────────────────────────────────────────┐ │
│ │████████████████████████████████████████████████████████│ │
│ │██████████████████ ALL 10M ROWS ████████████████████████│ │
│ │████████████████████████████████████████████████████████│ │
│ └────────────────────────────────────────────────────────┘ │
│ Usage: 707,495 KB (entire dataset in memory) │
└─────────────────────────────────────────────────────────────┘

df_pd_mem = pd.read_csv("sales_data.csv")
pandas_mem = df_pd_mem.memory_usage(deep=True).sum() / 1e3
print(f"pandas memory usage: {pandas_mem:,.0f} KB")

pandas memory usage: 707,495 KB

For larger-than-RAM datasets, pandas throws an out-of-memory error.
Polars: Streaming Mode
Polars can process data in streaming mode, handling chunks without loading everything:
┌─────────────────────────────────────────────────────────────┐
│ RAM │
│ ┌────────────────────────────────────────────────────────┐ │
│ │█ │ │
│ │ (result only) │ │
│ │ │ │
│ └────────────────────────────────────────────────────────┘ │
│ Usage: 0.06 KB (streams chunks, keeps only result) │
└─────────────────────────────────────────────────────────────┘

result_pl_stream = (
pl.scan_csv("sales_data.csv")
.group_by("category")
.agg(pl.col("amount").mean())
.collect(streaming=True)
)

polars_mem = result_pl_stream.estimated_size() / 1e3
print(f"Polars result memory: {polars_mem:.2f} KB")

Polars result memory: 0.06 KB

For larger-than-RAM files, use sink_parquet instead of collect(). It writes results directly to disk as chunks are processed, never holding the full dataset in memory:
(
pl.scan_csv("sales_data.csv")
.filter(pl.col("amount") > 500)
.sink_parquet("filtered_sales.parquet")
)

DuckDB: Automatic Spill-to-Disk
DuckDB automatically writes intermediate results to temporary files when data exceeds available RAM:
┌─────────────────────────────────────────────────────────────┐
│ RAM Disk (if needed) │
│ ┌──────────────────────────┐ ┌──────────────────────┐ │
│ │█ │ │░░░░░░░░░░░░░░░░░░░░░░│ │
│ │ (up to 500MB) │ → │ (overflow here) │ │
│ │ │ │ │ │
│ └──────────────────────────┘ └──────────────────────┘ │
│ Usage: 0.42 KB (spills to disk when RAM full) │
└─────────────────────────────────────────────────────────────┘

# Configure memory limit and temp directory
duckdb.sql("SET memory_limit = '500MB'")
duckdb.sql("SET temp_directory = '/tmp/duckdb_temp'")

# DuckDB handles larger-than-RAM automatically
result_duckdb_mem = duckdb.sql("""
SELECT category, AVG(amount) as avg_amount
FROM 'sales_data.csv'
GROUP BY category
""").df()

duckdb_mem = result_duckdb_mem.memory_usage(deep=True).sum() / 1e3
print(f"DuckDB result memory: {duckdb_mem:.2f} KB")

DuckDB result memory: 0.42 KB

DuckDB’s out-of-core processing makes it ideal for embedded analytics where memory is limited.
print(f"pandas: {pandas_mem:,.0f} KB (full dataset)")
print(f"Polars: {polars_mem:.2f} KB (result only)")
print(f"DuckDB: {duckdb_mem:.2f} KB (result only)")
print(f"\nPolars uses {pandas_mem / polars_mem:,.0f}× less memory than pandas")
print(f"DuckDB uses {pandas_mem / duckdb_mem:,.0f}× less memory than pandas")

pandas: 707,495 KB (full dataset)
Polars: 0.06 KB (result only)
DuckDB: 0.42 KB (result only)

Polars uses 11,791,583× less memory than pandas
DuckDB uses 1,684,512× less memory than pandas

The million-fold reduction comes from streaming: Polars and DuckDB process data in chunks and only keep the 4-row result in memory, while pandas must hold all 10 million rows to compute the same aggregation.
Join Operations
Joining tables is one of the most common operations in data analysis. Let’s compare how each tool handles a left join between 1 million orders and 100K customers.
Let’s create two tables for join benchmarking:
# Create orders table (1M rows)
orders_pd = pd.DataFrame({
"order_id": range(1_000_000),
"customer_id": np.random.randint(1, 100_000, size=1_000_000),
"amount": np.random.rand(1_000_000) * 500
})

# Create customers table (100K rows)
customers_pd = pd.DataFrame({
"customer_id": range(100_000),
"region": np.random.choice(["North", "South", "East", "West"], size=100_000)
})

# Convert to Polars
orders_pl = pl.from_pandas(orders_pd)
customers_pl = pl.from_pandas(customers_pd)

pandas: Single-Threaded
pandas processes the join on a single CPU core.
┌─────────────────────────────────────────────┐
│ CPU Core 1 │
│ ┌─────────────────────────────────────────┐ │
│ │ Row 1 → Row 2 → Row 3 → … → Row 1M │ │
│ └─────────────────────────────────────────┘ │
│ CPU Core 2 [idle] │
│ CPU Core 3 [idle] │
│ CPU Core 4 [idle] │
└─────────────────────────────────────────────┘

def pandas_join():
return orders_pd.merge(customers_pd, on="customer_id", how="left")

pandas_join_time = %timeit -o pandas_join()

60.4 ms ± 6.98 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Polars: Multi-Threaded
Polars distributes the join across all available CPU cores.
┌─────────────────────────────────────────────┐
│ CPU Core 1 ┌────────────────┐ │
│ │ ████████████ │ │
│ CPU Core 2 ┌────────────────┐ │
│ │ ████████████ │ │
│ CPU Core 3 ┌────────────────┐ │
│ │ ████████████ │ │
│ CPU Core 4 ┌────────────────┐ │
│ │ ████████████ │ │
└─────────────────────────────────────────────┘

def polars_join():
return orders_pl.join(customers_pl, on="customer_id", how="left")

polars_join_time = %timeit -o polars_join()

11.8 ms ± 6.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

DuckDB: Multi-Threaded
Similar to Polars, DuckDB distributes the join across all available CPU cores.
┌─────────────────────────────────────────────┐
│ CPU Core 1 ┌────────────────┐ │
│ │ ████████████ │ │
│ CPU Core 2 ┌────────────────┐ │
│ │ ████████████ │ │
│ CPU Core 3 ┌────────────────┐ │
│ │ ████████████ │ │
│ CPU Core 4 ┌────────────────┐ │
│ │ ████████████ │ │
└─────────────────────────────────────────────┘

def duckdb_join():
return duckdb.sql("""
SELECT o.*, c.region
FROM orders_pd o
LEFT JOIN customers_pd c ON o.customer_id = c.customer_id
""").df()

duckdb_join_time = %timeit -o duckdb_join()

55.7 ms ± 1.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Let’s compare the performance of the joins:
print(f"Polars is {pandas_join_time.average / polars_join_time.average:.1f}× faster than pandas")
print(f"DuckDB is {pandas_join_time.average / duckdb_join_time.average:.1f}× faster than pandas")

Polars is 5.1× faster than pandas
DuckDB is 1.1× faster than pandas

Polars delivers a 5.1× speedup while DuckDB shows only 1.1× improvement. Both tools use multi-threading, but Polars’ join algorithm and native DataFrame output avoid the conversion overhead that DuckDB incurs when returning results via .df().
Interoperability
All three tools work together seamlessly. Use each tool for what it does best in a single pipeline.
pandas DataFrame to DuckDB
Query pandas DataFrames directly with SQL:
df = pd.DataFrame({
"product": ["A", "B", "C"],
"sales": [100, 200, 150]
})

# DuckDB queries pandas DataFrames by variable name
result = duckdb.sql("SELECT * FROM df WHERE sales > 120").df()
print(result)

product sales
0 B 200
1 C 150

Polars to pandas
Convert Polars DataFrames when ML libraries require pandas:
df_polars = pl.DataFrame({
"feature1": [1, 2, 3],
"feature2": [4, 5, 6],
"target": [0, 1, 0]
})

# Convert to pandas for scikit-learn
df_pandas = df_polars.to_pandas()
print(type(df_pandas))

DuckDB to Polars
Get query results as Polars DataFrames:
result = duckdb.sql("""
SELECT category, SUM(amount) as total
FROM 'sales_data.csv'
GROUP BY category
""").pl()

print(type(result))
print(result)

<class 'polars.dataframe.frame.DataFrame'>
shape: (4, 2)
┌─────────────┬──────────┐
│ category ┆ total │
│ — ┆ — │
│ str ┆ f64 │
╞═════════════╪══════════╡
│ Electronics ┆ 6.2445e8 │
│ Food ┆ 6.2540e8 │
│ Clothing ┆ 6.2539e8 │
│ Books ┆ 6.2475e8 │
└─────────────┴──────────┘

Combined Pipeline Example
Each tool has a distinct strength: DuckDB optimizes SQL queries, Polars parallelizes transformations, and pandas integrates with ML libraries. Combine them in a single pipeline to leverage all three:
# Step 1: DuckDB for initial SQL query
aggregated = duckdb.sql("""
SELECT category, region,
SUM(amount) as total_amount,
COUNT(*) as order_count
FROM 'sales_data.csv'
GROUP BY category, region
""").pl()

# Step 2: Polars for additional transformations
enriched = (
aggregated
.with_columns([
(pl.col("total_amount") / pl.col("order_count")).alias("avg_order_value"),
pl.col("category").str.to_uppercase().alias("category_upper")
])
.filter(pl.col("order_count") > 100000)
)

# Step 3: Convert to pandas for visualization or ML
final_df = enriched.to_pandas()
print(final_df.head())

category region total_amount order_count avg_order_value category_upper
0 Food East 1.563586e+08 312918 499.679004 FOOD
1 Food North 1.563859e+08 312637 500.215456 FOOD
2 Clothing North 1.560532e+08 311891 500.345286 CLOTHING
3 Clothing East 1.565054e+08 312832 500.285907 CLOTHING
4 Food West 1.560994e+08 312662 499.259318 FOOD

📖 Related: For writing functions that work across pandas, Polars, and PySpark without conversion, see Unified DataFrame Functions.

Decision Matrix
No single tool wins in every scenario. Use these tables to choose the right tool for your workflow.
Performance Summary
Benchmark results from 10 million rows on a single machine:

Operation
pandas
Polars
DuckDB

CSV Read (10M rows)
1.05s
137ms
762ms

GroupBy
271ms
31ms
29ms

Join (1M rows)
60ms
12ms
56ms

Memory Usage
707 MB
0.06 KB (streaming)
0.42 KB (spill-to-disk)

Polars leads in CSV reading (7.7× faster than pandas) and joins (5× faster). DuckDB matches Polars in groupby performance and uses the least memory with automatic spill-to-disk.
Feature Comparison
Each tool makes different trade-offs between speed, memory, and ecosystem integration:

Feature
pandas
Polars
DuckDB

Multi-threading
No
Yes
Yes

Lazy evaluation
No
Yes
N/A (SQL)

Query optimization
No
Yes
Yes

Larger-than-RAM
No
Streaming
Spill-to-disk

SQL interface
No
Limited
Native

ML integration
Excellent
Good
Limited

pandas lacks the performance features that make Polars and DuckDB fast, but remains essential for ML workflows. Choose between Polars and DuckDB based on whether you prefer DataFrame chaining or SQL syntax.
Recommendations
The best tool depends on your data size, workflow preferences, and constraints:

Small data (<1M rows): Use pandas for simplicity
Large data (1M-100M rows): Use Polars or DuckDB for 5-10× speedup
SQL-preferred workflow: Use DuckDB
DataFrame-preferred workflow: Use Polars
Memory-constrained: Use Polars (streaming) or DuckDB (spill-to-disk)
ML pipeline integration: Use pandas (convert from Polars/DuckDB as needed)
Production data pipelines: Use Polars (DataFrame) or DuckDB (SQL) based on team preference

Final Thoughts
If your code is all written in pandas, you don’t need to rewrite it all. You can migrate where it matters:

Profile first: Find which pandas operations are slow
Replace with Polars: CSV reads, groupbys, and joins see the biggest gains
Add DuckDB: When SQL is cleaner than chained DataFrame operations

Keep pandas for final ML steps. Convert with df.to_pandas() when needed.
Related Tutorials

Polars vs. Pandas: A Fast, Multi-Core Alternative for DataFrames: Deep comparison of Polars and pandas performance and API differences
A Deep Dive into DuckDB for Data Scientists: Explore DuckDB’s SQL-native analytics with Python integration
Scaling Pandas Workflows with PySpark’s Pandas API: Scale existing pandas code to distributed processing with PySpark
Delta Lake: Transform pandas Prototypes into Production: Add ACID transactions and versioning to your pandas data pipelines

Stay Current with CodeCut
Actionable Python tips, curated for busy data pros. Skim in under 2 minutes, three times a week.

.codecut-subscribe-form {
max-width: 650px;
display: flex;
flex-direction: column;
gap: 8px;
}

input[type=”email”].codecut-input {
border-radius: 8px !important;
}

.codecut-input::placeholder {
color: #666666;
}

.codecut-email-row {
display: flex;
align-items: stretch;
height: 36px;
gap: 8px;
}

.codecut-email-row .codecut-input {
flex: 1;
}

.codecut-subscribe-btn:hover {
background: #5aa8e8;
}

.codecut-subscribe-btn:disabled {
background: #999;
cursor: not-allowed;
}

.codecut-message {
font-family: ‘Comfortaa’, sans-serif;
font-size: 12px;
padding: 8px;
border-radius: 6px;
display: none;
}

.codecut-message.success {
background: #d4edda;
color: #155724;
display: block;
}

/* Mobile responsive */
@media (max-width: 480px) {
.codecut-email-row {
flex-direction: column;
height: auto;
gap: 8px;
}

.codecut-input {
border-radius: 8px;
height: 36px;
}

.codecut-subscribe-btn {
width: 100%;
text-align: center;
border-radius: 8px;
height: 36px;
}
}

pandas vs Polars vs DuckDB: A Data Scientist’s Guide to Choosing the Right Tool Read More »

Newsletter #255: Polars v1.35: Native Rolling Rank for Time Series

Leave a Comment / Newsletter Archive / Khuyen Tran

📅 Today’s Picks

Polars v1.35: Native Rolling Rank for Time Series

Problem
How do you rank values within a rolling window?
For example, you might want to compare today’s sales to the last 3 days to answer: “How does today’s sales compare to the last 3 days?”
Solution
Polars v1.35 introduces rolling_rank() for native window ranking operations.
How it works:

Define a window size (e.g., last 3 values)
Each value gets ranked against others in its window
Rank 1 = lowest, Rank N = highest

This method is useful for tracking performance over time, detecting anomalies, or alerting when metrics underperform.

📖 View Full Article

🧪 Run code

⭐ View GitHub

Coiled: Run Python in the Cloud with One Decorator (Sponsored)

Problem
Imagine you need to run data processing on a file that is larger than your laptop’s RAM. What should you do?
Traditional solutions require buying more RAM, renting expensive cloud VMs, or learning Kubernetes. All of these add complexity and cost.
Solution
Coiled’s serverless functions let you run your Python code on cloud VMs with the memory you need by simply adding a decorator.
Key capabilities:

Use any data framework: pandas, Polars, DuckDB, Dask, and more
Process multiple files in parallel with .map()
Sync local packages to cloud without Docker
Cut costs with spot instances and auto-fallback

📖 View Full Article

🌐 Visit website

📢 ANNOUNCEMENTS

Cyber Monday: 58% Off Production-Ready Data Science
My book Production-Ready Data Science is on sale for Cyber Monday.
Get 58% off the ebook or 10% off the paperback through December 8th.
The book covers everything I’ve learned about taking data science from prototype to production: dependency management, testing, CI/CD, and workflow automation.

→ Get 58% Off Now

☕️ Weekly Finds

codon
[Python Utils]
– A high-performance, zero-overhead, extensible Python compiler with built-in NumPy support

khoj
[LLM]
– Your AI second brain. Self-hostable personal assistant with RAG, semantic search, and support for PDFs, Markdown, Notion, and more

lm-evaluation-harness
[MLOps]
– A framework for few-shot evaluation of language models. Powers Hugging Face’s Open LLM Leaderboard

Looking for a specific tool? Explore 70+ Python tools →

Stay Current with CodeCut

Actionable Python tips, curated for busy data pros. Skim in under 2 minutes, three times a week.

Newsletter #255: Polars v1.35: Native Rolling Rank for Time Series Read More »

Newsletter #254: Pydantic v2.12: Skip Computed Fields During Serialization

Leave a Comment / Newsletter Archive / Khuyen Tran

📅 Today’s Picks

Pydantic v2.12: Skip Computed Fields During Serialization

Problem
By default, Pydantic’s model_dump() serializes computed fields alongside the base fields used to derive them.
This duplicates data and increases API response sizes.
Solution
Pydantic v2.12 adds the exclude_computed_fields parameter to model_dump().
This lets you keep computed fields for internal use while excluding them from API responses.

🧪 Run code

⭐ View GitHub

☕️ Weekly Finds

llm-council
[LLM]
– Query multiple LLMs in parallel, anonymize responses, and have them rank each other for better answers

skweak
[ML]
– Build NER models without labeled data using weak supervision for NLP tasks

wrapt
[Python Utils]
– Create transparent decorators, wrappers, and monkey patches in Python

Looking for a specific tool? Explore 70+ Python tools →

Stay Current with CodeCut

Actionable Python tips, curated for busy data pros. Skim in under 2 minutes, three times a week.

Newsletter #254: Pydantic v2.12: Skip Computed Fields During Serialization Read More »

Newsletter #253: Docling: Auto-Annotate PDF Images Locally

Leave a Comment / Newsletter Archive / Khuyen Tran

📅 Today’s Picks

Docling: Auto-Annotate PDF Images Locally

Problem
Images in PDFs like charts, diagrams, and figures are invisible to search and analysis. Manually writing descriptions for hundreds of figures is impractical.
You could use cloud APIs like Gemini or ChatGPT, but that means API costs at scale and your documents leaving your infrastructure.
Solution
Docling runs local vision language models (Granite Vision, SmolVLM) to automatically generate descriptive annotations for every picture in your documents, keeping data private.
Key benefits:

Privacy: Data stays local, works offline
Cost: No per-image API fees
Flexibility: Customizable prompts, any HuggingFace model

📖 View Full Article

🧪 Run code

⭐ View GitHub

Rembg: Remove Image Backgrounds in 2 Lines of Python

Problem
Removing backgrounds from images typically requires Photoshop, online tools, or AI assistants like ChatGPT.
But these options come with subscription costs, upload limits, or privacy concerns with your images on external servers.
Solution
Rembg uses AI models to remove backgrounds locally with just 2 lines of Python.
It’s also open source and compatible with common Python imaging libraries.

🧪 Run code

⭐ View GitHub

☕️ Weekly Finds

label-studio
[MLOps]
– Multi-type data labeling and annotation tool with standardized output format

reflex
[Python Utils]
– Build full-stack web apps in pure Python – no JavaScript required

TradingAgents
[LLM]
– Multi-agent LLM financial trading framework

Looking for a specific tool? Explore 70+ Python tools →

Stay Current with CodeCut

Actionable Python tips, curated for busy data pros. Skim in under 2 minutes, three times a week.

Newsletter #253: Docling: Auto-Annotate PDF Images Locally Read More »

Newsletter #252: Build Fast Recommendations with Annoy’s Memory-Mapped Indexes

Leave a Comment / Newsletter Archive / Khuyen Tran

📅 Today’s Picks

Build Fast Recommendations with Annoy’s Memory-Mapped Indexes

Problem
sklearn loads all your item vectors into memory and compares your search vector against every single item in your dataset.
This can take seconds or minutes when you have millions of items.
Solution
Annoy (Approximate Nearest Neighbors Oh Yeah), built by Spotify, speeds up similarity search by organizing your vectors into a searchable tree structure.
How it works:

Pre-builds indexes with “build(n_trees)”, creating multiple trees by recursively splitting your vector space with random hyperplanes
Traverses tree splits to find the n nearest neighbors using “get_nns_by_item(i, n)”
Checks only items in the final region instead of scanning everything

As a result, you can query millions of items in milliseconds instead of seconds.

🧪 Run code

⭐ View GitHub

Build Reliable DataFrame Tests with assert_frame_equal

Problem
Testing numerical code with regular assertions can lead to false failures from floating-point precision.
Your perfectly correct function fails tests because 0.1 + 0.2 doesn’t exactly equal 0.3 in computer arithmetic.
Solution
Use numpy.testing and pandas.testing utilities for robust numerical comparisons.
Key approaches:

assert_array_almost_equal for NumPy arrays with decimal precision control
pd.testing.assert_frame_equal for DataFrame comparisons with tolerance
Handle floating-point arithmetic limitations properly
Get reliable test results for numerical data processing

Professional data science requires proper numerical testing methods.

📖 Learn more

🧪 Run code

☕️ Weekly Finds

sympy
[Python Utils]
– A computer algebra system written in pure Python for symbolic mathematics

qdrant
[LLM]
– High-performance, massive-scale Vector Database and Vector Search Engine for the next generation of AI

mindsdb
[ML]
– Federated query engine for AI – connect to hundreds of data sources and generate intelligent responses using built-in agents

Looking for a specific tool? Explore 70+ Python tools →

Stay Current with CodeCut

Actionable Python tips, curated for busy data pros. Skim in under 2 minutes, three times a week.

Newsletter #252: Build Fast Recommendations with Annoy’s Memory-Mapped Indexes Read More »

Coiled: Scale Python Data Pipeline to the Cloud in Minutes

Leave a Comment / Blog, Data Engineer, MLOps / Khuyen Tran

Table of Contents

Introduction
What is Coiled?
Setup
Serverless Functions: Process Data with Any Framework
When You Need More: Distributed Clusters with Dask
Cost Optimization
Environment Synchronization
Conclusion

Introduction
A common challenge data scientists face is that local machines simply can’t handle large-scale datasets. Once your analysis reaches the 50GB+ range, you’re pushed into difficult choices:

Sample your data and hope patterns hold
Buy more RAM or rent expensive cloud VMs
Learn Kubernetes and spend days configuring clusters
Build Docker images and manage container registries

Each option adds complexity, cost, or compromises your analysis quality.
Coiled eliminates these tradeoffs. It provisions emphemeral compute clusters on AWS, GCP, or Azure using simple Python APIs. You get distributed computing power without DevOps expertise, automatic environment synchronization without Docker, and 70% cost savings through smart spot instance management.
In this article, you’ll learn how to scale Python data workflows to the cloud with Coiled:

Serverless functions: Run pandas, Polars, or DuckDB code on cloud VMs with a simple decorator
Parallel processing: Process multiple files simultaneously across cloud machines
Distributed clusters: Aggregate data across files using managed Dask clusters
Environment sync: Replicate your local packages to the cloud without Docker
Cost optimization: Reduce cloud spending with spot instances and auto-scaling

💻 Get the Code: The complete source code and Jupyter notebook for this tutorial are available on GitHub. Clone it to follow along!

Stay Current with CodeCut
Actionable Python tips, curated for busy data pros. Skim in under 2 minutes, three times a week.

.codecut-subscribe-form {
max-width: 650px;
display: flex;
flex-direction: column;
gap: 8px;
}

input[type=”email”].codecut-input {
border-radius: 8px !important;
}

.codecut-input::placeholder {
color: #666666;
}

.codecut-email-row {
display: flex;
align-items: stretch;
height: 36px;
gap: 8px;
}

.codecut-email-row .codecut-input {
flex: 1;
}

.codecut-subscribe-btn:hover {
background: #5aa8e8;
}

.codecut-subscribe-btn:disabled {
background: #999;
cursor: not-allowed;
}

.codecut-message {
font-family: ‘Comfortaa’, sans-serif;
font-size: 12px;
padding: 8px;
border-radius: 6px;
display: none;
}

.codecut-message.success {
background: #d4edda;
color: #155724;
display: block;
}

/* Mobile responsive */
@media (max-width: 480px) {
.codecut-email-row {
flex-direction: column;
height: auto;
gap: 8px;
}

.codecut-input {
border-radius: 8px;
height: 36px;
}

.codecut-subscribe-btn {
width: 100%;
text-align: center;
border-radius: 8px;
height: 36px;
}
}

What is Coiled?
Coiled is a lightweight cloud platform that runs Python code on powerful cloud infrastructure without requiring Docker or Kubernetes knowledge. It supports four main capabilities:

Batch Jobs: Submit and run Python scripts asynchronously on cloud infrastructure
Serverless Functions: Execute Python functions (Pandas, Polars, PyTorch) on cloud VMs with decorators
Dask Clusters: Provision multi-worker clusters for distributed computing
Jupyter Notebooks: Launch interactive Jupyter servers directly on cluster schedulers

Key features across both:

Framework-Agnostic: Works with Pandas, Polars, Dask, or any Python library
Automatic Package Sync: Local packages replicate to cloud workers without Docker
Cost Optimization: Spot instances, adaptive scaling, and auto-shutdown reduce spending
Simple APIs: Decorate functions or create clusters with 2-3 lines of code

To install Coiled and Dask, run:
pip install coiled dask[complete]

Setup
First, create a free Coiled account by running this command in your terminal:
coiled login

This creates a free Coiled Hosted account with 200 CPU-hours per month. Your code runs on Coiled’s cloud infrastructure with no AWS/GCP/Azure account needed.
Note: For production workloads with your own cloud account, run coiled setup PROVIDER (setup guide).
Serverless Functions: Process Data with Any Framework
The simplest way to scale Python code to the cloud is with serverless functions. Decorate any function with @coiled.function, and Coiled handles provisioning cloud VMs, installing packages, and executing your code.
Scale Beyond Laptop Memory with Cloud VMs
Imagine you need to process a NYCTaxi dataset with 12GB compressed files (50GB+ expanded) on a laptop with only 16GB of RAM. Your machine simply doesn’t have enough memory to handle this workload.
With Coiled, you can run the exact same code on a cloud VM with 64GB of RAM by simply adding the @coiled.function decorator.
import coiled
import pandas as pd

@coiled.function(
memory="64 GiB",
region="us-east-1"
)
def process_month_with_pandas(month):
# Read 12GB file directly into pandas
df = pd.read_parquet(
f"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-{month:02d}.parquet"
)

# Compute tipping patterns by hour
df["hour"] = df["tpep_pickup_datetime"].dt.hour
result = df.groupby("hour")["tip_amount"].mean()

return result

# Run on cloud VM with 64GB RAM
january_tips = process_month_with_pandas(1)
print(january_tips)

Output:
hour
0 3.326543
1 2.933899
2 2.768246
3 2.816333
4 3.132973
…
Name: tip_amount, dtype: float64

This function runs on a cloud VM with 64GB RAM, processes the entire month in memory, and returns just the aggregated result to your laptop.
You can view the function’s execution progress and resource usage in the Coiled dashboard.

Parallel Processing with .map()
By default Coiled Functions will run sequentially, just like normal Python functions. However, they can also easily run in parallel by using the .map() method.
Process all 12 months in parallel using .map():
import coiled
import pandas as pd

@coiled.function(memory="64 GiB", region="us-east-1")
def process_month(month):
df = pd.read_parquet(
f"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-{month:02d}.parquet"
)
return df["tip_amount"].mean()

# Process 12 months in parallel on 12 cloud VMs
months = range(1, 13)
monthly_tips = list(process_month.map(months))

print("Average tips by month:", monthly_tips)

Output:
Average tips by month: [2.65, 2.58, 2.72, 2.68, 2.75, 2.81, 2.79, 2.73, 2.69, 2.71, 2.66, 2.62]

When you call .map() with 12 months, Coiled spins up 12 cloud VMs simultaneously, runs process_month() on each VM with a different month, then returns all results.
The execution flow:
VM 1: yellow_tripdata_2024-01.parquet → compute mean → 2.65
VM 2: yellow_tripdata_2024-02.parquet → compute mean → 2.58
VM 3: yellow_tripdata_2024-03.parquet → compute mean → 2.72
… (all running in parallel)
VM 12: yellow_tripdata_2024-12.parquet → compute mean → 2.62
↓
Coiled collects: [2.65, 2.58, 2.72, …, 2.62]

Each VM works in complete isolation with no data sharing or coordination between them.

The dashboard confirms 12 tasks were executed, matching the 12 months we passed to .map().
Framework-Agnostic: Use Any Python Library
Coiled Functions aren’t limited to pandas. You can use any Python library (Polars, DuckDB, PyTorch, scikit-learn) without any additional configuration. The automatic package synchronization works for all dependencies.
Example with Polars:
Polars is a fast DataFrame library optimized for performance. It works seamlessly with Coiled:
import coiled
import polars as pl

@coiled.function(memory="64 GiB", region="us-east-1")
def process_with_polars(month):
df = pl.read_parquet(
f"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-{month:02d}.parquet"
)
return (
df
.filter(pl.col("tip_amount") > 0)
.group_by("PULocationID")
.agg(pl.col("tip_amount").mean())
.sort("tip_amount", descending=True)
.head(5)
)

result = process_with_polars(1)
print(result)

Output:
shape: (5, 2)
┌──────────────┬────────────┐
│ PULocationID ┆ tip_amount │
│ — ┆ — │
│ i64 ┆ f64 │
╞══════════════╪════════════╡
│ 138 ┆ 4.52 │
│ 230 ┆ 4.23 │
│ 161 ┆ 4.15 │
│ 234 ┆ 3.98 │
│ 162 ┆ 3.87 │
└──────────────┴────────────┘

Example with DuckDB:
DuckDB provides fast SQL analytics directly on Parquet files:
import coiled
import duckdb

@coiled.function(memory="64 GiB", region="us-east-1")
def query_with_duckdb(month):
con = duckdb.connect()
result = con.execute(f"""
SELECT
DATE_TRUNC('hour', tpep_pickup_datetime) as pickup_hour,
AVG(tip_amount) as avg_tip,
COUNT(*) as trip_count
FROM 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-{month:02d}.parquet'
WHERE tip_amount > 0
GROUP BY pickup_hour
ORDER BY avg_tip DESC
LIMIT 5
""").fetchdf()
return result

result = query_with_duckdb(1)
print(result)

Output:
pickup_hour avg_tip trip_count
0 2024-01-15 14:00:00 4.23 15234
1 2024-01-20 18:00:00 4.15 18456
2 2024-01-08 12:00:00 3.98 12789
3 2024-01-25 16:00:00 3.87 14567
4 2024-01-12 20:00:00 3.76 16234

Coiled automatically detects your local Polars and DuckDB installations and replicates them to cloud VMs. No manual configuration needed.
When You Need More: Distributed Clusters with Dask
Serverless functions work great for independent file processing. However, when you need to combine and aggregate data across all your files into a single result, you need a Dask cluster.
For example, suppose you want to calculate total revenue by pickup location across all 12 months of data. With Coiled Functions, each VM processes one month independently:
@coiled.function(memory="64 GiB", region="us-east-1")
def get_monthly_revenue_by_location(month):
df = pd.read_parquet(
f"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-{month:02d}.parquet"
)
return df.groupby("PULocationID")["total_amount"].sum()

# This returns 12 separate DataFrames, one per month
results = list(get_monthly_revenue_by_location.map(range(1, 13)))
print(f'Number of DataFrames: {len(results)}')

Output:
Number of DataFrames: 12

The problem is that you get 12 separate DataFrames that you need to manually combine.
Here’s what happens: VM 1 processes January and returns a DataFrame like:
PULocationID total_amount
138 15000
230 22000

VM 2 processes February and returns:
PULocationID total_amount
138 18000
230 19000

Each VM works independently and has no knowledge of the other months’ data. To get yearly totals per location, you’d need to write code to merge these 12 DataFrames and sum the revenue for each location.
With a Dask cluster, workers coordinate to give you one global result:
import coiled
import dask.dataframe as dd

# For production workloads, you can scale to 50+ workers
cluster = coiled.Cluster(n_workers=3, region="us-east-1")

# Read all 12 months of 2024 data
files = [
f"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-{month:02d}.parquet"
for month in range(1, 13)
]
df = dd.read_parquet(files) # Lazy: builds a plan, doesn't load data yet

# This returns ONE DataFrame with total revenue per location across all months
total_revenue = (
df.groupby("PULocationID")["total_amount"].sum().compute()
) # Executes the plan

total_revenue.head()

Output:
PULocationID
1 563645.70
2 3585.80
3 67261.41
4 1687265.08
5 602.98
Name: total_amount, dtype: float64

You can see that we got a single DataFrame with the total revenue per location across all months.
Here is what happens under the hood:
When you call .compute(), Dask executes the plan in four steps:
Step 1: Data Distribution
├─ Worker 1: [Jan partitions 1-3, Apr partitions 1-2, Jul partitions 1-3]
├─ Worker 2: [Feb partitions 1-4, May partitions 1-3, Aug partitions 1-2]
└─ Worker 3: [Mar partitions 1-3, Jun partitions 1-2, Sep-Dec partitions]

Step 2: Local Aggregation (each worker groups its data)
├─ Worker 1: {location_138: $45,000, location_230: $63,000}
├─ Worker 2: {location_138: $38,000, location_230: $55,000}
└─ Worker 3: {location_138: $50,000, location_230: $62,000}

Step 3: Shuffle (redistribute so each location lives on one worker)
├─ Worker 1: All location_230 data → $63,000 + $55,000 + $62,000
├─ Worker 2: All location_138 data → $45,000 + $38,000 + $50,000
└─ Worker 3: All other locations…

Step 4: Final Result
location_138: $133,000 (yearly total)
location_230: $180,000 (yearly total)

This shuffle-and-combine process is what makes Dask different from Coiled Functions. Workers actively coordinate and share data to produce one unified result.
Cost Optimization
Cloud costs can spiral quickly. Coiled provides three mechanisms to reduce spending:
1. Spot Instances
You can reduce cloud costs by 60-90% using spot instances. These are discounted servers that cloud providers can reclaim when demand increases. When an interruption occurs, Coiled:

Gracefully shuts down the affected worker
Redistributes its work to healthy workers
Automatically launches a replacement worker

cluster = coiled.Cluster(
n_workers=50,
spot_policy="spot_with_fallback", # Use spot instances with on-demand backup
region="us-east-1"
)

Cost comparison for m5.xlarge instances:

On-demand: $0.192/hour
Spot: $0.05/hour
Savings: 74%

For a 100-worker cluster:

On-demand: $19.20/hour = $460/day
Spot: $5.00/hour = $120/day

2. Adaptive Scaling
Adaptive scaling automatically adds workers when you have more work and removes them when idle, so you only pay for what you need. Coiled enables this with the adapt() method:
cluster = coiled.Cluster(region="us-east-1")
cluster.adapt(minimum=10, maximum=50) # Scale between 10-50 workers

Serverless functions also support auto-scaling by specifying a worker range:
@coiled.function(n_workers=[10, 300])
def process_data(files):
return results

This saves money during light workloads while delivering performance during heavy computation. No manual monitoring required.
3. Automatic Shutdown
To prevent paying for unused resources, Coiled automatically shuts down clusters after 20 minutes of inactivity by default. You can customize this with the idle_timeout parameter:
cluster = coiled.Cluster(
n_workers=20,
region="us-east-1",
idle_timeout="1 hour" # Keep cluster alive for longer workloads
)

This prevents the common mistake of leaving clusters running overnight.
Environment Synchronization
The “Works on My Machine” Problem When Scaling to Cloud
Imagine this scenario: your pandas code works locally but fails on a cloud VM because the environment has different package versions or missing dependencies.
Docker solves this by packaging your environment into a container that runs identically on your laptop and cloud VMs. However, getting it running on cloud infrastructure involves a complex workflow:

Write a Dockerfile listing all dependencies and versions
Build the Docker image (wait 5-10 minutes)
Push to cloud container registry (AWS ECR, Google Container Registry)
Configure cloud VMs (EC2/GCE instances with proper networking and security)
Pull and run the image on cloud machines (3-5 minutes per VM)
Rebuild and redeploy every time you add a package (repeat steps 2-5)

This Docker + cloud workflow slows down development and requires expertise in both containerization and cloud infrastructure management.
Coiled’s Solution: Automatic Package Synchronization
Coiled eliminates Docker entirely through automatic package synchronization. Your local environment replicates to cloud workers with no Dockerfile required.
Instead of managing Docker images and cloud infrastructure, you simply add a decorator to your function:
import coiled
import pandas as pd

@coiled.function(memory="64 GiB", region="us-east-1")
def process_data():
df = pd.read_parquet("s3://my-bucket/data.parquet")
# Your analysis code here
return df.describe()

result = process_data() # Runs on cloud VM with your exact package versions

What Coiled does automatically:

Scans your local environment (pip, conda packages with exact versions)
Creates a dependency manifest (a list of all packages and their versions)
Installs packages on cloud workers with matching versions
Reuses built environments when your dependencies haven’t changed

This is faster than Docker builds in most cases thanks to intelligent caching, and requires zero configuration.
Conclusion
Coiled transforms cloud computing from a multi-day DevOps project into simple Python operations. Whether you’re processing a single large file with Pandas, querying multiple files with Polars, or running distributed aggregations with Dask clusters, Coiled provides the right scaling approach for your needs.
I recommend starting simple with serverless functions for single-file processing, then scale to Dask clusters when you need truly distributed computing. Coiled removes the infrastructure burden from data science workflows, letting you focus on analysis instead of operations.
Related Tutorials

Polars vs Pandas: Performance Benchmarks and When to Switch – Choose the right DataFrame library for your workloads
DuckDB: Fast SQL Analytics on Parquet Files – Master SQL techniques for processing data with DuckDB
PySpark Pandas API: Familiar Syntax at Scale – Explore Spark as an alternative to Dask for distributed processing

Stay Current with CodeCut
Actionable Python tips, curated for busy data pros. Skim in under 2 minutes, three times a week.

.codecut-subscribe-form {
max-width: 650px;
display: flex;
flex-direction: column;
gap: 8px;
}

input[type=”email”].codecut-input {
border-radius: 8px !important;
}

.codecut-input::placeholder {
color: #666666;
}

.codecut-email-row {
display: flex;
align-items: stretch;
height: 36px;
gap: 8px;
}

.codecut-email-row .codecut-input {
flex: 1;
}

.codecut-subscribe-btn:hover {
background: #5aa8e8;
}

.codecut-subscribe-btn:disabled {
background: #999;
cursor: not-allowed;
}

.codecut-message {
font-family: ‘Comfortaa’, sans-serif;
font-size: 12px;
padding: 8px;
border-radius: 6px;
display: none;
}

.codecut-message.success {
background: #d4edda;
color: #155724;
display: block;
}

/* Mobile responsive */
@media (max-width: 480px) {
.codecut-email-row {
flex-direction: column;
height: auto;
gap: 8px;
}

.codecut-input {
border-radius: 8px;
height: 36px;
}

.codecut-subscribe-btn {
width: 100%;
text-align: center;
border-radius: 8px;
height: 36px;
}
}

Coiled: Scale Python Data Pipeline to the Cloud in Minutes Read More »

Newsletter #251: PySpark 4.0: Native Plotting API for DataFrames

Leave a Comment / Newsletter Archive / Khuyen Tran

📅 Today’s Picks

PySpark 4.0: Native Plotting API for DataFrames

Problem
Visualizing PySpark DataFrames typically requires converting to Pandas first, adding memory overhead and extra processing steps.
Solution
PySpark 4.0 adds native Plotly-powered plotting, enabling direct .plot() calls on DataFrames without Pandas conversion.

📖 View Full Article

🧪 Run code

⭐ View GitHub

☕️ Weekly Finds

rembg
[Python Utils]
– Rembg is a tool to remove images background

pyupgrade
[Python Utils]
– A tool (and pre-commit hook) to automatically upgrade syntax for newer versions of the language

py-shiny
[Data Viz]
– Shiny for Python is the best way to build fast, beautiful web applications in Python

Looking for a specific tool? Explore 70+ Python tools →

Stay Current with CodeCut

Actionable Python tips, curated for busy data pros. Skim in under 2 minutes, three times a week.

Newsletter #251: PySpark 4.0: Native Plotting API for DataFrames Read More »

PySpark 4.0: 4 Features That Change How You Process Data

2 Comments / Blog, Python Utilities / Khuyen Tran

Table of Contents

Introduction
From Pandas UDFs to Arrow UDFs: Next-Gen Performance
Native Data Visualization (PySpark 4.0+)
Schema-Free JSON Handling with Variant Type (PySpark 4.0+)
Dynamic Schema Generation with UDTF analyze() (PySpark 4.0+)
Conclusion

Introduction
PySpark 4.0 introduces transformative improvements that enhance performance, streamline workflows, and enable flexible data transformations in distributed processing.
This release delivers four key enhancements:

Arrow-optimized UDFs accelerate custom transformations by operating directly on Arrow data structures, eliminating the serialization overhead of Pandas UDFs.
Native Plotly visualization enables direct DataFrame plotting without conversion, streamlining exploratory data analysis and reducing memory overhead.
Variant type for JSON enables schema-free JSON querying with JSONPath syntax, eliminating verbose StructType definitions for nested data.
Dynamic schema UDTFs adapt output columns to match input data at runtime, enabling flexible pivot tables and aggregations where column structure depends on data values.

For comprehensive coverage of core PySpark SQL functionality, see the Complete Guide to PySpark SQL.

Stay Current with CodeCut
Actionable Python tips, curated for busy data pros. Skim in under 2 minutes, three times a week.

.codecut-subscribe-form {
max-width: 650px;
display: flex;
flex-direction: column;
gap: 8px;
}

input[type=”email”].codecut-input {
border-radius: 8px !important;
}

.codecut-input::placeholder {
color: #666666;
}

.codecut-email-row {
display: flex;
align-items: stretch;
height: 36px;
gap: 8px;
}

.codecut-email-row .codecut-input {
flex: 1;
}

.codecut-subscribe-btn:hover {
background: #5aa8e8;
}

.codecut-subscribe-btn:disabled {
background: #999;
cursor: not-allowed;
}

.codecut-message {
font-family: ‘Comfortaa’, sans-serif;
font-size: 12px;
padding: 8px;
border-radius: 6px;
display: none;
}

.codecut-message.success {
background: #d4edda;
color: #155724;
display: block;
}

/* Mobile responsive */
@media (max-width: 480px) {
.codecut-email-row {
flex-direction: column;
height: auto;
gap: 8px;
}

.codecut-input {
border-radius: 8px;
height: 36px;
}

.codecut-subscribe-btn {
width: 100%;
text-align: center;
border-radius: 8px;
height: 36px;
}
}

From Pandas UDFs to Arrow UDFs: Next-Gen Performance
The pandas_udf function requires converting Arrow data to Pandas format and back again for each operation. This serialization cost becomes significant when processing large datasets.
PySpark 3.5+ introduces Arrow-optimized UDFs via the useArrow=True parameter, which operates directly on Arrow data structures, avoiding the Pandas conversion entirely and improving performance.
Let’s compare the performance with a weighted sum calculation across multiple columns on 100,000 rows:
import pandas as pd
import pyarrow.compute as pc
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("UDFComparison").getOrCreate()

# Create test data with multiple numeric columns
data = [(float(i), float(i*2), float(i*3)) for i in range(100000)]
df = spark.createDataFrame(data, ["val1", "val2", "val3"])

Create a timing decorator to measure the execution time of the functions:
import time
from functools import wraps

# Timing decorator
def timer(func):
@wraps(func)
def wrapper(*args, **kwargs):
start = time.time()
result = func(*args, **kwargs)
elapsed = time.time() – start
print(f"{func.__name__}: {elapsed:.2f}s")
wrapper.elapsed_time = elapsed
return result

return wrapper

Use the timing decorator to measure the execution time of the pandas_udf function:
@pandas_udf(DoubleType())
def weighted_sum_pandas(v1: pd.Series, v2: pd.Series, v3: pd.Series) -> pd.Series:
return v1 * 0.5 + v2 * 0.3 + v3 * 0.2

@timer
def run_pandas_udf():
result = df.select(
weighted_sum_pandas(df.val1, df.val2, df.val3).alias("weighted")
)
result.count() # Trigger computation
return result

result_pandas = run_pandas_udf()
pandas_time = run_pandas_udf.elapsed_time

run_pandas_udf: 1.33s

Use the timing decorator to measure the execution time of the Arrow-optimized UDF using useArrow:
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

@udf(DoubleType(), useArrow=True)
def weighted_sum_arrow(v1, v2, v3):
term1 = pc.multiply(v1, 0.5)
term2 = pc.multiply(v2, 0.3)
term3 = pc.multiply(v3, 0.2)
return pc.add(pc.add(term1, term2), term3)

@timer
def run_arrow_udf():
result = df.select(
weighted_sum_arrow(df.val1, df.val2, df.val3).alias("weighted")
)
result.count() # Trigger computation
return result

result_arrow = run_arrow_udf()
arrow_time = run_arrow_udf.elapsed_time

run_arrow_udf: 0.43s

Measure the speedup:
speedup = pandas_time / arrow_time
print(f"Speedup: {speedup:.2f}x faster")

Speedup: 3.06x faster

The output shows that the Arrow-optimized version is 3.06x faster than the pandas_udf version!
The performance gain comes from avoiding serialization. Arrow-optimized UDFs use PyArrow compute functions like pc.multiply() and pc.add() directly on Arrow data, while pandas_udf must convert each column to Pandas and back.
Trade-off: The 3.06x performance improvement comes at the cost of using PyArrow’s less familiar compute API instead of Pandas operations. However, this becomes increasingly valuable as dataset size and column count grow.
Native Data Visualization (PySpark 4.0+)
Visualizing PySpark DataFrames traditionally requires converting to Pandas first, then using external libraries like matplotlib or plotly. This adds memory overhead and extra processing steps.
PySpark 4.0 introduces a native plotting API powered by Plotly, enabling direct visualization from PySpark DataFrames without any conversion.
Let’s visualize sales data across product categories:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Visualization").getOrCreate()

# Create sample sales data
sales_data = [
("Electronics", 5000, 1200),
("Electronics", 7000, 1800),
("Clothing", 3000, 800),
("Clothing", 4500, 1100),
("Furniture", 6000, 1500),
("Furniture", 8000, 2000),
]

sales_df = spark.createDataFrame(sales_data, ["category", "sales", "profit"])
sales_df.show()

+———–+—–+——+
| category|sales|profit|
+———–+—–+——+
|Electronics| 5000| 1200|
|Electronics| 7000| 1800|
| Clothing| 3000| 800|
| Clothing| 4500| 1100|
| Furniture| 6000| 1500|
| Furniture| 8000| 2000|
+———–+—–+——+

Create a scatter plot directly from the PySpark DataFrame using the .plot() method:
# Direct plotting without conversion
sales_df.plot(kind="scatter", x="sales", y="profit", color="category")

You can also use shorthand methods such as plot.scatter() and plot.bar() for specific chart types:
# Scatter plot with shorthand
sales_df.plot.scatter(x="sales", y="profit", color="category")

# Bar chart by category
category_totals = sales_df.groupBy("category").agg({"sales": "sum"}).withColumnRenamed("sum(sales)", "total_sales")
category_totals.plot.bar(x="category", y="total_sales")

The native plotting API supports 8 chart types:

scatter: Scatter plots with color grouping
bar: Bar charts for categorical comparisons
line: Line plots for time series
area: Area charts for cumulative values
pie: Pie charts for proportions
box: Box plots for distributions
histogram: Histograms for frequency analysis
kde/density: Density plots for probability distributions

By default, PySpark visualizes up to 1,000 rows. For larger datasets, configure the limit:
# Increase visualization row limit
spark.conf.set("spark.sql.pyspark.plotting.max_rows", 5000)

Schema-Free JSON Handling with Variant Type (PySpark 4.0+)
Extracting nested JSON in PySpark requires defining StructType schemas that mirror your data structure. This creates verbose code that breaks whenever your JSON changes.
Let’s extract data from a 3-level nested JSON structure using the traditional approach:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("json_schema").getOrCreate()

# 3 levels of nested StructType – verbose and hard to maintain
schema = StructType([
StructField("user", StructType([
StructField("name", StringType()),
StructField("profile", StructType([
StructField("settings", StructType([
StructField("theme", StringType())
]))
]))
]))
])

json_data = [
'{"user": {"name": "Alice", "profile": {"settings": {"theme": "dark"}}}}',
'{"user": {"name": "Bob", "profile": {"settings": {"theme": "light"}}}}'
]

rdd = spark.sparkContext.parallelize(json_data)
df = spark.read.schema(schema).json(rdd)
df.select("user.name", "user.profile.settings.theme").show()

+—–+—–+
| name|theme|
+—–+—–+
|Alice| dark|
| Bob|light|
+—–+—–+

PySpark 4.0 introduces the Variant type, which lets you skip schema definitions entirely. To work with the Variant type:

Use parse_json() to load JSON data
Use variant_get() to extract fields with JSONPath syntax

from pyspark.sql import SparkSession
from pyspark.sql.functions import parse_json, variant_get

spark = SparkSession.builder.appName("json_variant").getOrCreate()

json_data = [
('{"user": {"name": "Alice", "profile": {"settings": {"theme": "dark"}}}}',),
('{"user": {"name": "Bob", "profile": {"settings": {"theme": "light"}}}}',)
]

df = spark.createDataFrame(json_data, ["json_str"])
df_variant = df.select(parse_json("json_str").alias("data"))

# No schema needed – just use JSONPath
result = df_variant.select(
variant_get("data", "$.user.name", "string").alias("name"),
variant_get("data", "$.user.profile.settings.theme", "string").alias("theme")
)
result.show()

+—–+—–+
| name|theme|
+—–+—–+
|Alice| dark|
| Bob|light|
+—–+—–+

The Variant type provides several advantages:

No upfront schema definition: Handle any JSON structure without verbose StructType definitions
JSONPath syntax: Access nested paths using $.path.to.field notation regardless of depth
Schema flexibility: JSON structure changes don’t break your code
Type safety: variant_get() lets you specify the expected type when extracting fields

Dynamic Schema Generation with UDTF analyze() (PySpark 4.0+)
Python UDTFs (User-Defined Table Functions) generate multiple rows from a single input row, but they come with a critical limitation: you must define the output schema upfront. When your output columns depend on the input data itself (like creating pivot tables or dynamic aggregations where column names come from data values), this rigid schema requirement becomes a problem.
For example, a word-counting UDTF requires you to specify all output columns upfront, even though the words themselves are unknown until runtime.
from pyspark.sql.functions import udtf, lit
from pyspark.sql.types import StructType, StructField, IntegerType

# Schema must be defined upfront with fixed column names
@udtf(returnType=StructType([
StructField("hello", IntegerType()),
StructField("world", IntegerType()),
StructField("spark", IntegerType())
]))
class StaticWordCountUDTF:
def eval(self, text: str):
words = text.split(" ")
yield tuple(words.count(word) for word in ["hello", "world", "spark"])

# Only works for exactly these three words
result = StaticWordCountUDTF(lit("hello world hello spark"))
result.show()

+—–+—–+—–+
|hello|world|spark|
+—–+—–+—–+
| 2| 1| 1|
+—–+—–+—–+

If the input text contains a different set of words, the output won’t contain the count of the new words.
result = StaticWordCountUDTF(lit("hi world hello spark"))
result.show()

+—–+—–+—–+
|hello|world|spark|
+—–+—–+—–+
| 1| 1| 1|
+—–+—–+—–+

PySpark 4.0 introduces the analyze() method for UDTFs, enabling dynamic schema determination based on input data. Instead of hardcoding your output schema, analyze() inspects the input and generates the appropriate columns at runtime.
from pyspark.sql.functions import udtf, lit
from pyspark.sql.types import StructType, IntegerType
from pyspark.sql.udtf import AnalyzeArgument, AnalyzeResult

@udtf
class DynamicWordCountUDTF:
@staticmethod
def analyze(text: AnalyzeArgument) -> AnalyzeResult:
"""Dynamically create schema based on input text"""
schema = StructType()
# Create one column per unique word in the input
for word in sorted(set(text.value.split(" "))):
schema = schema.add(word, IntegerType())
return AnalyzeResult(schema=schema)

def eval(self, text: str):
"""Generate counts for each word"""
words = text.split(" ")
# Use same logic as analyze() to determine column order
unique_words = sorted(set(words))
yield tuple(words.count(word) for word in unique_words)

# Schema adapts to any input text
result = DynamicWordCountUDTF(lit("hello world hello spark"))
result.show()

+—–+—–+—–+
|hello|spark|world|
+—–+—–+—–+
| 2| 1| 1|
+—–+—–+—–+

Now try with completely different words:
# Different words – schema adapts automatically
result2 = DynamicWordCountUDTF(lit("python data science"))
result2.show()

+—-+——+——-+
|data|python|science|
+—-+——+——-+
| 1| 1| 1|
+—-+——+——-+

The columns change from hello, spark, world to data, python, science without any code modifications.
Conclusion
PySpark 4.0 makes distributed computing faster and easier to use. Arrow-optimized UDFs speed up custom transformations, the Variant type simplifies JSON handling, native visualization removes conversion steps, and dynamic UDTFs handle flexible data structures.
These improvements address real bottlenecks without requiring major code changes, making PySpark more practical for everyday data engineering tasks.

Stay Current with CodeCut
Actionable Python tips, curated for busy data pros. Skim in under 2 minutes, three times a week.

.codecut-subscribe-form {
max-width: 650px;
display: flex;
flex-direction: column;
gap: 8px;
}

input[type=”email”].codecut-input {
border-radius: 8px !important;
}

.codecut-input::placeholder {
color: #666666;
}

.codecut-email-row {
display: flex;
align-items: stretch;
height: 36px;
gap: 8px;
}

.codecut-email-row .codecut-input {
flex: 1;
}

.codecut-subscribe-btn:hover {
background: #5aa8e8;
}

.codecut-subscribe-btn:disabled {
background: #999;
cursor: not-allowed;
}

.codecut-message {
font-family: ‘Comfortaa’, sans-serif;
font-size: 12px;
padding: 8px;
border-radius: 6px;
display: none;
}

.codecut-message.success {
background: #d4edda;
color: #155724;
display: block;
}

/* Mobile responsive */
@media (max-width: 480px) {
.codecut-email-row {
flex-direction: column;
height: auto;
gap: 8px;
}

.codecut-input {
border-radius: 8px;
height: 36px;
}

.codecut-subscribe-btn {
width: 100%;
text-align: center;
border-radius: 8px;
height: 36px;
}
}

PySpark 4.0: 4 Features That Change How You Process Data Read More »

python

The Hidden Cost of Python Dictionaries (And 3 Safer Alternatives)

pandas vs Polars vs DuckDB: A Data Scientist’s Guide to Choosing the Right Tool

PySpark 4.0: 4 Features That Change How You Process Data

Drop a line

Get in touch

Follow Us on Social Media

python

Work with Khuyen Tran

Work with Khuyen Tran