Pandas

What’s New in pandas 3.0: Expressions, Copy-on-Write, and Faster Strings

Table of Contents

Introduction
Setup
Cleaner Column Operations with pd.col
Copy-on-Write Is Now the Default
A Dedicated String Dtype
Final Thoughts

Introduction
pandas 3.0 brings some of the most significant changes to the library in years. This article covers:

pd.col expressions: Cleaner column operations without lambdas
Copy-on-Write: Predictable copy behavior by default
PyArrow-backed strings: Faster operations and better type safety

💻 Get the Code: The complete source code and Jupyter notebook for this tutorial are available on GitHub. Clone it to follow along!


Setup
pandas 3.0 requires Python 3.11 or higher. Install it with:
pip install --upgrade pandas

To test these features before upgrading, enable them in pandas 2.3:
import pandas as pd

pd.options.future.infer_string = True
pd.options.mode.copy_on_write = True

Cleaner Column Operations with pd.col
The Traditional Approaches
If you’ve ever had to modify an existing column or create a new one, you may be used to one of these approaches.
Square-bracket notation is the most common way to add a column. You reference the new column name and assign the result:
import pandas as pd

df = pd.DataFrame({"temp_c": [0, 20, 30, 100]})
df['temp_f'] = df['temp_c'] * 9/5 + 32
df

   temp_c  temp_f
0       0    32.0
1      20    68.0
2      30    86.0
3     100   212.0

This modifies your original DataFrame in place, which means you can’t compare before and after without first making a copy.
df_original = pd.DataFrame({"temp_c": [0, 20, 30]})
df_original['temp_f'] = df_original['temp_c'] * 9/5 + 32
# df_original is now modified – no way to see the original state
df_original

   temp_c  temp_f
0       0    32.0
1      20    68.0
2      30    86.0

It also doesn’t return anything, so you can’t chain it with other operations. Method-chaining lets you write df.assign(…).query(…).sort_values(…) in one expression instead of multiple separate statements.
df = pd.DataFrame({"temp_c": [0, 20, 30]})

# This doesn't work – assignment is a statement, so there is nothing to chain on
# df['temp_f'] = df['temp_c'] * 9/5 + 32.query('temp_f > 50')

# You need separate statements instead
df['temp_f'] = df['temp_c'] * 9/5 + 32
df = df.query('temp_f > 50')
df

   temp_c  temp_f
1      20    68.0
2      30    86.0

Using assign solves the chaining problem by returning a new DataFrame instead of modifying in-place:
df = pd.DataFrame({"temp_c": [0, 20, 30, 100]})
df = (
df.assign(temp_f=lambda x: x['temp_c'] * 9/5 + 32)
.query('temp_f > 50')
)
df

   temp_c  temp_f
1      20    68.0
2      30    86.0
3     100   212.0

This works for chaining but relies on lambda functions. Lambda functions capture variables by reference, not by value, which can cause bugs:
df = pd.DataFrame({"x": [1, 2, 3]})
results = {}
for factor in [10, 20, 30]:
    results[f'x_times_{factor}'] = lambda df: df['x'] * factor

df = df.assign(**results)
df

   x  x_times_10  x_times_20  x_times_30
0  1          30          30          30
1  2          60          60          60
2  3          90          90          90

What went wrong: We expected x_times_10 to multiply by 10, x_times_20 by 20, and x_times_30 by 30. Instead, all three columns multiply by 30.
Why: Lambdas don’t capture values; they capture references to variables. All three lambdas refer to the same variable factor. After the loop ends, factor is 30, so when assign() executes the lambdas, each one reads factor and gets 30.
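This is standard Python closure behavior rather than anything pandas-specific. A quick way to confirm it (and the classic pre-3.0 workaround) is to bind the current value as a default argument. The sketch below is illustrative and reuses the same toy DataFrame:

import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})
results = {}
for factor in [10, 20, 30]:
    # f=factor is evaluated now, so each lambda keeps its own multiplier
    results[f'x_times_{factor}'] = lambda df, f=factor: df['x'] * f

df.assign(**results)  # columns now multiply by 10, 20, and 30 as expected
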
The pandas 3.0 Solution: pd.col
pandas 3.0 introduces pd.col, which lets you reference columns without lambda functions. The syntax is borrowed from PySpark and Polars.
Here’s the temp_f conversion rewritten with pd.col:
df = pd.DataFrame({"temp_c": [0, 20, 30, 100]})
df = df.assign(temp_f=pd.col('temp_c') * 9/5 + 32)
df

   temp_c  temp_f
0       0    32.0
1      20    68.0
2      30    86.0
3     100   212.0

Unlike square-bracket notation, pd.col supports method-chaining. Unlike lambdas, it doesn’t capture variables by reference, so you avoid the scoping bugs shown earlier.
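Because a pd.col expression is just an object you pass to assign, the filter-and-derive pipeline from earlier also fits in a single chain. A quick sketch reusing the same data:

df = pd.DataFrame({"temp_c": [0, 20, 30, 100]})
df = (
    df.assign(temp_f=pd.col('temp_c') * 9/5 + 32)
    .query('temp_f > 50')
)
df
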
Remember the lambda scoping bug? With pd.col, each multiplier is captured correctly:
df = pd.DataFrame({"x": [1, 2, 3]})
results = {}
for factor in [10, 20, 30]:
    results[f'x_times_{factor}'] = pd.col('x') * factor

df = df.assign(**results)
df

   x  x_times_10  x_times_20  x_times_30
0  1          10          20          30
1  2          20          40          60
2  3          30          60          90

Filtering with Expressions
Traditional boolean filtering makes you reference df twice:
df = pd.DataFrame({"temp_c": [-10, 0, 15, 25, 30]})
df = df.loc[df['temp_c'] >= 0] # df appears twice
df

   temp_c
1       0
2      15
3      25
4      30

With pd.col, you reference the column directly:
df = pd.DataFrame({"temp_c": [-10, 0, 15, 25, 30]})
df = df.loc[pd.col('temp_c') >= 0] # cleaner
df

   temp_c
1       0
2      15
3      25
4      30

Combining Multiple Columns
With lambdas, you need to repeat lambda x: x[…] for every column:
df = pd.DataFrame({
"price": [100, 200, 150],
"quantity": [2, 3, 4]
})

df = df.assign(
total=lambda x: x["price"] * x["quantity"],
discounted=lambda x: x["price"] * x["quantity"] * 0.9
)
df

   price  quantity  total  discounted
0    100         2    200       180.0
1    200         3    600       540.0
2    150         4    600       540.0

With pd.col, the same logic is more readable:
df = pd.DataFrame({
"price": [100, 200, 150],
"quantity": [2, 3, 4]
})

df = df.assign(
total=pd.col("price") * pd.col("quantity"),
discounted=pd.col("price") * pd.col("quantity") * 0.9
)
df

   price  quantity  total  discounted
0    100         2    200       180.0
1    200         3    600       540.0
2    150         4    600       540.0

Note that, unlike Polars and PySpark, pd.col cannot yet be used in groupby operations:
# This works in Polars: df.group_by("category").agg(pl.col("value").mean())
# But this doesn't work in pandas 3.0:
df.groupby("category").agg(pd.col("value").mean()) # Not supported yet

This limitation may be removed in future versions.
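In the meantime, the usual groupby patterns still work. A minimal sketch of the equivalent aggregation without pd.col (the category/value columns here are made up for illustration):

df = pd.DataFrame({"category": ["a", "a", "b"], "value": [1.0, 2.0, 3.0]})

# Named aggregation works today and produces the same result
df.groupby("category").agg(value_mean=("value", "mean"))
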
Copy-on-Write Is Now the Default
If you’ve used pandas, you’ve probably seen the SettingWithCopyWarning at some point. It appears when pandas can’t tell if you’re modifying a view or a copy of your data:
# This pattern caused confusion in pandas < 3.0
df2 = df[df["value"] > 10]
df2["status"] = "high" # SettingWithCopyWarning!

Did this modify df or just df2? The answer depends on whether df2 is a view or a copy, and pandas can’t always predict which one it created. That’s what the warning is telling you.
pandas 3.0 makes the answer simple: filtering with df[…] always returns a copy. Modifying df2 never affects df.
This is called Copy-on-Write (CoW). If you just read df2, pandas shares memory with df. Only when you change df2 does pandas create a separate copy.
Now when you filter and modify, there’s no warning and no uncertainty:
df = pd.DataFrame({"value": [5, 15, 25], "status": ["low", "low", "low"]})

# pandas 3.0: just works, no warning
df2 = df[df["value"] > 10]
df2["status"] = "high" # Modifies df2 only, not df

df2

   value status
1     15   high
2     25   high

df

   value status
0      5    low
1     15    low
2     25    low

We can see that df is unchanged and no warning was raised.
Breaking Change: Chained Assignment
One pattern that breaks is chained assignment. With CoW, df["foo"] is a copy, so assigning to it only modifies the copy and doesn’t modify the original:
# This NO LONGER modifies df in pandas 3.0:
df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 6, 8]})

df["foo"][df["bar"] > 5] = 100
df

   foo  bar
0    1    4
1    2    6
2    3    8

Notice foo still contains [1, 2, 3]. This is because the value 100 was assigned to a copy that was immediately discarded.
Use .loc instead to modify the original DataFrame:
df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 6, 8]})
df.loc[df["bar"] > 5, "foo"] = 100
df

   foo  bar
0    1    4
1  100    6
2  100    8

A Dedicated String Dtype
pandas 2.x stores strings as object dtype, which is both slow and ambiguous. You can’t tell from the dtype alone whether a column is purely strings:
pd.options.future.infer_string = False # pandas 2.x behavior

text = pd.Series(["hello", "world"])
messy = pd.Series(["hello", 42, {"key": "value"}])

print(f"text dtype: {text.dtype}")
print(f"messy dtype: {messy.dtype}")

text dtype: object
messy dtype: object

pandas 3.0 introduces a dedicated str dtype that only holds strings, making the type immediately clear:
pd.options.future.infer_string = True # pandas 3.0 behavior

ser = pd.Series(["a", "b", "c"])
print(f"dtype: {ser.dtype}")

dtype: str

Performance Gains
The new string dtype is backed by PyArrow (if installed), which provides significant performance improvements:

String operations run 5-10x faster because PyArrow processes data in contiguous memory blocks instead of individual Python objects
Memory usage reduced by up to 50% since strings are stored in a compact binary format rather than as Python objects with overhead
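
Exact numbers depend on your data, your pandas and PyArrow versions, and your hardware. For a rough check on your own strings, here is a sketch comparing the object dtype with the PyArrow-backed string dtype (spelled string[pyarrow] in pandas 2.x and requiring PyArrow to be installed):

import time

import pandas as pd

words = ["hello", "world", "pandas"] * 1_000_000

object_ser = pd.Series(words, dtype="object")          # pandas 2.x default storage
arrow_ser = pd.Series(words, dtype="string[pyarrow]")  # PyArrow-backed strings

# Compare memory footprints
print(f"object: {object_ser.memory_usage(deep=True) / 1e6:.0f} MB")
print(f"arrow:  {arrow_ser.memory_usage(deep=True) / 1e6:.0f} MB")

# Time a typical string operation on each (rough single-run timing)
for name, ser in [("object", object_ser), ("arrow", arrow_ser)]:
    start = time.perf_counter()
    ser.str.upper()
    print(f"{name} .str.upper(): {time.perf_counter() - start:.2f} s")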

Arrow Ecosystem Interoperability
DataFrames can be passed to Arrow-based tools like Polars and DuckDB without copying or converting data:
import polars as pl

pandas_df = pd.DataFrame({"name": ["alice", "bob", "charlie"]})
polars_df = pl.from_pandas(pandas_df) # Zero-copy – data already in Arrow format
polars_df

      name
0    alice
1      bob
2  charlie

Final Thoughts
pandas 3.0 brings meaningful improvements to your daily workflow:

Write cleaner code with pd.col expressions instead of lambdas
Avoid SettingWithCopyWarning confusion with Copy-on-Write as the default
Get 5-10x faster string operations with the new PyArrow-backed str dtype
Pass DataFrames to Polars and DuckDB without data conversion

Related Resources
For more on DataFrame tools and performance optimization:

Polars vs. Pandas: A Fast, Multi-Core Alternative for DataFrames – Compare pandas with Polars for performance-critical workflows
Scaling Pandas Workflows with PySpark’s Pandas API – Use familiar pandas syntax on distributed data
pandas vs Polars vs DuckDB: A Data Scientist’s Guide – Choose the right tool for your data analysis needs

💡 The expressions section was inspired by a blog post contributed by Marco Gorelli, Senior Software Engineer at Quansight Labs.



Great Tables: Publication-Ready Tables from Polars and Pandas DataFrames

Table of Contents

Introduction
Introduction to Great Tables
Setup
Value Formatting
Table Structure
Data Coloring
Nanoplots
Conditional Styling
Conclusion

Introduction
Data scientists spend significant time analyzing data, but presenting results professionally remains a challenge.
Raw DataFrames with unformatted numbers, ISO dates, and no visual hierarchy make reports hard to read.
The common workaround is exporting to CSV and formatting in Excel. This is slow, error-prone, and breaks with every data update.
Great Tables solves this problem by letting you create publication-ready tables directly in Python with a single, reproducible script.

💻 Get the Code: The complete source code and Jupyter notebook for this tutorial are available on GitHub. Clone it to follow along!


Introduction to Great Tables
Great Tables is a Python library for creating publication-quality tables from pandas or Polars DataFrames. It provides:

Value formatting: Transform raw numbers into currencies, percentages, dates, and more
Table structure: Add headers, column spanners, row labels, and source notes
Data-driven coloring: Apply color scales based on cell values
Inline visualizations: Embed sparklines (nanoplots) directly in table cells
Conditional styling: Style cells based on data conditions

Let’s dive deeper into each of these features in the next sections.
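Before going feature by feature, here is a minimal sketch of the pattern every section below follows: wrap a DataFrame in GT, then chain formatting methods onto it (the data is made up):

import polars as pl
from great_tables import GT

df = pl.DataFrame({"item": ["Laptop", "Phone"], "price": [1299.5, 849.0]})

(
    GT(df)
    .tab_header(title="Minimal Great Tables example")
    .fmt_currency(columns="price")
)
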
Setup
Great Tables works with both pandas and Polars DataFrames. We’ll use Polars in this tutorial:
pip install great_tables polars selenium

Selenium is required for exporting tables as PNG images.

New to Polars? See our Polars vs. Pandas comparison for an introduction.

Great Tables includes built-in sample datasets. We’ll use the sp500 dataset containing historical S&P 500 stock data:
from great_tables import GT
from great_tables.data import sp500
import polars as pl

# Preview the raw data
sp500_df = pl.from_pandas(sp500)
print(sp500_df.head(5))

date        open       high       low      close      volume    adj_close
2015-12-31  2060.5901  2062.54    2043.62  2043.9399  2.6553e9  2043.9399
2015-12-30  2077.3401  2077.3401  2061.97  2063.3601  2.3674e9  2063.3601
2015-12-29  2060.54    2081.5601  2060.54  2078.3601  2.5420e9  2078.3601
2015-12-28  2057.77    2057.77    2044.2   2056.5     2.4925e9  2056.5
2015-12-24  2063.52    2067.3601  2058.73  2060.99    1.4119e9  2060.99

The raw output shows:

Unformatted decimals (e.g., 2060.5901)
Large integers without separators (e.g., 2.6553e9)
Dates as plain strings (e.g., “2015-12-31”)

Let’s transform this into a readable table.
Value Formatting
Great Tables provides fmt_* methods to format values. Here’s how to format currencies, numbers, and dates:
from great_tables import GT
from great_tables.data import sp500

# Filter to a specific date range
start_date = "2010-06-07"
end_date = "2010-06-14"
sp500_mini = sp500[(sp500["date"] >= start_date) & (sp500["date"] <= end_date)]

stock_price_table = (
GT(sp500_mini)
.fmt_currency(columns=["open", "high", "low", "close"])
.fmt_date(columns="date", date_style="wd_m_day_year")
.fmt_number(columns="volume", compact=True)
.cols_hide(columns="adj_close")
)
stock_price_table

In this example:

fmt_currency() adds dollar signs and formats decimals (e.g., $1,065.84)
fmt_date() converts date strings to readable format (e.g., “Mon, Jun 7, 2010”)
fmt_number() with compact=True converts large numbers to compact format (e.g., 5.47B)
cols_hide() removes the redundant adj_close column

To export the table for reports, use the save() method:
stock_price_table.save("stock_price_table.png") # Supports .png, .bmp, .pdf

Formatting Percentages
Use fmt_percent() to display decimal values as percentages. Here’s some sample data with decimal values:
import polars as pl
from great_tables import GT

performance_data = pl.DataFrame({
"metric": ["Revenue Growth", "Profit Margin", "Market Share"],
"q1": [0.12, 0.08, 0.23],
"q2": [0.15, 0.09, 0.25],
"q3": [0.11, 0.07, 0.24]
})
performance_data

The raw decimals are hard to read at a glance. Let’s format them as percentages:
percent_table = (
GT(performance_data, rowname_col="metric")
.fmt_percent(columns=["q1", "q2", "q3"], decimals=1)
)
percent_table

The percentages are now much more readable! Values like 0.12 become “12.0%” automatically.
Table Structure
Professional tables need clear headers, grouped columns, and source attribution. Great Tables provides methods for each structural component.
Adding Headers and Source Notes
Use tab_header() for titles and tab_source_note() for attribution. Let’s start with our S&P 500 data:
from great_tables import GT, md
from great_tables.data import sp500
import polars as pl

sp500_pl = pl.from_pandas(sp500)
sp500_mini = sp500_pl.filter(
(pl.col("date") >= "2010-06-07") & (pl.col("date") <= "2010-06-14")
)
print(sp500_mini)

date        open       high       low      close      volume    adj_close
2010-06-14  1095.0     1105.91    1089.03  1089.63    4.4258e9  1089.63
2010-06-11  1082.65    1092.25    1077.12  1091.6     4.0593e9  1091.6
2010-06-10  1058.77    1087.85    1058.77  1086.84    5.1448e9  1086.84
2010-06-09  1062.75    1077.74    1052.25  1055.6899  5.9832e9  1055.6899
2010-06-08  1050.8101  1063.15    1042.17  1062.0     6.1928e9  1062.0
2010-06-07  1065.84    1071.36    1049.86  1050.47    5.4676e9  1050.47

The table lacks context about what the data represents. Let’s add a title and source:
header_table = (
GT(sp500_mini)
.tab_header(
title="S&P 500 Daily Performance",
subtitle="June 7-14, 2010"
)
.fmt_currency(columns=["open", "high", "low", "close"])
.fmt_date(columns="date", date_style="wd_m_day_year")
.fmt_number(columns="volume", compact=True)
.cols_hide(columns="adj_close")
.tab_source_note(source_note=md("**Source**: Historical market data"))
)
header_table

In this example:

tab_header() adds “S&P 500 Daily Performance” as the title and “June 7-14, 2010” as the subtitle
tab_source_note() adds “Source: Historical market data” at the bottom
md() enables markdown formatting for bold text

Grouping Columns with Spanners
Column spanners group related columns under a shared label. Here’s some quarterly sales data:
import polars as pl
from great_tables import GT

sales_data = pl.DataFrame({
"product": ["Laptop", "Phone", "Tablet"],
"q1_rev": [125000, 89000, 45000],
"q2_rev": [132000, 95000, 48000],
"q1_units": [450, 1200, 380],
"q2_units": [475, 1350, 410]
})
print(sales_data)

product  q1_rev  q2_rev  q1_units  q2_units
Laptop   125000  132000       450       475
Phone     89000   95000      1200      1350
Tablet    45000   48000       380       410

The column names like q1_rev and q1_units don’t clearly show their relationship. Let’s group them with spanners:
spanner_table = (
GT(sales_data, rowname_col="product")
.tab_header(title="Quarterly Sales Report")
.tab_spanner(label="Revenue ($)", columns=["q1_rev", "q2_rev"])
.tab_spanner(label="Units Sold", columns=["q1_units", "q2_units"])
.fmt_currency(columns=["q1_rev", "q2_rev"], decimals=0)
.fmt_number(columns=["q1_units", "q2_units"], use_seps=True)
.cols_label(
q1_rev="Q1",
q2_rev="Q2",
q1_units="Q1",
q2_units="Q2"
)
.tab_stubhead(label="Product")
)
spanner_table

In this example:

tab_spanner() creates “Revenue ($)” and “Units Sold” headers that span multiple columns
cols_label() renames columns like q1_rev to “Q1”
tab_stubhead() labels the row name column as “Product”

Data Coloring
The data_color() method applies color scales to cells based on their values, creating heatmap-style visualizations. Here’s some regional performance data:
import polars as pl
from great_tables import GT

performance = pl.DataFrame({
"region": ["North", "South", "East", "West"],
"revenue": [125000, 98000, 145000, 112000],
"growth": [0.15, -0.05, 0.22, 0.08]
})
print(performance)

region  revenue  growth
North    125000    0.15
South     98000   -0.05
East     145000    0.22
West     112000    0.08

The raw numbers make it hard to spot which regions are performing well. Let’s add color scales:
color_table = (
GT(performance, rowname_col="region")
.fmt_currency(columns="revenue", decimals=0)
.fmt_percent(columns="growth", decimals=1)
.data_color(
columns="revenue",
palette="Blues"
)
.data_color(
columns="growth",
palette=["red", "white", "green"],
domain=[-0.1, 0.25]
)
)
color_table

Now high performers stand out immediately! In this example:

palette="Blues" applies a blue gradient to revenue (darker = higher values like $145,000)
palette=["red", "white", "green"] creates a diverging scale for growth (red for -5.0%, green for 22.0%)
domain=[-0.1, 0.25] sets the min/max range for the color scale

Nanoplots
Nanoplots embed small visualizations directly in table cells. They’re useful for showing trends without creating separate charts.
Creating Line Nanoplots
To use nanoplots, your data column must contain space-separated numeric values:
import polars as pl
from great_tables import GT

# Create data with trend values as space-separated strings
kpi_data = pl.DataFrame({
"metric": ["Revenue", "Users", "Conversion Rate"],
"current": [125000.0, 45000.0, 3.2],
"trend": [
"95 102 98 115 125",
"38 40 42 43 45",
"2.8 2.9 3.0 3.1 3.2"
]
})

kpi_table = (
GT(kpi_data, rowname_col="metric")
.fmt_nanoplot(columns="trend", plot_type="line")
.fmt_number(columns="current", compact=True)
.tab_header(title="Weekly KPI Dashboard")
)
kpi_table

The sparklines make trends instantly visible! fmt_nanoplot() transforms space-separated values like “95 102 98 115 125” into inline charts.
Hover over the chart to see individual data points.
Adding Reference Lines
Reference lines provide context by showing averages, medians, or custom thresholds:
import polars as pl
from great_tables import GT

trend_data = pl.DataFrame({
"stock": ["AAPL", "GOOGL", "MSFT"],
"prices": [
"150 155 148 160 165 158 170",
"120 118 122 125 128 130 127",
"280 285 275 290 295 288 300"
]
})

stock_trend_table = (
GT(trend_data, rowname_col="stock")
.fmt_nanoplot(
columns="prices",
plot_type="line",
reference_line="mean"
)
.tab_header(title="Weekly Stock Prices")
)
stock_trend_table

The reference_line="mean" parameter adds a horizontal line at the average value. Other options include "median", "min", "max", "q1", and "q3".
Bar Nanoplots
Use plot_type="bar" for comparing discrete values:
import polars as pl
from great_tables import GT

monthly_data = pl.DataFrame({
"category": ["Electronics", "Clothing", "Food"],
"sales": [
"45 52 48 55 60 58",
"30 28 35 32 38 40",
"20 22 21 25 24 26"
]
})

bar_chart_table = (
GT(monthly_data, rowname_col="category")
.fmt_nanoplot(columns="sales", plot_type="bar")
.tab_header(title="Monthly Sales by Category")
)
bar_chart_table

Customizing Nanoplot Appearance
Pass styling options via nanoplot_options():

Line: data_line_stroke_color (e.g., “steelblue”)
Points: data_point_fill_color, data_point_stroke_color
Area: data_area_fill_color (e.g., “lightblue”)

from great_tables import GT, nanoplot_options
import polars as pl

trend_data = pl.DataFrame({
"metric": ["Growth", "Engagement"],
"values": ["10 15 12 18 22 20", "5 8 6 9 11 10"]
})

styled_nanoplot_table = (
GT(trend_data, rowname_col="metric")
.fmt_nanoplot(
columns="values",
plot_type="line",
reference_line="mean",
options=nanoplot_options(
data_line_stroke_color="steelblue",
data_point_fill_color="white",
data_point_stroke_color="steelblue",
data_area_fill_color="lightblue"
)
)
)
styled_nanoplot_table

Conditional Styling
The tab_style() method applies formatting to cells based on conditions. Combined with Polars expressions, you can create data-driven styling rules.
Basic Conditional Styling
Here’s some product sales data with mixed growth values:
from great_tables import GT, style, loc
import polars as pl

sales = pl.DataFrame({
"product": ["Laptop", "Phone", "Tablet", "Monitor"],
"revenue": [125000, 89000, 45000, 32000],
"growth": [0.15, -0.05, 0.22, -0.08]
})
print(sales)

product  revenue  growth
Laptop    125000    0.15
Phone      89000   -0.05
Tablet     45000    0.22
Monitor    32000   -0.08

Some products have positive growth, others negative. Let’s use tab_style() with Polars expressions to apply conditional colors:
conditional_table = (
GT(sales, rowname_col="product")
.fmt_currency(columns="revenue", decimals=0)
.fmt_percent(columns="growth", decimals=1)
.tab_style(
style=[
style.fill(color="lightgreen"),
style.text(weight="bold")
],
locations=loc.body(
columns="growth",
rows=pl.col("growth") > 0
)
)
.tab_style(
style=[
style.fill(color="lightcoral"),
style.text(weight="bold")
],
locations=loc.body(
columns="growth",
rows=pl.col("growth") < 0
)
)
)
conditional_table

The styling makes values immediately visible:

pl.col("growth") > 0 – selects rows with positive growth
pl.col("growth") < 0 – selects rows with negative growth

Conclusion
Great Tables transforms how data scientists present tabular data. Instead of manual formatting in spreadsheets, you can:

Format currencies, percentages, and dates automatically
Structure tables with headers, column groups, and source notes
Highlight patterns with automatic color scales
Show trends with inline sparkline charts
Apply conditional styling based on data values

The key advantage is reproducibility. When your data updates, you can re-run the script to regenerate the formatted table with consistent styling.
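In practice that usually means wrapping the table-building code in a small function you can re-run whenever the data changes. A hedged sketch (the file names and title are hypothetical):

import polars as pl
from great_tables import GT

def build_report(csv_path: str, output_path: str) -> None:
    # Rebuild the formatted table from the latest data and export it
    df = pl.read_csv(csv_path)
    table = GT(df).tab_header(title="Latest figures")
    table.save(output_path)  # PNG export requires selenium, as noted in Setup

build_report("metrics.csv", "metrics.png")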

📚 For comprehensive guidance on building reproducible data workflows, check out Production-Ready Data Science.

Great Tables is particularly useful for:

Financial reports with currency and percentage formatting
Performance dashboards with trend indicators
Research papers requiring publication-quality tables
Automated reporting pipelines

For more features including custom themes, image embedding, and interactive outputs, see the Great Tables documentation.
Related Tutorials

Top 6 Python Libraries for Visualization for interactive and static data visualizations
Marimo: A Modern Notebook for Reproducible Data Science for reproducible notebook workflows

📚 Want to go deeper? Learning new techniques is the easy part. Knowing how to structure, test, and deploy them is what separates side projects from real work. My book shows you how to build data science projects that actually make it to production. Get the book →



pandas vs Polars vs DuckDB: A Data Scientist’s Guide to Choosing the Right Tool

Table of Contents

Introduction
Tool Strengths at a Glance
Setup
Syntax Comparison
Data Loading Performance
Query Optimization
GroupBy Performance
Memory Efficiency
Join Operations
Interoperability
Decision Matrix
Final Thoughts

Introduction
pandas has been the standard tool for working with tabular data in Python for over a decade. But as datasets grow larger and performance requirements increase, two modern alternatives have emerged: Polars, a DataFrame library written in Rust, and DuckDB, an embedded SQL database optimized for analytics.
Each tool excels in different scenarios:

Tool    Backend   Execution Model             Best For
pandas  C/Python  Eager, single-threaded      Small datasets, prototyping, ML integration
Polars  Rust      Lazy/Eager, multi-threaded  Large-scale analytics, data pipelines
DuckDB  C++       SQL-first, multi-threaded   SQL workflows, embedded analytics, file queries

This guide compares all three tools with practical examples, helping you choose the right one for your workflow.

💻 Get the Code: The complete source code and Jupyter notebook for this tutorial are available on GitHub. Clone it to follow along!


Tool Strengths at a Glance
pandas
pandas is the original DataFrame library for Python that excels at interactive data exploration and integrates seamlessly with the ML ecosystem. Key capabilities include:

Direct compatibility with scikit-learn, statsmodels, and visualization libraries
Rich ecosystem of extensions (pandas-profiling, pandasql, etc.)
Mature time series functionality
Familiar syntax that most data scientists already know

Polars
Polars is a Rust-powered DataFrame library designed for speed, bringing multi-threaded execution and query optimization to Python. Key capabilities include:

Speeds up operations by using all available CPU cores by default
Builds a query plan first, then executes only what’s needed
Streaming mode for processing datasets larger than RAM
Expressive method chaining with a pandas-like API

DuckDB
DuckDB is an embedded SQL database optimized for analytics that brings database-level query optimization to local files. Key capabilities include:

Native SQL syntax with full analytical query support
Queries CSV, Parquet, and JSON files directly without loading
Uses disk storage automatically when data exceeds available memory
Zero-configuration embedded database requiring no server setup

Setup
Install all three libraries:
pip install pandas polars duckdb

Generate sample data for benchmarking:
import pandas as pd
import numpy as np

np.random.seed(42)
n_rows = 5_000_000

data = {
"category": np.random.choice(["Electronics", "Clothing", "Food", "Books"], size=n_rows),
"region": np.random.choice(["North", "South", "East", "West"], size=n_rows),
"amount": np.random.rand(n_rows) * 1000,
"quantity": np.random.randint(1, 100, size=n_rows),
}

df_pandas = pd.DataFrame(data)
df_pandas.to_csv("sales_data.csv", index=False)
print(f"Created sales_data.csv with {n_rows:,} rows")

Created sales_data.csv with 5,000,000 rows

Syntax Comparison
All three tools can perform the same operations with different syntax. Here’s a side-by-side comparison of common tasks.
Filtering Rows
pandas:
Uses bracket notation with boolean conditions, which requires repeating the DataFrame name for each condition:
import pandas as pd

df_pd = pd.read_csv("sales_data.csv")
result_bracket = df_pd[(df_pd["amount"] > 500) & (df_pd["category"] == "Electronics")]
result_bracket.head()

       category region      amount  quantity
7   Electronics   West  662.803066        80
15  Electronics  North  826.004963        25
30  Electronics  North  766.081832         7
31  Electronics   West  772.084261        36
37  Electronics   East  527.967145        35

Alternatively, you can use the query() method, which provides cleaner SQL-like syntax:
result_query = df_pd.query("amount > 500 and category == 'Electronics'")

However, since query() is string-based, there’s no IDE autocomplete. Complex operations like string methods still require brackets:
result_str = df_pd[df_pd["category"].str.startswith("Elec")]
result_str.head()

       category region      amount  quantity
2   Electronics  North  450.941022        93
6   Electronics   West  475.843957        61
7   Electronics   West  662.803066        80
15  Electronics  North  826.004963        25
21  Electronics  South  292.399383        13

Polars:
Unlike pandas, Polars uses one syntax for all filters. The pl.col() expressions are type-safe with IDE autocomplete, and handle both simple comparisons and complex operations like string methods:
import polars as pl

df_pl = pl.read_csv("sales_data.csv")
result_pl = df_pl.filter(
(pl.col("amount") > 500) & (pl.col("category") == "Electronics")
)
result_pl.head()

category       region   amount      quantity
str            str      f64         i64
"Electronics"  "West"   662.803066  80
"Electronics"  "North"  826.004963  25
"Electronics"  "North"  766.081832  7
"Electronics"  "West"   772.084261  36
"Electronics"  "East"   527.967145  35

DuckDB:
Uses standard SQL with a WHERE clause, which is immediately readable to anyone who knows SQL.
import duckdb

result_duckdb = duckdb.sql("""
SELECT * FROM 'sales_data.csv'
WHERE amount > 500 AND category = 'Electronics'
""").df()
result_duckdb.head()

      category region      amount  quantity
0  Electronics   West  662.803066        80
1  Electronics  North  826.004963        25
2  Electronics  North  766.081832         7
3  Electronics   West  772.084261        36
4  Electronics   East  527.967145        35

Selecting Columns
pandas:
Double brackets return a DataFrame with selected columns.
result_pd = df_pd[["category", "amount"]]
result_pd.head()

      category      amount
0         Food  516.653322
1        Books  937.337226
2  Electronics  450.941022
3         Food  674.488081
4         Food  188.847906

Polars:
The select() method clearly communicates column selection intent.
result_pl = df_pl.select(["category", "amount"])
result_pl.head()

category       amount
str            f64
"Food"         516.653322
"Books"        937.337226
"Electronics"  450.941022
"Food"         674.488081
"Food"         188.847906

DuckDB:
SQL’s SELECT clause makes column selection intuitive for SQL users.
result_duckdb = duckdb.sql("""
SELECT category, amount FROM 'sales_data.csv'
""").df()
result_duckdb.head()

      category      amount
0         Food  516.653322
1        Books  937.337226
2  Electronics  450.941022
3         Food  674.488081
4         Food  188.847906

GroupBy Aggregation
pandas:
Uses a dictionary to specify aggregations, but returns multi-level column headers that often require flattening before further use.
result_pd = df_pd.groupby("category").agg({
"amount": ["sum", "mean"],
"quantity": "sum"
})
result_pd.head()

                   amount              quantity
                      sum        mean       sum
Books        6.247506e+08  499.998897  62463285
Clothing     6.253924e+08  500.139837  62505224
Electronics  6.244453e+08  499.938189  62484265
Food         6.254034e+08  499.916417  62577943
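
If downstream code needs flat column names, a common follow-up step is to join the levels yourself (a small sketch, not part of the original benchmark):

# Flatten the MultiIndex columns, e.g. ("amount", "sum") -> "amount_sum"
result_pd.columns = ["_".join(col).rstrip("_") for col in result_pd.columns]
result_pd.head()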

Polars:
Uses explicit alias() calls for each aggregation, producing flat column names directly without post-processing.
result_pl = df_pl.group_by("category").agg([
pl.col("amount").sum().alias("amount_sum"),
pl.col("amount").mean().alias("amount_mean"),
pl.col("quantity").sum().alias("quantity_sum"),
])
result_pl.head()

category       amount_sum  amount_mean  quantity_sum
str            f64         f64          i64
"Clothing"     6.2539e8    500.139837   62505224
"Books"        6.2475e8    499.998897   62463285
"Electronics"  6.2445e8    499.938189   62484265
"Food"         6.2540e8    499.916417   62577943

DuckDB:
Standard SQL aggregation with column aliases produces clean, flat output ready for downstream use.
result_duckdb = duckdb.sql("""
SELECT
category,
SUM(amount) as amount_sum,
AVG(amount) as amount_mean,
SUM(quantity) as quantity_sum
FROM 'sales_data.csv'
GROUP BY category
""").df()
result_duckdb.head()

      category    amount_sum  amount_mean  quantity_sum
0         Food  6.254034e+08   499.916417    62577943.0
1  Electronics  6.244453e+08   499.938189    62484265.0
2     Clothing  6.253924e+08   500.139837    62505224.0
3        Books  6.247506e+08   499.998897    62463285.0

Adding Columns
pandas:
The assign() method creates new columns with repeated DataFrame references like df_pd["amount"].
result_pd = df_pd.assign(
amount_with_tax=df_pd["amount"] * 1.1,
high_value=df_pd["amount"] > 500
)
result_pd.head()

category
region
amount
quantity
amount_with_tax
high_value

0
Food
South
516.653322
40
568.318654
True

1
Books
East
937.337226
45
1031.070948
True

2
Electronics
North
450.941022
93
496.035124
False

3
Food
East
674.488081
46
741.936889
True

4
Food
East
188.847906
98
207.732697
False

Polars:
The with_columns() method uses composable expressions that chain naturally without repeating the DataFrame name.
result_pl = df_pl.with_columns([
(pl.col("amount") * 1.1).alias("amount_with_tax"),
(pl.col("amount") > 500).alias("high_value")
])
result_pl.head()

category       region   amount      quantity  amount_with_tax  high_value
str            str      f64         i64       f64              bool
"Food"         "South"  516.653322  40        568.318654       true
"Books"        "East"   937.337226  45        1031.070948      true
"Electronics"  "North"  450.941022  93        496.035124       false
"Food"         "East"   674.488081  46        741.936889       true
"Food"         "East"   188.847906  98        207.732697       false

DuckDB:
SQL’s SELECT clause defines new columns directly in the query, keeping transformations readable.
result_duckdb = duckdb.sql("""
SELECT *,
amount * 1.1 as amount_with_tax,
amount > 500 as high_value
FROM df_pd
""").df()
result_duckdb.head()

      category region      amount  quantity  amount_with_tax  high_value
0         Food  South  516.653322        40       568.318654        True
1        Books   East  937.337226        45      1031.070948        True
2  Electronics  North  450.941022        93       496.035124       False
3         Food   East  674.488081        46       741.936889        True
4         Food   East  188.847906        98       207.732697       False

Conditional Logic
pandas:
Each additional condition in np.where() adds another nesting level. With three tiers, the final value is buried two levels deep:
import numpy as np

# Hard to read: "low" is nested inside two np.where() calls
result_np = df_pd.assign(
value_tier=np.where(
df_pd["amount"] > 700, "high",
np.where(df_pd["amount"] > 300, "medium", "low")
)
)
result_np[["category", "amount", "value_tier"]].head()

      category      amount value_tier
0         Food  516.653322     medium
1        Books  937.337226       high
2  Electronics  450.941022     medium
3         Food  674.488081     medium
4         Food  188.847906        low

For numeric binning, pd.cut() is cleaner:
result_pd = df_pd.assign(
value_tier=pd.cut(
df_pd["amount"],
bins=[-np.inf, 300, 700, np.inf],
labels=["low", "medium", "high"]
)
)
result_pd[["category", "amount", "value_tier"]].head()

      category      amount value_tier
0         Food  516.653322     medium
1        Books  937.337226       high
2  Electronics  450.941022     medium
3         Food  674.488081     medium
4         Food  188.847906        low

However, pd.cut() has drawbacks:

Only works for numeric ranges
Requires thinking in boundaries ([-inf, 300, 700, inf]) instead of conditions (amount > 700)
Needs numpy for open-ended bins

For non-numeric or mixed conditions, you’re back to np.where():
# "premium" if Electronics AND amount > 500 – pd.cut() can't do this
result = df_pd.assign(
tier=np.where(
(df_pd["category"] == "Electronics") & (df_pd["amount"] > 500),
"premium", "standard"
)
)
result.head()

      category region      amount  quantity      tier
0         Food  South  516.653322        40  standard
1        Books   East  937.337226        45  standard
2  Electronics  North  450.941022        93  standard
3         Food   East  674.488081        46  standard
4         Food   East  188.847906        98  standard

Polars:
The when().then().otherwise() chain solves both pandas problems: no nesting like np.where(), and works for any condition (not just numeric ranges like pd.cut()). The same syntax handles simple binning and complex mixed conditions:
result_pl = df_pl.with_columns(
pl.when(pl.col("amount") > 700).then(pl.lit("high"))
.when(pl.col("amount") > 300).then(pl.lit("medium"))
.otherwise(pl.lit("low"))
.alias("value_tier")
)
result_pl.select(["category", "amount", "value_tier"]).head()

category       amount      value_tier
str            f64         str
"Food"         516.653322  "medium"
"Books"        937.337226  "high"
"Electronics"  450.941022  "medium"
"Food"         674.488081  "medium"
"Food"         188.847906  "low"

DuckDB:
Standard SQL CASE WHEN syntax is immediately readable to anyone who knows SQL.
result_duckdb = duckdb.sql("""
SELECT category, amount,
CASE
WHEN amount > 700 THEN 'high'
WHEN amount > 300 THEN 'medium'
ELSE 'low'
END as value_tier
FROM df_pd
""").df()
result_duckdb.head()

      category      amount value_tier
0         Food  516.653322     medium
1        Books  937.337226       high
2  Electronics  450.941022     medium
3         Food  674.488081     medium
4         Food  188.847906        low

Window Functions
pandas:
Uses groupby().transform() which requires repeating the groupby clause for each calculation.
result_pd = df_pd.assign(
category_avg=df_pd.groupby("category")["amount"].transform("mean"),
category_rank=df_pd.groupby("category")["amount"].rank(ascending=False)
)
result_pd[["category", "amount", "category_avg", "category_rank"]].head()

      category      amount  category_avg  category_rank
0         Food  516.653322    499.916417       604342.0
1        Books  937.337226    499.998897        78423.0
2  Electronics  450.941022    499.938189       685881.0
3         Food  674.488081    499.916417       407088.0
4         Food  188.847906    499.916417      1015211.0

Polars:
The over() expression appends the partition to any calculation, avoiding repeated group definitions.
result_pl = df_pl.with_columns([
pl.col("amount").mean().over("category").alias("category_avg"),
pl.col("amount").rank(descending=True).over("category").alias("category_rank")
])
result_pl.select(["category", "amount", "category_avg", "category_rank"]).head()

category       amount      category_avg  category_rank
str            f64         f64           f64
"Food"         516.653322  499.916417    604342.0
"Books"        937.337226  499.998897    78423.0
"Electronics"  450.941022  499.938189    685881.0
"Food"         674.488081  499.916417    407088.0
"Food"         188.847906  499.916417    1015211.0

DuckDB:
SQL window functions with OVER (PARTITION BY …) are the industry standard for this type of calculation.
result_duckdb = duckdb.sql("""
SELECT category, amount,
AVG(amount) OVER (PARTITION BY category) as category_avg,
RANK() OVER (PARTITION BY category ORDER BY amount DESC) as category_rank
FROM df_pd
""").df()
result_duckdb.head()

   category      amount  category_avg  category_rank
0  Clothing  513.807166    500.139837         608257
1  Clothing  513.806596    500.139837         608258
2  Clothing  513.806515    500.139837         608259
3  Clothing  513.806063    500.139837         608260
4  Clothing  513.806056    500.139837         608261

Data Loading Performance
pandas reads CSV files on a single CPU core. Polars and DuckDB use multi-threaded execution, distributing the work across all available cores to read different parts of the file simultaneously.
pandas
Single-threaded CSV parsing loads data sequentially.
┌─────────────────────────────────────────────┐
│ CPU Core 1 │
│ ┌─────────────────────────────────────────┐ │
│ │ Chunk 1 → Chunk 2 → Chunk 3 → … → End │ │
│ └─────────────────────────────────────────┘ │
│ CPU Core 2 [idle] │
│ CPU Core 3 [idle] │
│ CPU Core 4 [idle] │
└─────────────────────────────────────────────┘

pandas_time = %timeit -o pd.read_csv("sales_data.csv")

1.05 s ± 26.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Polars
Multi-threaded parsing distributes file reading across all available cores.
┌─────────────────────────────────────────────┐
│ CPU Core 1 ┌────────────────┐ │
│ │ ████████████ │ │
│ CPU Core 2 ┌────────────────┐ │
│ │ ████████████ │ │
│ CPU Core 3 ┌────────────────┐ │
│ │ ████████████ │ │
│ CPU Core 4 ┌────────────────┐ │
│ │ ████████████ │ │
└─────────────────────────────────────────────┘

polars_time = %timeit -o pl.read_csv("sales_data.csv")

137 ms ± 34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

DuckDB
Similar to Polars, file reading is distributed across all available cores.
┌─────────────────────────────────────────────┐
│ CPU Core 1 ┌────────────────┐ │
│ │ ████████████ │ │
│ CPU Core 2 ┌────────────────┐ │
│ │ ████████████ │ │
│ CPU Core 3 ┌────────────────┐ │
│ │ ████████████ │ │
│ CPU Core 4 ┌────────────────┐ │
│ │ ████████████ │ │
└─────────────────────────────────────────────┘

duckdb_time = %timeit -o duckdb.sql("SELECT * FROM 'sales_data.csv'").df()

762 ms ± 77.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

print(f"Polars is {pandas_time.average / polars_time.average:.1f}× faster than pandas")
print(f"DuckDB is {pandas_time.average / duckdb_time.average:.1f}× faster than pandas")

Polars is 7.7× faster than pandas
DuckDB is 1.4× faster than pandas

While Polars leads with a 7.7× speedup in CSV reading, DuckDB’s 1.4× improvement shows parsing isn’t its focus. DuckDB shines when querying files directly or running complex analytical queries.
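To illustrate the file-querying side of that claim, here is a hedged sketch that converts the benchmark CSV to Parquet once and then queries the Parquet file in place (the Parquet file name is made up for this example):

import duckdb

# One-time conversion of the benchmark CSV to Parquet
duckdb.sql("COPY (SELECT * FROM 'sales_data.csv') TO 'sales_data.parquet' (FORMAT parquet)")

# Query the Parquet file directly, without loading it into a DataFrame first
duckdb.sql("""
    SELECT region, COUNT(*) AS orders, SUM(amount) AS total
    FROM 'sales_data.parquet'
    WHERE category = 'Electronics'
    GROUP BY region
""").df()
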
Query Optimization
pandas: No Optimization
pandas executes operations eagerly, creating intermediate DataFrames at each step. This wastes memory and prevents optimization.
┌─────────────────────────────────────────────────────────────┐
│ Step 1: Load ALL rows → 10M rows in memory │
│ Step 2: Filter (amount > 100) → 5M rows in memory │
│ Step 3: GroupBy → New DataFrame │
│ Step 4: Mean → Final result │
└─────────────────────────────────────────────────────────────┘
Memory: ████████████████████████████████ (high – stores all intermediates)

def pandas_query():
    return (
        pd.read_csv("sales_data.csv")
        .query('amount > 100')
        .groupby('category')['amount']
        .mean()
    )

pandas_opt_time = %timeit -o pandas_query()

1.46 s ± 88.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This approach has three problems:

Full CSV load: All rows are read before filtering
No predicate pushdown: Rows are filtered after loading the entire file into memory
No projection pushdown: All columns are loaded, even unused ones

You can manually add usecols to load fewer columns:
def pandas_query_optimized():
    return (
        pd.read_csv("sales_data.csv", usecols=["category", "amount"])
        .query('amount > 100')
        .groupby('category')['amount']
        .mean()
    )

pandas_usecols_time = %timeit -o pandas_query_optimized()

1.06 s ± 48.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This is faster, but has two drawbacks:

Manual tracking: You must specify columns yourself; change the query, update usecols
No row filtering: All rows still load before the filter applies

Polars and DuckDB handle both automatically by analyzing your query before execution.
Polars: Lazy Evaluation
Polars supports lazy evaluation, which builds a query plan and optimizes it before execution:
┌─────────────────────────────────────────────────────────────┐
│ Query Plan Built: │
│ scan_csv → filter → group_by → agg │
│ │
│ Optimizations Applied: │
│ • Predicate pushdown (filter during scan) │
│ • Projection pushdown (read only needed columns) │
│ • Multi-threaded execution (parallel across CPU cores) │
└─────────────────────────────────────────────────────────────┘
Memory: ████████ (low – no intermediate DataFrames)

query_pl = (
pl.scan_csv("sales_data.csv")
.filter(pl.col("amount") > 100)
.group_by("category")
.agg(pl.col("amount").mean().alias("avg_amount"))
)

# View the optimized query plan
print(query_pl.explain())

AGGREGATE[maintain_order: false]
[col("amount").mean().alias("avg_amount")] BY [col("category")]
FROM
Csv SCAN [sales_data.csv] [id: 4687118704]
PROJECT 2/4 COLUMNS
SELECTION: [(col("amount")) > (100.0)]

The query plan shows these optimizations:

Predicate pushdown: SELECTION filters during scan, not after loading
Projection pushdown: PROJECT 2/4 COLUMNS reads only what’s needed
Operation reordering: Aggregate runs on filtered data, not the full dataset

Execute the optimized query:
def polars_query():
    return (
        pl.scan_csv("sales_data.csv")
        .filter(pl.col("amount") > 100)
        .group_by("category")
        .agg(pl.col("amount").mean().alias("avg_amount"))
        .collect()
    )

polars_opt_time = %timeit -o polars_query()

148 ms ± 32.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

DuckDB: SQL Optimizer
DuckDB’s SQL optimizer applies similar optimizations automatically:
┌─────────────────────────────────────────────────────────────┐
│ Query Plan Built: │
│ SQL → Parser → Optimizer → Execution Plan │
│ │
│ Optimizations Applied: │
│ • Predicate pushdown (WHERE during scan) │
│ • Projection pushdown (SELECT only needed columns) │
│ • Vectorized execution (process 1024 rows per batch) │
└─────────────────────────────────────────────────────────────┘
Memory: ████████ (low – streaming execution)

def duckdb_query():
    return duckdb.sql("""
        SELECT category, AVG(amount) as avg_amount
        FROM 'sales_data.csv'
        WHERE amount > 100
        GROUP BY category
    """).df()

duckdb_opt_time = %timeit -o duckdb_query()

245 ms ± 12.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Let’s compare the performance of the optimized queries:
print(f"Polars is {pandas_opt_time.average / polars_opt_time.average:.1f}× faster than pandas")
print(f"DuckDB is {pandas_opt_time.average / duckdb_opt_time.average:.1f}× faster than pandas")

Polars is 9.9× faster than pandas
DuckDB is 6.0× faster than pandas

Polars outperforms DuckDB (9.9× vs 6.0×) in this benchmark because its Rust-based engine handles the filter-then-aggregate pattern efficiently. DuckDB’s strength lies in complex SQL queries with joins and subqueries.
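To illustrate the join-and-subquery side of that claim, here is a hedged sketch that joins the in-memory df_pd against a small made-up dimension table and filters with a scalar subquery:

import duckdb
import pandas as pd

# Hypothetical dimension table used only for this example
regions = pd.DataFrame({
    "region": ["North", "South", "East", "West"],
    "manager": ["Ann", "Bob", "Cara", "Dev"],
})

duckdb.sql("""
    SELECT r.manager, s.category, AVG(s.amount) AS avg_amount
    FROM df_pd AS s
    JOIN regions AS r ON s.region = r.region
    WHERE s.amount > (SELECT AVG(amount) FROM df_pd)
    GROUP BY r.manager, s.category
    ORDER BY avg_amount DESC
""").df()
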
GroupBy Performance
Computing aggregates requires scanning every row, a workload that scales linearly with CPU cores. This makes groupby operations the clearest test of parallel execution.
Let’s load the data for the groupby benchmarks:
# Load data for fair comparison
df_pd = pd.read_csv("sales_data.csv")
df_pl = pl.read_csv("sales_data.csv")

pandas: Single-Threaded
pandas processes groupby operations on a single CPU core, which becomes a bottleneck on large datasets.
┌─────────────────────────────────────────────────────────────┐
│ CPU Core 1 │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Group A → Group B → Group C → Group D → … → Aggregate │ │
│ └─────────────────────────────────────────────────────────┘ │
│ CPU Core 2 [idle] │
│ CPU Core 3 [idle] │
│ CPU Core 4 [idle] │
└─────────────────────────────────────────────────────────────┘

def pandas_groupby():
return df_pd.groupby("category")["amount"].mean()

pandas_groupby_time = %timeit -o pandas_groupby()

271 ms ± 135 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Polars: Multi-Threaded
Polars splits data across cores, computes partial aggregates in parallel, then merges the results.
┌─────────────────────────────────────────────────────────────┐
│ CPU Core 1 ┌──────────────┐ │
│ │ ████████████ │ → Partial Aggregate │
│ CPU Core 2 ┌──────────────┐ │
│ │ ████████████ │ → Partial Aggregate │
│ CPU Core 3 ┌──────────────┐ │
│ │ ████████████ │ → Partial Aggregate │
│ CPU Core 4 ┌──────────────┐ │
│ │ ████████████ │ → Partial Aggregate │
│ ↓ │
│ Final Merge → Result │
└─────────────────────────────────────────────────────────────┘

def polars_groupby():
return df_pl.group_by("category").agg(pl.col("amount").mean())

polars_groupby_time = %timeit -o polars_groupby()

31.1 ms ± 3.65 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

DuckDB: Multi-Threaded
Similar to Polars, DuckDB splits data across cores, computes partial aggregates in parallel, then merges the results.
┌─────────────────────────────────────────────────────────────┐
│ CPU Core 1 ┌──────────────┐ │
│ │ ████████████ │ → Partial Aggregate │
│ CPU Core 2 ┌──────────────┐ │
│ │ ████████████ │ → Partial Aggregate │
│ CPU Core 3 ┌──────────────┐ │
│ │ ████████████ │ → Partial Aggregate │
│ CPU Core 4 ┌──────────────┐ │
│ │ ████████████ │ → Partial Aggregate │
│ ↓ │
│ Final Merge → Result │
└─────────────────────────────────────────────────────────────┘

def duckdb_groupby():
return duckdb.sql("""
SELECT category, AVG(amount)
FROM df_pd
GROUP BY category
""").df()

duckdb_groupby_time = %timeit -o duckdb_groupby()

29 ms ± 3.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

print(f"Polars is {pandas_groupby_time.average / polars_groupby_time.average:.1f}× faster than pandas")
print(f"DuckDB is {pandas_groupby_time.average / duckdb_groupby_time.average:.1f}× faster than pandas")

Polars is 8.7× faster than pandas
DuckDB is 9.4× faster than pandas

DuckDB and Polars perform similarly (9.4× vs 8.7×), both leveraging parallel execution. DuckDB’s slight edge comes from late materialization and vector-at-a-time pipelined execution, which avoids creating intermediate results that Polars may still materialize for some operations.
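You can peek at the plan DuckDB builds for this aggregation with SQL's EXPLAIN, analogous to the Polars explain() call earlier. A minimal sketch against the same sales_data.csv (the exact plan text varies by DuckDB version):
import duckdb

# Print DuckDB's physical plan instead of executing the aggregation
print(duckdb.sql("""
    EXPLAIN
    SELECT category, AVG(amount) AS avg_amount
    FROM 'sales_data.csv'
    WHERE amount > 100
    GROUP BY category
"""))

The plan typically shows the filter attached to the scan operator and only the two needed columns being read, mirroring the predicate and projection pushdown that Polars reported above.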
Memory Efficiency
pandas: Full Memory Load
pandas loads the entire dataset into RAM:
┌─────────────────────────────────────────────────────────────┐
│ RAM │
│ ┌────────────────────────────────────────────────────────┐ │
│ │████████████████████████████████████████████████████████│ │
│ │██████████████████ ALL 10M ROWS ████████████████████████│ │
│ │████████████████████████████████████████████████████████│ │
│ └────────────────────────────────────────────────────────┘ │
│ Usage: 707,495 KB (entire dataset in memory) │
└─────────────────────────────────────────────────────────────┘

df_pd_mem = pd.read_csv("sales_data.csv")
pandas_mem = df_pd_mem.memory_usage(deep=True).sum() / 1e3
print(f"pandas memory usage: {pandas_mem:,.0f} KB")

pandas memory usage: 707,495 KB

For larger-than-RAM datasets, pandas throws an out-of-memory error.
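The usual pandas workaround is manual chunking with read_csv's chunksize parameter, which keeps memory bounded but pushes the aggregation bookkeeping onto you. A rough sketch (not part of the benchmark):
import pandas as pd

# Stream the CSV in 1M-row chunks and combine partial sums/counts by hand
partials = []
for chunk in pd.read_csv("sales_data.csv", chunksize=1_000_000):
    filtered = chunk[chunk["amount"] > 100]
    partials.append(filtered.groupby("category")["amount"].agg(["sum", "count"]))

combined = pd.concat(partials).groupby(level=0).sum()
avg_amount = combined["sum"] / combined["count"]
print(avg_amount)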
Polars: Streaming Mode
Polars can process data in streaming mode, handling chunks without loading everything:
┌─────────────────────────────────────────────────────────────┐
│ RAM │
│ ┌────────────────────────────────────────────────────────┐ │
│ │█ │ │
│ │ (result only) │ │
│ │ │ │
│ └────────────────────────────────────────────────────────┘ │
│ Usage: 0.06 KB (streams chunks, keeps only result) │
└─────────────────────────────────────────────────────────────┘

result_pl_stream = (
pl.scan_csv("sales_data.csv")
.group_by("category")
.agg(pl.col("amount").mean())
.collect(streaming=True)
)

polars_mem = result_pl_stream.estimated_size() / 1e3
print(f"Polars result memory: {polars_mem:.2f} KB")

Polars result memory: 0.06 KB

For larger-than-RAM files, use sink_parquet instead of collect(). It writes results directly to disk as chunks are processed, never holding the full dataset in memory:
(
pl.scan_csv("sales_data.csv")
.filter(pl.col("amount") > 500)
.sink_parquet("filtered_sales.parquet")
)

DuckDB: Automatic Spill-to-Disk
DuckDB automatically writes intermediate results to temporary files when data exceeds available RAM:
┌─────────────────────────────────────────────────────────────┐
│ RAM Disk (if needed) │
│ ┌──────────────────────────┐ ┌──────────────────────┐ │
│ │█ │ │░░░░░░░░░░░░░░░░░░░░░░│ │
│ │ (up to 500MB) │ → │ (overflow here) │ │
│ │ │ │ │ │
│ └──────────────────────────┘ └──────────────────────┘ │
│ Usage: 0.42 KB (spills to disk when RAM full) │
└─────────────────────────────────────────────────────────────┘

# Configure memory limit and temp directory
duckdb.sql("SET memory_limit = '500MB'")
duckdb.sql("SET temp_directory = '/tmp/duckdb_temp'")

# DuckDB handles larger-than-RAM automatically
result_duckdb_mem = duckdb.sql("""
SELECT category, AVG(amount) as avg_amount
FROM 'sales_data.csv'
GROUP BY category
""").df()

duckdb_mem = result_duckdb_mem.memory_usage(deep=True).sum() / 1e3
print(f"DuckDB result memory: {duckdb_mem:.2f} KB")

DuckDB result memory: 0.42 KB

DuckDB’s out-of-core processing makes it ideal for embedded analytics where memory is limited.
print(f"pandas: {pandas_mem:,.0f} KB (full dataset)")
print(f"Polars: {polars_mem:.2f} KB (result only)")
print(f"DuckDB: {duckdb_mem:.2f} KB (result only)")
print(f"\nPolars uses {pandas_mem / polars_mem:,.0f}× less memory than pandas")
print(f"DuckDB uses {pandas_mem / duckdb_mem:,.0f}× less memory than pandas")

pandas: 707,495 KB (full dataset)
Polars: 0.06 KB (result only)
DuckDB: 0.42 KB (result only)

Polars uses 11,791,583× less memory than pandas
DuckDB uses 1,684,512× less memory than pandas

The million-fold reduction comes from streaming: Polars and DuckDB process data in chunks and only keep the 4-row result in memory, while pandas must hold all 10 million rows to compute the same aggregation.
Join Operations
Joining tables is one of the most common operations in data analysis. Let’s compare how each tool handles a left join between 1 million orders and 100K customers.
Let’s create two tables for join benchmarking:
# Create orders table (1M rows)
orders_pd = pd.DataFrame({
"order_id": range(1_000_000),
"customer_id": np.random.randint(1, 100_000, size=1_000_000),
"amount": np.random.rand(1_000_000) * 500
})

# Create customers table (100K rows)
customers_pd = pd.DataFrame({
"customer_id": range(100_000),
"region": np.random.choice(["North", "South", "East", "West"], size=100_000)
})

# Convert to Polars
orders_pl = pl.from_pandas(orders_pd)
customers_pl = pl.from_pandas(customers_pd)

pandas: Single-Threaded
pandas processes the join on a single CPU core.
┌─────────────────────────────────────────────┐
│ CPU Core 1 │
│ ┌─────────────────────────────────────────┐ │
│ │ Row 1 → Row 2 → Row 3 → … → Row 1M │ │
│ └─────────────────────────────────────────┘ │
│ CPU Core 2 [idle] │
│ CPU Core 3 [idle] │
│ CPU Core 4 [idle] │
└─────────────────────────────────────────────┘

def pandas_join():
return orders_pd.merge(customers_pd, on="customer_id", how="left")

pandas_join_time = %timeit -o pandas_join()

60.4 ms ± 6.98 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Polars: Multi-Threaded
Polars distributes the join across all available CPU cores.
┌─────────────────────────────────────────────┐
│ CPU Core 1 ┌────────────────┐ │
│ │ ████████████ │ │
│ CPU Core 2 ┌────────────────┐ │
│ │ ████████████ │ │
│ CPU Core 3 ┌────────────────┐ │
│ │ ████████████ │ │
│ CPU Core 4 ┌────────────────┐ │
│ │ ████████████ │ │
└─────────────────────────────────────────────┘

def polars_join():
return orders_pl.join(customers_pl, on="customer_id", how="left")

polars_join_time = %timeit -o polars_join()

11.8 ms ± 6.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

DuckDB: Multi-Threaded
Similar to Polars, DuckDB distributes the join across all available CPU cores.
┌─────────────────────────────────────────────┐
│ CPU Core 1 ┌────────────────┐ │
│ │ ████████████ │ │
│ CPU Core 2 ┌────────────────┐ │
│ │ ████████████ │ │
│ CPU Core 3 ┌────────────────┐ │
│ │ ████████████ │ │
│ CPU Core 4 ┌────────────────┐ │
│ │ ████████████ │ │
└─────────────────────────────────────────────┘

def duckdb_join():
return duckdb.sql("""
SELECT o.*, c.region
FROM orders_pd o
LEFT JOIN customers_pd c ON o.customer_id = c.customer_id
""").df()

duckdb_join_time = %timeit -o duckdb_join()

55.7 ms ± 1.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Let’s compare the performance of the joins:
print(f"Polars is {pandas_join_time.average / polars_join_time.average:.1f}× faster than pandas")
print(f"DuckDB is {pandas_join_time.average / duckdb_join_time.average:.1f}× faster than pandas")

Polars is 5.1× faster than pandas
DuckDB is 1.1× faster than pandas

Polars delivers a 5.1× speedup while DuckDB shows only 1.1× improvement. Both tools use multi-threading, but Polars’ join algorithm and native DataFrame output avoid the conversion overhead that DuckDB incurs when returning results via .df().
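To check how much of that gap is conversion overhead, you can ask DuckDB for a Polars DataFrame (or an Arrow table) instead of calling .df(). A hedged sketch, timed the same way as the version above:
def duckdb_join_pl():
    # .pl() returns a Polars DataFrame, skipping the pandas materialization step
    return duckdb.sql("""
        SELECT o.*, c.region
        FROM orders_pd o
        LEFT JOIN customers_pd c ON o.customer_id = c.customer_id
    """).pl()

duckdb_join_pl_time = %timeit -o duckdb_join_pl()

If the .pl() version closes most of the gap, the join itself is competitive and the remaining difference is mostly output conversion.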
Interoperability
All three tools work together seamlessly. Use each tool for what it does best in a single pipeline.
pandas DataFrame to DuckDB
Query pandas DataFrames directly with SQL:
df = pd.DataFrame({
"product": ["A", "B", "C"],
"sales": [100, 200, 150]
})

# DuckDB queries pandas DataFrames by variable name
result = duckdb.sql("SELECT * FROM df WHERE sales > 120").df()
print(result)

product sales
0 B 200
1 C 150

Polars to pandas
Convert Polars DataFrames when ML libraries require pandas:
df_polars = pl.DataFrame({
"feature1": [1, 2, 3],
"feature2": [4, 5, 6],
"target": [0, 1, 0]
})

# Convert to pandas for scikit-learn
df_pandas = df_polars.to_pandas()
print(type(df_pandas))

<class 'pandas.core.frame.DataFrame'>

DuckDB to Polars
Get query results as Polars DataFrames:
result = duckdb.sql("""
SELECT category, SUM(amount) as total
FROM 'sales_data.csv'
GROUP BY category
""").pl()

print(type(result))
print(result)

<class 'polars.dataframe.frame.DataFrame'>
shape: (4, 2)
┌─────────────┬──────────┐
│ category ┆ total │
│ — ┆ — │
│ str ┆ f64 │
╞═════════════╪══════════╡
│ Electronics ┆ 6.2445e8 │
│ Food ┆ 6.2540e8 │
│ Clothing ┆ 6.2539e8 │
│ Books ┆ 6.2475e8 │
└─────────────┴──────────┘

Combined Pipeline Example
Each tool has a distinct strength: DuckDB optimizes SQL queries, Polars parallelizes transformations, and pandas integrates with ML libraries. Combine them in a single pipeline to leverage all three:
# Step 1: DuckDB for initial SQL query
aggregated = duckdb.sql("""
SELECT category, region,
SUM(amount) as total_amount,
COUNT(*) as order_count
FROM 'sales_data.csv'
GROUP BY category, region
""").pl()

# Step 2: Polars for additional transformations
enriched = (
aggregated
.with_columns([
(pl.col("total_amount") / pl.col("order_count")).alias("avg_order_value"),
pl.col("category").str.to_uppercase().alias("category_upper")
])
.filter(pl.col("order_count") > 100000)
)

# Step 3: Convert to pandas for visualization or ML
final_df = enriched.to_pandas()
print(final_df.head())

category region total_amount order_count avg_order_value category_upper
0 Food East 1.563586e+08 312918 499.679004 FOOD
1 Food North 1.563859e+08 312637 500.215456 FOOD
2 Clothing North 1.560532e+08 311891 500.345286 CLOTHING
3 Clothing East 1.565054e+08 312832 500.285907 CLOTHING
4 Food West 1.560994e+08 312662 499.259318 FOOD

📖 Related: For writing functions that work across pandas, Polars, and PySpark without conversion, see Unified DataFrame Functions.

Decision Matrix
No single tool wins in every scenario. Use these tables to choose the right tool for your workflow.
Performance Summary
Benchmark results from 10 million rows on a single machine:

Operation             pandas     Polars                 DuckDB
CSV Read (10M rows)   1.05s      137ms                  762ms
GroupBy               271ms      31ms                   29ms
Join (1M rows)        60ms       12ms                   56ms
Memory Usage          707 MB     0.06 KB (streaming)    0.42 KB (spill-to-disk)

Polars leads in CSV reading (7.7× faster than pandas) and joins (5× faster). DuckDB matches Polars in groupby performance and uses the least memory with automatic spill-to-disk.
Feature Comparison
Each tool makes different trade-offs between speed, memory, and ecosystem integration:

Feature              pandas      Polars      DuckDB
Multi-threading      No          Yes         Yes
Lazy evaluation      No          Yes         N/A (SQL)
Query optimization   No          Yes         Yes
Larger-than-RAM      No          Streaming   Spill-to-disk
SQL interface        No          Limited     Native
ML integration       Excellent   Good        Limited

pandas lacks the performance features that make Polars and DuckDB fast, but remains essential for ML workflows. Choose between Polars and DuckDB based on whether you prefer DataFrame chaining or SQL syntax.
Recommendations
The best tool depends on your data size, workflow preferences, and constraints:

Small data (<1M rows): Use pandas for simplicity
Large data (1M-100M rows): Use Polars or DuckDB for 5-10× speedup
SQL-preferred workflow: Use DuckDB
DataFrame-preferred workflow: Use Polars
Memory-constrained: Use Polars (streaming) or DuckDB (spill-to-disk)
ML pipeline integration: Use pandas (convert from Polars/DuckDB as needed)
Production data pipelines: Use Polars (DataFrame) or DuckDB (SQL) based on team preference

Final Thoughts
If your codebase is written in pandas, you don't need to rewrite everything. Migrate where it matters:

Profile first: Find which pandas operations are slow
Replace with Polars: CSV reads, groupbys, and joins see the biggest gains
Add DuckDB: When SQL is cleaner than chained DataFrame operations

Keep pandas for final ML steps. Convert with df.to_pandas() when needed.
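As a concrete example of migrating only the hot path, here's a sketch that moves the CSV read and aggregation to Polars and hands the small result back to pandas for downstream ML or plotting (assuming the same sales_data.csv used above):
import polars as pl

# Heavy lifting in Polars: lazy scan, filter, aggregate
features = (
    pl.scan_csv("sales_data.csv")
    .filter(pl.col("amount") > 100)
    .group_by("category")
    .agg(pl.col("amount").mean().alias("avg_amount"))
    .collect()
)

# Hand the small result back to pandas for the ML code you already have
features_pd = features.to_pandas()
print(features_pd.head())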
Related Resources

Polars vs. Pandas: A Fast, Multi-Core Alternative for DataFrames
A Deep Dive into DuckDB for Data Scientists
Scaling Pandas Workflows with PySpark’s Pandas API
Delta Lake: Transform pandas Prototypes into Production

📚 Want to go deeper? Learning new techniques is the easy part. Knowing how to structure, test, and deploy them is what separates side projects from real work. My book shows you how to build data science projects that actually make it to production. Get the book →


pandas vs Polars vs DuckDB: A Data Scientist’s Guide to Choosing the Right Tool Read More »

Coiled: Scale Python Data Pipeline to the Cloud in Minutes

Table of Contents

Introduction
What is Coiled?
Setup
Serverless Functions: Process Data with Any Framework
When You Need More: Distributed Clusters with Dask
Cost Optimization
Environment Synchronization
Conclusion

Introduction
A common challenge data scientists face is that local machines simply can’t handle large-scale datasets. Once your analysis reaches the 50GB+ range, you’re pushed into difficult choices:

Sample your data and hope patterns hold
Buy more RAM or rent expensive cloud VMs
Learn Kubernetes and spend days configuring clusters
Build Docker images and manage container registries

Each option adds complexity, cost, or compromises your analysis quality.
Coiled eliminates these tradeoffs. It provisions ephemeral compute clusters on AWS, GCP, or Azure using simple Python APIs. You get distributed computing power without DevOps expertise, automatic environment synchronization without Docker, and 70% cost savings through smart spot instance management.
In this article, you’ll learn how to scale Python data workflows to the cloud with Coiled:

Serverless functions: Run pandas, Polars, or DuckDB code on cloud VMs with a simple decorator
Parallel processing: Process multiple files simultaneously across cloud machines
Distributed clusters: Aggregate data across files using managed Dask clusters
Environment sync: Replicate your local packages to the cloud without Docker
Cost optimization: Reduce cloud spending with spot instances and auto-scaling

💻 Get the Code: The complete source code and Jupyter notebook for this tutorial are available on GitHub. Clone it to follow along!


What is Coiled?
Coiled is a lightweight cloud platform that runs Python code on powerful cloud infrastructure without requiring Docker or Kubernetes knowledge. It supports four main capabilities:

Batch Jobs: Submit and run Python scripts asynchronously on cloud infrastructure
Serverless Functions: Execute Python functions (Pandas, Polars, PyTorch) on cloud VMs with decorators
Dask Clusters: Provision multi-worker clusters for distributed computing
Jupyter Notebooks: Launch interactive Jupyter servers directly on cluster schedulers

Key features across all of these capabilities:

Framework-Agnostic: Works with Pandas, Polars, Dask, or any Python library
Automatic Package Sync: Local packages replicate to cloud workers without Docker
Cost Optimization: Spot instances, adaptive scaling, and auto-shutdown reduce spending
Simple APIs: Decorate functions or create clusters with 2-3 lines of code

To install Coiled and Dask, run:
pip install coiled dask[complete]

Setup
First, create a free Coiled account by running this command in your terminal:
coiled login

This creates a free Coiled Hosted account with 200 CPU-hours per month. Your code runs on Coiled’s cloud infrastructure with no AWS/GCP/Azure account needed.
Note: For production workloads with your own cloud account, run coiled setup PROVIDER (setup guide).
Serverless Functions: Process Data with Any Framework
The simplest way to scale Python code to the cloud is with serverless functions. Decorate any function with @coiled.function, and Coiled handles provisioning cloud VMs, installing packages, and executing your code.
Scale Beyond Laptop Memory with Cloud VMs
Imagine you need to process an NYC Taxi dataset with 12GB of compressed files (50GB+ expanded) on a laptop with only 16GB of RAM. Your machine simply doesn’t have enough memory to handle this workload.
With Coiled, you can run the exact same code on a cloud VM with 64GB of RAM by simply adding the @coiled.function decorator.
import coiled
import pandas as pd

@coiled.function(
memory="64 GiB",
region="us-east-1"
)
def process_month_with_pandas(month):
# Read 12GB file directly into pandas
df = pd.read_parquet(
f"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-{month:02d}.parquet"
)

# Compute tipping patterns by hour
df["hour"] = df["tpep_pickup_datetime"].dt.hour
result = df.groupby("hour")["tip_amount"].mean()

return result

# Run on cloud VM with 64GB RAM
january_tips = process_month_with_pandas(1)
print(january_tips)

Output:
hour
0 3.326543
1 2.933899
2 2.768246
3 2.816333
4 3.132973

Name: tip_amount, dtype: float64

This function runs on a cloud VM with 64GB RAM, processes the entire month in memory, and returns just the aggregated result to your laptop.
You can view the function’s execution progress and resource usage in the Coiled dashboard.

Parallel Processing with .map()
By default Coiled Functions will run sequentially, just like normal Python functions. However, they can also easily run in parallel by using the .map() method.
Process all 12 months in parallel using .map():
import coiled
import pandas as pd

@coiled.function(memory="64 GiB", region="us-east-1")
def process_month(month):
df = pd.read_parquet(
f"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-{month:02d}.parquet"
)
return df["tip_amount"].mean()

# Process 12 months in parallel on 12 cloud VMs
months = range(1, 13)
monthly_tips = list(process_month.map(months))

print("Average tips by month:", monthly_tips)

Output:
Average tips by month: [2.65, 2.58, 2.72, 2.68, 2.75, 2.81, 2.79, 2.73, 2.69, 2.71, 2.66, 2.62]

When you call .map() with 12 months, Coiled spins up 12 cloud VMs simultaneously, runs process_month() on each VM with a different month, then returns all results.
The execution flow:
VM 1: yellow_tripdata_2024-01.parquet → compute mean → 2.65
VM 2: yellow_tripdata_2024-02.parquet → compute mean → 2.58
VM 3: yellow_tripdata_2024-03.parquet → compute mean → 2.72
… (all running in parallel)
VM 12: yellow_tripdata_2024-12.parquet → compute mean → 2.62

Coiled collects: [2.65, 2.58, 2.72, …, 2.62]

Each VM works in complete isolation with no data sharing or coordination between them.

The dashboard confirms 12 tasks were executed, matching the 12 months we passed to .map().
Framework-Agnostic: Use Any Python Library
Coiled Functions aren’t limited to pandas. You can use any Python library (Polars, DuckDB, PyTorch, scikit-learn) without any additional configuration. The automatic package synchronization works for all dependencies.
Example with Polars:
Polars is a fast DataFrame library optimized for performance. It works seamlessly with Coiled:
import coiled
import polars as pl

@coiled.function(memory="64 GiB", region="us-east-1")
def process_with_polars(month):
df = pl.read_parquet(
f"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-{month:02d}.parquet"
)
return (
df
.filter(pl.col("tip_amount") > 0)
.group_by("PULocationID")
.agg(pl.col("tip_amount").mean())
.sort("tip_amount", descending=True)
.head(5)
)

result = process_with_polars(1)
print(result)

Output:
shape: (5, 2)
┌──────────────┬────────────┐
│ PULocationID ┆ tip_amount │
│ — ┆ — │
│ i64 ┆ f64 │
╞══════════════╪════════════╡
│ 138 ┆ 4.52 │
│ 230 ┆ 4.23 │
│ 161 ┆ 4.15 │
│ 234 ┆ 3.98 │
│ 162 ┆ 3.87 │
└──────────────┴────────────┘

Example with DuckDB:
DuckDB provides fast SQL analytics directly on Parquet files:
import coiled
import duckdb

@coiled.function(memory="64 GiB", region="us-east-1")
def query_with_duckdb(month):
con = duckdb.connect()
result = con.execute(f"""
SELECT
DATE_TRUNC('hour', tpep_pickup_datetime) as pickup_hour,
AVG(tip_amount) as avg_tip,
COUNT(*) as trip_count
FROM 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-{month:02d}.parquet'
WHERE tip_amount > 0
GROUP BY pickup_hour
ORDER BY avg_tip DESC
LIMIT 5
""").fetchdf()
return result

result = query_with_duckdb(1)
print(result)

Output:
pickup_hour avg_tip trip_count
0 2024-01-15 14:00:00 4.23 15234
1 2024-01-20 18:00:00 4.15 18456
2 2024-01-08 12:00:00 3.98 12789
3 2024-01-25 16:00:00 3.87 14567
4 2024-01-12 20:00:00 3.76 16234

Coiled automatically detects your local Polars and DuckDB installations and replicates them to cloud VMs. No manual configuration needed.
When You Need More: Distributed Clusters with Dask
Serverless functions work great for independent file processing. However, when you need to combine and aggregate data across all your files into a single result, you need a Dask cluster.
For example, suppose you want to calculate total revenue by pickup location across all 12 months of data. With Coiled Functions, each VM processes one month independently:
@coiled.function(memory="64 GiB", region="us-east-1")
def get_monthly_revenue_by_location(month):
df = pd.read_parquet(
f"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-{month:02d}.parquet"
)
return df.groupby("PULocationID")["total_amount"].sum()

# This returns 12 separate DataFrames, one per month
results = list(get_monthly_revenue_by_location.map(range(1, 13)))
print(f'Number of DataFrames: {len(results)}')

Output:
Number of DataFrames: 12

The problem is that you get 12 separate DataFrames that you need to manually combine.
Here’s what happens: VM 1 processes January and returns a DataFrame like:
PULocationID total_amount
138 15000
230 22000

VM 2 processes February and returns:
PULocationID total_amount
138 18000
230 19000

Each VM works independently and has no knowledge of the other months’ data. To get yearly totals per location, you’d need to write code to merge these 12 DataFrames and sum the revenue for each location.
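For reference, that manual merge step is a few lines of pandas; a sketch assuming results holds the 12 per-month outputs from .map() above:
import pandas as pd

# Stack the 12 per-month results, then sum revenue per pickup location
yearly_revenue = (
    pd.concat(results)      # one long Series indexed by PULocationID
    .groupby(level=0)       # regroup by location across months
    .sum()
    .sort_values(ascending=False)
)
print(yearly_revenue.head())

It works, but you now own the combine logic; the Dask version below pushes that coordination onto the cluster.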
With a Dask cluster, workers coordinate to give you one global result:
import coiled
import dask.dataframe as dd

# For production workloads, you can scale to 50+ workers
cluster = coiled.Cluster(n_workers=3, region="us-east-1")
client = cluster.get_client()  # connect Dask to the Coiled cluster so .compute() runs there

# Read all 12 months of 2024 data
files = [
f"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-{month:02d}.parquet"
for month in range(1, 13)
]
df = dd.read_parquet(files) # Lazy: builds a plan, doesn't load data yet

# This returns ONE DataFrame with total revenue per location across all months
total_revenue = (
df.groupby("PULocationID")["total_amount"].sum().compute()
) # Executes the plan

total_revenue.head()

Output:
PULocationID
1 563645.70
2 3585.80
3 67261.41
4 1687265.08
5 602.98
Name: total_amount, dtype: float64

You can see that we got a single DataFrame with the total revenue per location across all months.
Here is what happens under the hood:
When you call .compute(), Dask executes the plan in four steps:
Step 1: Data Distribution
├─ Worker 1: [Jan partitions 1-3, Apr partitions 1-2, Jul partitions 1-3]
├─ Worker 2: [Feb partitions 1-4, May partitions 1-3, Aug partitions 1-2]
└─ Worker 3: [Mar partitions 1-3, Jun partitions 1-2, Sep-Dec partitions]

Step 2: Local Aggregation (each worker groups its data)
├─ Worker 1: {location_138: $45,000, location_230: $63,000}
├─ Worker 2: {location_138: $38,000, location_230: $55,000}
└─ Worker 3: {location_138: $50,000, location_230: $62,000}

Step 3: Shuffle (redistribute so each location lives on one worker)
├─ Worker 1: All location_230 data → $63,000 + $55,000 + $62,000
├─ Worker 2: All location_138 data → $45,000 + $38,000 + $50,000
└─ Worker 3: All other locations…

Step 4: Final Result
location_138: $133,000 (yearly total)
location_230: $180,000 (yearly total)

This shuffle-and-combine process is what makes Dask different from Coiled Functions. Workers actively coordinate and share data to produce one unified result.
Cost Optimization
Cloud costs can spiral quickly. Coiled provides three mechanisms to reduce spending:
1. Spot Instances
You can reduce cloud costs by 60-90% using spot instances. These are discounted servers that cloud providers can reclaim when demand increases. When an interruption occurs, Coiled:

Gracefully shuts down the affected worker
Redistributes its work to healthy workers
Automatically launches a replacement worker

cluster = coiled.Cluster(
n_workers=50,
spot_policy="spot_with_fallback", # Use spot instances with on-demand backup
region="us-east-1"
)

Cost comparison for m5.xlarge instances:

On-demand: $0.192/hour
Spot: $0.05/hour
Savings: 74%

For a 100-worker cluster:

On-demand: $19.20/hour = $460/day
Spot: $5.00/hour = $120/day

2. Adaptive Scaling
Adaptive scaling automatically adds workers when you have more work and removes them when idle, so you only pay for what you need. Coiled enables this with the adapt() method:
cluster = coiled.Cluster(region="us-east-1")
cluster.adapt(minimum=10, maximum=50) # Scale between 10-50 workers

Serverless functions also support auto-scaling by specifying a worker range:
@coiled.function(n_workers=[10, 300])
def process_data(files):
return results

This saves money during light workloads while delivering performance during heavy computation. No manual monitoring required.
3. Automatic Shutdown
To prevent paying for unused resources, Coiled automatically shuts down clusters after 20 minutes of inactivity by default. You can customize this with the idle_timeout parameter:
cluster = coiled.Cluster(
n_workers=20,
region="us-east-1",
idle_timeout="1 hour" # Keep cluster alive for longer workloads
)

This prevents the common mistake of leaving clusters running overnight.
Environment Synchronization
The “Works on My Machine” Problem When Scaling to Cloud
Imagine this scenario: your pandas code works locally but fails on a cloud VM because the environment has different package versions or missing dependencies.
Docker solves this by packaging your environment into a container that runs identically on your laptop and cloud VMs. However, getting it running on cloud infrastructure involves a complex workflow:

Write a Dockerfile listing all dependencies and versions
Build the Docker image (wait 5-10 minutes)
Push to cloud container registry (AWS ECR, Google Container Registry)
Configure cloud VMs (EC2/GCE instances with proper networking and security)
Pull and run the image on cloud machines (3-5 minutes per VM)
Rebuild and redeploy every time you add a package (repeat steps 2-5)

This Docker + cloud workflow slows down development and requires expertise in both containerization and cloud infrastructure management.
Coiled’s Solution: Automatic Package Synchronization
Coiled eliminates Docker entirely through automatic package synchronization. Your local environment replicates to cloud workers with no Dockerfile required.
Instead of managing Docker images and cloud infrastructure, you simply add a decorator to your function:
import coiled
import pandas as pd

@coiled.function(memory="64 GiB", region="us-east-1")
def process_data():
df = pd.read_parquet("s3://my-bucket/data.parquet")
# Your analysis code here
return df.describe()

result = process_data() # Runs on cloud VM with your exact package versions

What Coiled does automatically:

Scans your local environment (pip, conda packages with exact versions)
Creates a dependency manifest (a list of all packages and their versions)
Installs packages on cloud workers with matching versions
Reuses built environments when your dependencies haven’t changed

This is faster than Docker builds in most cases thanks to intelligent caching, and requires zero configuration.
Conclusion
Coiled transforms cloud computing from a multi-day DevOps project into simple Python operations. Whether you’re processing a single large file with Pandas, querying multiple files with Polars, or running distributed aggregations with Dask clusters, Coiled provides the right scaling approach for your needs.
I recommend starting simple with serverless functions for single-file processing, then scaling to Dask clusters when you need truly distributed computing. Coiled removes the infrastructure burden from data science workflows, letting you focus on analysis instead of operations.
Related Tutorials

Polars vs Pandas: Performance Benchmarks and When to Switch – Choose the right DataFrame library for your workloads
DuckDB: Fast SQL Analytics on Parquet Files – Master SQL techniques for processing data with DuckDB
PySpark Pandas API: Familiar Syntax at Scale – Explore Spark as an alternative to Dask for distributed processing

📚 Want to go deeper? Learning new techniques is the easy part. Knowing how to structure, test, and deploy them is what separates side projects from real work. My book shows you how to build data science projects that actually make it to production. Get the book →


Coiled: Scale Python Data Pipeline to the Cloud in Minutes Read More »

Delta Lake: Transform pandas Prototypes into Production

Table of Contents

Introduction
Introduction to Delta-rs
Setup and Data Preparation
Creating Your First Delta Table
Incremental Updates and CRUD Operations
Time Travel and Data Versioning
Schema Evolution in Action
Selective Updates with Merge Operations
Multi-Engine Integration
Automatic File Cleanup
Conclusion

Introduction
Data scientists face a familiar challenge: pandas works perfectly for prototyping, but production requires enterprise features that traditional file formats can’t provide.
Delta-rs solves this by bringing Delta Lake’s ACID transactions, time travel, and schema evolution to Python without Spark dependencies. It transforms your pandas workflow into production-ready pipelines with minimal code changes.
This tutorial shows you how to build scalable data systems using Delta-rs while maintaining the simplicity that makes pandas so effective.

💻 Get the Code: The complete source code and Jupyter notebook for this tutorial are available on GitHub. Clone it to follow along!


Introduction to Delta-rs
Delta-rs is a native Rust implementation of Delta Lake for Python. It provides enterprise-grade data lake capabilities without requiring Spark clusters or JVM setup.
Key advantages over traditional file formats:

ACID transactions ensure data consistency during concurrent operations
Time travel enables access to historical data versions
Schema evolution handles data structure changes automatically
Multi-engine support works with pandas, DuckDB, Polars, and more
Efficient updates support upserts and incremental changes without full rewrites

Setup and Data Preparation
Install Delta-rs and supporting libraries:
pip install deltalake pandas duckdb polars

We’ll use actual NYC Yellow Taxi data to demonstrate real-world scenarios. The NYC Taxi & Limousine Commission provides monthly trip records in Parquet format:
import pandas as pd
from deltalake import DeltaTable, write_deltalake
import duckdb
import polars as pl

# Download NYC Yellow Taxi data (June 2024 as example)
# Full dataset available at: https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
taxi_url = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-06.parquet"

# Read a sample of the data for demonstration
sample_data = pd.read_parquet(taxi_url).head(10000)

print(f"Loaded {len(sample_data)} taxi trips from NYC TLC")
print(f"Data shape: {sample_data.shape}")
print(f"Date range: {sample_data['tpep_pickup_datetime'].min()} to {sample_data['tpep_pickup_datetime'].max()}")

sample_data.head()

Output:
Loaded 10000 taxi trips from NYC TLC
Data shape: (10000, 19)
Date range: 2024-05-31 15:33:34 to 2024-06-01 02:59:54
VendorID tpep_pickup_datetime … congestion_surcharge Airport_fee
0 1 2024-06-01 00:03:46 … 0.0 1.75
1 2 2024-06-01 00:55:22 … 0.0 1.75
2 1 2024-06-01 00:23:53 … 0.0 0.00
3 1 2024-06-01 00:32:24 … 2.5 0.00
4 1 2024-06-01 00:51:38 … 2.5 0.00

[5 rows x 19 columns]

Creating Your First Delta Table
Create your first Delta table in the data directory:
write_deltalake("data/taxi_delta_table", sample_data, mode="overwrite")
print("Created Delta table")

# Read back from Delta table
dt = DeltaTable("data/taxi_delta_table")
df_from_delta = dt.to_pandas()

print(f"Delta table contains {len(df_from_delta)} records")

Output:
Created Delta table
Delta table contains 10000 records

View the Delta table structure:
# Inspect Delta table metadata
print("Delta table schema:")
print(dt.schema().to_arrow())

Output:
Delta table schema:
arro3.core.Schema
————
VendorID: Int32
tpep_pickup_datetime: Timestamp(Microsecond, None)
tpep_dropoff_datetime: Timestamp(Microsecond, None)
passenger_count: Float64
trip_distance: Float64

total_amount: Float64
congestion_surcharge: Float64
Airport_fee: Float64

View the current version of the Delta table:
print(f"Current version: {dt.version()}")

Output:
Current version: 0
Incremental Updates and CRUD Operations
Instead of rewriting entire datasets when adding new records, incremental updates append only what changed. Delta-rs handles these efficient operations natively.
To demonstrate this, we'll simulate late-arriving data:
# Simulate late-arriving data
late_data = pd.read_parquet(taxi_url).iloc[10000:10050]
print(f"New data to add: {len(late_data)} records")

Output:
New data to add: 50 records

Traditional Approach: Process Everything
The pandas workflow requires loading both existing and new data, combining them, and rewriting the entire output file:
# Pandas approach – reload existing data and merge
existing_df = pd.read_parquet(taxi_url).head(10000)
complete_df = pd.concat([existing_df, late_data])
complete_df.to_parquet("data/taxi_complete.parquet")
print(f"Processed {len(complete_df)} total records")

Output:
Processed 10050 total records

Pandas processed all 10,050 records to add just 50 new ones, demonstrating the inefficiency of full-dataset operations.
Delta-rs Approach: Process Only New Data
Delta-rs appends only the new records without touching existing data:
# Delta-rs – append only what's new
write_deltalake("data/taxi_delta_table", late_data, mode="append")

dt = DeltaTable("data/taxi_delta_table")
print(f"Added {len(late_data)} new records")
print(f"Table version: {dt.version()}")

Output:
Added 50 new records
Table version: 1

Delta-rs processed only the 50 new records while automatically incrementing to version 1, enabling efficient operations and data lineage.
Time Travel and Data Versioning
Time travel and data versioning let you access any previous state of your data. This is essential for auditing changes, recovering from errors, and understanding how data evolved over time without maintaining separate backup files.
Traditional Approach: Manual Backup Strategy
Traditional file-based workflows rely on timestamped copies and manual versioning:
# Traditional approach – manual timestamped backups
import datetime
timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
df.to_parquet(f"data/taxi_backup_{timestamp}.parquet") # Create manual backup
df_modified.to_parquet("data/taxi_data.parquet") # Overwrite original
# To recover: manually identify and reload backup file

Delta-rs Approach: Built-in Time Travel
Delta-rs automatically tracks every change with instant access to any version:
# Access any historical version instantly
dt_v0 = DeltaTable("data/taxi_delta_table", version=0)
current_dt = DeltaTable("data/taxi_delta_table")

print(f"Version 0: {len(dt_v0.to_pandas())} records")
print(f"Current version: {len(current_dt.to_pandas())} records")
print(f"Available versions: {current_dt.version() + 1}")

Output:
Version 0: 10000 records
Current version: 10050 records
Available versions: 2

Delta-rs maintains 2 complete versions while traditional backups would require separate 57MB files for each timestamp.
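Rolling back is just as direct: read the version you want and write it back as the newest version. A minimal sketch using only the APIs shown above (the rollback itself becomes another entry in the history):
# Read the table as it looked at version 0
restored_df = DeltaTable("data/taxi_delta_table", version=0).to_pandas()

# Write it back; this creates a new version rather than erasing history
write_deltalake("data/taxi_delta_table", restored_df, mode="overwrite")
print(f"Rolled back; table is now at version {DeltaTable('data/taxi_delta_table').version()}")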

📚 For comprehensive production data workflows and version control best practices, check out Production-Ready Data Science.

Schema Evolution in Action
As requirements evolve, you often need to add new columns or change data types. Schema evolution handles these changes automatically, letting you update your data structure without breaking existing queries or reprocessing historical records.
To demonstrate this, imagine NYC’s taxi authority introduces weather tracking and surge pricing features, requiring your pipeline to handle new weather_condition and surge_multiplier columns alongside existing fare data.
# Copy the existing data
enhanced_data = pd.read_parquet(taxi_url).iloc[20000:20100].copy()

# Simulate new data with additional business columns
weather_options = ['clear', 'rain', 'snow', 'cloudy']
surge_options = [1.0, 1.2, 1.5, 2.0]
enhanced_data['weather_condition'] = [weather_options[i % 4] for i in range(len(enhanced_data))]
enhanced_data['surge_multiplier'] = [surge_options[i % 4] for i in range(len(enhanced_data))]

print(f"Enhanced data: {len(enhanced_data)} records with {len(enhanced_data.columns)} columns")
print(f"New columns: {[col for col in enhanced_data.columns if col not in sample_data.columns]}")

Output:
Enhanced data: 100 records with 21 columns
New columns: ['weather_condition', 'surge_multiplier']

Traditional Approach: No Schema History
Traditional formats provide no tracking of schema changes or evolution history:
# Traditional approach – no schema versioning or history
df_v1 = pd.read_parquet("taxi_v1.parquet") # Original schema
df_v2 = pd.read_parquet("taxi_v2.parquet") # Enhanced schema

Delta-rs Approach: Schema Versioning and History
Delta-rs automatically merges schemas while tracking every change:
# Schema evolution with automatic versioning
write_deltalake(
"data/taxi_delta_table",
enhanced_data,
mode="append",
schema_mode="merge"
)

dt = DeltaTable("data/taxi_delta_table")
print(f"Schema evolved: {len(dt.to_pandas().columns)} columns | Version: {dt.version()}")

Output:
Schema evolved: 21 columns | Version: 2

Explore the complete schema evolution history and access any previous version:
# View schema change history
history = dt.history()
for entry in history[:2]:
print(f"Version {entry['version']}: {entry['operation']} at {entry['timestamp']}")

# Access different schema versions
original_schema = DeltaTable("data/taxi_delta_table", version=0)
print(f"\nOriginal schema (v0): {len(original_schema.to_pandas().columns)} columns")
print(f"Current schema (v{dt.version()}): {len(dt.to_pandas().columns)} columns")

Output:
Version 2: WRITE at 1755180763083
Version 1: WRITE at 1755180762968

Original schema (v0): 19 columns
Current schema (v2): 21 columns

Delta-rs expanded from 19 to 21 columns across 10,150 records without schema migration scripts or pipeline failures.
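Rows written before the schema change should simply read back with nulls in the new columns; a quick sanity check (a sketch, assuming that merge behavior):
evolved_df = dt.to_pandas()

# Older rows have no weather or surge values after the schema merge
print(evolved_df["weather_condition"].isna().sum(), "rows predate the weather column")
print(evolved_df["surge_multiplier"].notna().sum(), "rows carry the new surge data")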
Selective Updates with Merge Operations
Merge operations combine updates and inserts in a single transaction based on matching conditions. This eliminates the need to process entire datasets when you only need to modify specific records, dramatically improving efficiency at scale.
To demonstrate this, let’s create a simple taxi trips table:
# Create initial Delta table with 5 trips
trips = pd.DataFrame({
'trip_id': [1, 2, 3, 4, 5],
'fare_amount': [15.5, 20.0, 18.3, 12.5, 25.0],
'payment_type': [1, 1, 2, 1, 2]
})
write_deltalake("data/trips_merge_demo", trips, mode="overwrite")
print("Initial trips:")
print(trips)

Output:
Initial trips:
trip_id fare_amount payment_type
0 1 15.5 1
1 2 20.0 1
2 3 18.3 2
3 4 12.5 1
4 5 25.0 2

Here are the updates we want to make:

Update trip 2: change fare from $20.00 to $22.00
Update trip 4: change fare from $12.50 to $13.80
Insert trip 6: new trip with fare $30.00
Insert trip 7: new trip with fare $16.50

Traditional Approach: Full Dataset Processing
Traditional workflows require loading complete datasets, identifying matches, and rewriting all records. This process becomes increasingly expensive as data grows:
# Traditional approach – load, modify, and rewrite everything
existing_df = trips.copy()

# Updates: manually locate and modify rows
existing_df.loc[existing_df['trip_id'] == 2, 'fare_amount'] = 22.0
existing_df.loc[existing_df['trip_id'] == 4, 'fare_amount'] = 13.8

# Inserts: create new rows and append
new_trips = pd.DataFrame({
'trip_id': [6, 7],
'fare_amount': [30.0, 16.5],
'payment_type': [1, 1]
})
updated_df = pd.concat([existing_df, new_trips], ignore_index=True)

# Rewrite entire dataset
updated_df.to_parquet("data/trips_traditional.parquet")
print(updated_df)

Output:
trip_id fare_amount payment_type
0 1 15.5 1
1 2 22.0 1 # Updated
2 3 18.3 2
3 4 13.8 1 # Updated
4 5 25.0 2
5 6 30.0 1 # Inserted
6 7 16.5 1 # Inserted

Delta-rs Approach: Upsert with Merge Operations
Delta-rs merge operations handle both updates and inserts in a single atomic operation, processing only affected records:
# Prepare changes: 2 updates + 2 inserts
changes = pd.DataFrame({
'trip_id': [2, 4, 6, 7],
'fare_amount': [22.0, 13.8, 30.0, 16.5],
'payment_type': [2, 2, 1, 1]
})

# Load Delta table
dt = DeltaTable("data/trips_merge_demo")

# Upsert operation: update existing, insert new
(
dt.merge(
source=changes,
predicate="target.trip_id = source.trip_id",
source_alias="source",
target_alias="target",
)
.when_matched_update(
updates={
"fare_amount": "source.fare_amount",
"payment_type": "source.payment_type",
}
)
.when_not_matched_insert(
updates={
"trip_id": "source.trip_id",
"fare_amount": "source.fare_amount",
"payment_type": "source.payment_type",
}
)
.execute()
)

# Verify results
result = dt.to_pandas().sort_values('trip_id').reset_index(drop=True)
print(result)

Output:
trip_id fare_amount payment_type
0 1 15.5 1
1 2 22.0 2 # Updated
2 3 18.3 2
3 4 13.8 2 # Updated
4 5 25.0 2
5 6 30.0 1 # Inserted
6 7 16.5 1 # Inserted

Delta-rs processed exactly 4 records (2 updates + 2 inserts) while pandas processed all 7 records. This efficiency compounds dramatically with larger datasets.
Multi-Engine Integration
Different teams often use different tools: pandas for exploration, DuckDB for SQL queries, Polars for performance. Multi-engine support lets all these tools access the same data directly without creating duplicates or writing conversion scripts.
Traditional Approach: Engine-Specific Optimization Requirements
Each engine needs different file optimizations that don’t transfer between tools:
Start with the original dataset:
# Traditional approach – Each engine needs different optimizations
data = {"payment_type": [1, 1, 2, 1, 2], "fare_amount": [15.5, 20.0, 18.3, 12.5, 25.0]}
df = pd.DataFrame(data)

The Pandas team optimizes for indexed lookups:
# Pandas team needs indexed Parquet for fast lookups
df.to_parquet("data/pandas_optimized.parquet", index=True)
pandas_result = pd.read_parquet("data/pandas_optimized.parquet")
print(f"Pandas: {len(pandas_result)} trips, avg ${pandas_result['fare_amount'].mean():.2f}")

Output:
Pandas: 5 trips, avg $17.66

The Polars team needs sorted data for predicate pushdown optimization:
# Polars team needs sorted columns for predicate pushdown
df.sort_values('payment_type').to_parquet("data/polars_optimized.parquet")
polars_result = pl.read_parquet("data/polars_optimized.parquet").select([
pl.len().alias("trips"), pl.col("fare_amount").mean().alias("avg_fare")
])
print(f"Polars: {polars_result}")

Polars: shape: (1, 2)
┌───────┬──────────┐
│ trips ┆ avg_fare │
│ — ┆ — │
│ u32 ┆ f64 │
╞═══════╪══════════╡
│ 5 ┆ 18.26 │
└───────┴──────────┘

The DuckDB team requires specific compression for query performance:
# DuckDB needs specific compression/statistics for query planning
df.to_parquet("data/duckdb_optimized.parquet", compression='zstd')
duckdb_result = duckdb.execute("""
SELECT COUNT(*) as trips, ROUND(AVG(fare_amount), 2) as avg_fare
FROM 'data/duckdb_optimized.parquet'
""").fetchone()
print(f"DuckDB: {duckdb_result[0]} trips, ${duckdb_result[1]} avg")

Output:
DuckDB: 5 trips, $18.26 avg

Delta-rs Approach: Universal Optimizations
Delta-rs provides built-in optimizations that benefit all engines simultaneously:
Create one optimized Delta table that serves all engines:
# Delta-rs approach – Universal optimizations for all engines
from deltalake import write_deltalake, DeltaTable
import polars as pl
import duckdb

# Create Delta table with built-in optimizations:
data = {"payment_type": [1, 1, 2, 1, 2], "fare_amount": [15.5, 20.0, 18.3, 12.5, 25.0]}
write_deltalake("data/universal_demo", pd.DataFrame(data))

Pandas benefits from Delta’s statistics for efficient filtering:
# Pandas gets automatic optimization benefits
dt = DeltaTable("data/universal_demo")
pandas_result = dt.to_pandas()
print(f"Pandas: {len(pandas_result)} trips, avg ${pandas_result['fare_amount'].mean():.2f}")

Output:
Pandas: 5 trips, avg $17.66

Polars leverages Delta’s column statistics for predicate pushdown:
# Polars gets predicate pushdown optimization automatically
polars_result = pl.read_delta("data/universal_demo").select([
pl.len().alias("trips"),
pl.col("fare_amount").mean().alias("avg_fare")
])
print(f"Polars: {polars_result}")

Output:
Polars: shape: (1, 2)
┌───────┬──────────┐
│ trips ┆ avg_fare │
│ — ┆ — │
│ u32 ┆ f64 │
╞═══════╪══════════╡
│ 5 ┆ 18.26 │
└───────┴──────────┘

DuckDB uses Delta’s statistics for query planning optimization:
# DuckDB gets optimized query plans from Delta statistics
duckdb_result = duckdb.execute("""
SELECT COUNT(*) as trips, ROUND(AVG(fare_amount), 2) as avg_fare
FROM delta_scan('data/universal_demo')
""").fetchone()
print(f"DuckDB: {duckdb_result[0]} trips, ${duckdb_result[1]} avg")

Output:
DuckDB: 5 trips, $17.66

One Delta table with universal optimizations benefiting all engines.
Automatic File Cleanup
Every data update creates new files while keeping old versions for time travel. Vacuum identifies files older than your retention period and safely deletes them, freeing storage space without affecting active data or recent history.
Traditional Approach: Manual Cleanup Scripts
Traditional workflows require custom scripts to manage file cleanup:
# Traditional approach – manual file management
import os
import glob
from datetime import datetime, timedelta

# Find old backup files manually
old_files = []
cutoff_date = datetime.now() - timedelta(days=7)
for file in glob.glob("data/taxi_backup_*.parquet"):
file_time = datetime.fromtimestamp(os.path.getmtime(file))
if file_time < cutoff_date:
old_files.append(file)
os.remove(file) # Manual cleanup with risk

Delta-rs Approach: Built-in Vacuum Operation
Delta-rs provides safe, automated cleanup through its vacuum() operation, which removes unused transaction files while preserving data integrity. Files become unused when:
• UPDATE operations create new versions, leaving old data files unreferenced
• DELETE operations remove data, making those files obsolete
• Failed transactions leave temporary files that were never committed
• Table optimization consolidates small files, making originals unnecessary
# Delta-rs vacuum removes unused files safely with ACID protection
from deltalake import DeltaTable
import os

def get_size(path):
    """Calculate total directory size in MB"""
    total_size = 0
    for dirpath, dirnames, filenames in os.walk(path):
        for filename in filenames:
            total_size += os.path.getsize(os.path.join(dirpath, filename))
    return total_size / (1024 * 1024)

With our size calculation helper in place, let’s measure storage before and after vacuum:
dt = DeltaTable("data/taxi_delta_table")

# Measure storage before cleanup
before_size = get_size("data/taxi_delta_table")

# Safe cleanup – files only deleted if no active readers/writers
dt.vacuum(retention_hours=168) # Built-in safety: won't delete files in use

# Measure storage after cleanup
after_size = get_size("data/taxi_delta_table")

print(f"Delta vacuum completed safely")
print(f"Storage before: {before_size:.1f} MB")
print(f"Storage after: {after_size:.1f} MB")
print(f"Space reclaimed: {before_size – after_size:.1f} MB")

Output:
Delta vacuum completed safely
Storage before: 8.2 MB
Storage after: 5.7 MB
Space reclaimed: 2.5 MB

Delta vacuum removed 2.5 MB of obsolete file versions, reducing storage footprint by 30% while maintaining ACID transaction guarantees and time travel capabilities.
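To confirm the table is still healthy after vacuum, the deltalake API can report the current version and the commit history (a quick check, reusing the table path from above):
from deltalake import DeltaTable

dt = DeltaTable("data/taxi_delta_table")
print(dt.version())       # latest table version is unchanged by vacuum
print(len(dt.history()))  # commit log entries remain available for auditing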
Conclusion
Delta-rs transforms the traditional pandas workflow by providing:

Incremental updates append only changed records without full rewrites
Time travel and versioning enable recovery and auditing without manual backups
Schema evolution handles column changes without breaking queries
Merge operations combine updates and inserts in single transactions
Multi-engine support lets pandas, DuckDB, and Polars access the same data
Automatic vacuum reclaims storage by removing obsolete file versions

The bridge from pandas prototyping to production data pipelines no longer requires complex infrastructure. Delta-rs provides the reliability and performance you need while maintaining the simplicity you want.
Related Tutorials

Alternative Scaling: Scaling Pandas Workflows with PySpark’s Pandas API for Spark-based approaches
Data Versioning: Version Control for Data and Models Using DVC for broader versioning strategies
DataFrame Performance: Polars vs. Pandas: A Fast, Multi-Core Alternative for DataFrame optimization techniques

📚 Want to go deeper? Learning new techniques is the easy part. Knowing how to structure, test, and deploy them is what separates side projects from real work. My book shows you how to build data science projects that actually make it to production. Get the book →


Delta Lake: Transform pandas Prototypes into Production Read More »

Polars vs. Pandas: A Fast, Multi-Core Alternative for DataFrames

Table of Contents

Introduction
Why Consider Polars?
Setup
Multi-Core Performance
Timing Decorator
Plotting Setup
Reading CSV Files
Groupby Mean
Filter Rows
Sort by Column
Observations

Lazy Evaluation (Only in Polars)
Syntax Comparison
Filtering rows
Selecting columns
Chained operations

Memory Efficiency
Missing Features (Where Pandas Wins)
Summary
When to Use Polars vs. Pandas

Final Thoughts

Introduction
Pandas has been the standard tool for working with tabular data in Python for many years. But as datasets grow larger and performance becomes more important, Polars has emerged as an increasingly popular, faster alternative that uses multiple CPU cores.
This guide compares Polars and Pandas, highlighting where Polars shines and showing how to get started with it.

💻 Get the Code: The complete source code and Jupyter notebook for this tutorial are available on GitHub. Clone it to follow along!


Key Takeaways
Here’s what you’ll learn:

Achieve 9× faster CSV reading with Polars’ multi-threaded architecture
Reduce memory usage by 30-50% using Apache Arrow backend optimization
Enable lazy evaluation for automatic query optimization and predicate pushdown
Implement seamless migration from Pandas with similar API patterns
Scale data processing without changing code using built-in parallel execution

Why Consider Polars?
Polars is a DataFrame library written in Rust, designed for high performance. Key advantages:

Multi-threaded by default
Lazy execution for optimization
Low memory usage
Strong support for Arrow and Parquet

Let’s explore how Polars improves upon Pandas with practical examples.
Setup
Install both libraries and generate a sample dataset for testing:
pip install pandas polars

To follow along with the examples, you can generate a large CSV file like this:
import pandas as pd
import numpy as np

# Create a large dataset
n_rows = 10_000_000
data = {
    "category": np.random.choice(["A", "B", "C", "D"], size=n_rows),
    "value": np.random.rand(n_rows) * 1000,
}
df = pd.DataFrame(data)
df.to_csv("large_file.csv", index=False)

This will create a large_file.csv with 10 million rows for performance testing.
Multi-Core Performance
Polars uses multiple CPU cores for data processing, while Pandas relies on a single core. As a result, Polars handles larger datasets more efficiently and often completes operations faster than Pandas.
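You can verify how many threads Polars will use on your machine (a quick check; the pool size defaults to the available CPU cores and can be capped with the POLARS_MAX_THREADS environment variable):
import polars as pl

# Number of threads in Polars' global thread pool
print(pl.thread_pool_size())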

Let’s benchmark a few common operations using Pandas and Polars: reading CSVs, groupby, filter, and sort.
Timing Decorator
We define a reusable timing decorator to simplify benchmarking in this section:
import time
from functools import wraps

def timeit(operation_name):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start_time = time.time()
            result = func(*args, **kwargs)
            end_time = time.time()
            execution_time = end_time - start_time
            print(f"{operation_name} took {execution_time:.2f} seconds")
            return result, execution_time
        return wrapper
    return decorator

Plotting Setup
Let’s define a helper function to simplify chart creation for each operation.
import matplotlib.pyplot as plt
import seaborn as sns

def create_comparison_plot(pandas_time, polars_time, title):
    sns.barplot(
        hue=["Pandas", "Polars"],
        y=[pandas_time, polars_time],
        palette=["#E583B6", "#72BEFA"],
    )
    plt.title(f"{title} (seconds)")
    plt.ylabel("Time (s)")
    plt.show()

Reading CSV Files
Pandas:
@timeit("Pandas read_csv")
def read_pandas():
    return pd.read_csv("large_file.csv")

pandas_df, pandas_read_time = read_pandas()

Output:
Pandas read_csv took 1.38 seconds

Polars:
import polars as pl

@timeit("Polars read_csv")
def read_polars():
    return pl.read_csv("large_file.csv")

polars_df, polars_read_time = read_polars()

Output:
Polars read_csv took 0.15 seconds

create_comparison_plot(pandas_read_time, polars_read_time, "CSV Read Time")

Groupby Mean
Pandas:
@timeit("Pandas groupby")
def pandas_groupby(df):
    return df.groupby("category")["value"].mean()

pandas_result, pandas_groupby_time = pandas_groupby(pandas_df)

Output:
Pandas groupby took 0.53 seconds

Polars:
@timeit("Polars groupby")
def polars_groupby(df):
    return df.group_by("category").agg(pl.col("value").mean())

polars_result, polars_groupby_time = polars_groupby(polars_df)

Output:
Polars groupby took 0.20 seconds

create_comparison_plot(pandas_groupby_time, polars_groupby_time, "Groupby Mean Time")

Filter Rows
Pandas:
@timeit("Pandas filter")
def pandas_filter(df):
    return df[df["value"] > 500]

pandas_filtered, pandas_filter_time = pandas_filter(pandas_df)

Output:
Pandas filter took 0.15 seconds

Polars:
@timeit("Polars filter")
def polars_filter(df):
    return df.filter(pl.col("value") > 500)

polars_filtered, polars_filter_time = polars_filter(polars_df)

Output:
Polars filter took 0.03 seconds

create_comparison_plot(pandas_filter_time, polars_filter_time, "Filter Rows Time")

Sort by Column
Pandas:
@timeit("Pandas sort")
def pandas_sort(df):
    return df.sort_values("value")

pandas_sorted, pandas_sort_time = pandas_sort(pandas_df)

Output:
Pandas sort took 2.74 seconds

Polars:
@timeit("Polars sort")
def polars_sort(df):
    return df.sort("value")

polars_sorted, polars_sort_time = polars_sort(polars_df)

Output:
Polars sort took 0.49 seconds

create_comparison_plot(pandas_sort_time, polars_sort_time, "Sort Time")

Observations
Polars consistently outperformed Pandas across all operations:

CSV Reading: 9× faster
Groupby: Over 2.5× faster
Filtering: 5× faster
Sorting: 5.5× faster

This performance advantage is largely due to Polars’ native Rust engine and multi-threaded execution. It’s especially valuable for large-scale data processing pipelines.
For SQL-based analytics that complement Polars’ performance advantages, see our DuckDB deep dive guide.
Lazy Evaluation (Only in Polars)
Pandas executes operations eagerly, which can be inefficient for large pipelines. For example:
(
    pd.read_csv("large_file.csv")
    .query('value > 100')
    .groupby('category', as_index=False)['value']
    .mean()
    .rename(columns={"value": "avg_value"})
)

This code is inefficient because:

Full CSV Load: Pandas loads the entire file into memory, even rows or columns we won’t use.
No Predicate Pushdown: The .query() filter is applied after reading all rows, not during the read.
No Projection Pushdown: All columns are read from disk, even if we only use value and category.

Polars supports lazy evaluation, which plans and optimizes the entire computation graph before execution:
query = (
    pl.read_csv("large_file.csv").lazy()
    .filter(pl.col("value") > 100)
    .group_by("category")
    .agg(pl.col("value").mean().alias("avg_value"))
)

print(query.explain())

Output:
FILTER [(col("avg_value")) > (100.0)]
FROM
AGGREGATE
[col("value").mean().alias("avg_value")] BY [col("category")]
FROM
DF ["category", "value"]; PROJECT["value", "category"] 2/2 COLUMNS

Optimizations used by Polars:

Lazy Execution: The computation is deferred and optimized before execution.
Predicate Pushdown: Filters like value > 100 are applied during the scan, skipping irrelevant rows (see the scan_csv sketch after this list).
Projection Pushdown: Only necessary columns (category and value) are read.
Single-Pass Execution: Operations can be executed in a streaming manner, improving performance.

The pipeline is only executed when .collect() is called:
result = query.collect()
result.head()

Output:
shape: (4, 2)
┌──────────┬────────────┐
│ category ┆ avg_value  │
│ ---      ┆ ---        │
│ str      ┆ f64        │
╞══════════╪════════════╡
│ A        ┆ 550.245739 │
│ C        ┆ 549.985808 │
│ D        ┆ 549.649933 │
│ B        ┆ 550.055504 │
└──────────┴────────────┘

Syntax Comparison
Both libraries offer expressive APIs, but Polars’ syntax is more functional and chainable, which makes it easier to write pipelines with multiple operations in a single expression.
Filtering rows
Pandas:
df_pd.query("value > 1000")

Polars:
df_pl.filter(pl.col("value") > 1000)

Selecting columns
Pandas:
df_pd[["category", "value"]]

Polars:
df_pl.select(["category", "value"])

Polars feels more like writing SQL pipelines, especially in lazy mode.
Chained operations
Pandas:
(df_pd.query("value > 1000")
    .groupby("category")["value"]
    .mean()
    .reset_index())

Polars:
(df_pl
    .filter(pl.col("value") > 1000)
    .group_by("category")
    .agg(pl.col("value").mean().alias("avg_value")))

Polars lets you chain operations together, which often makes code more readable and efficient.
Memory Efficiency
Polars has its own columnar and compact implementation optimized for performance. This design avoids Python object overhead, unlike Pandas.
df_pd.memory_usage(deep=True).sum() / 1e6  # pandas DataFrame size in MB
df_pl.estimated_size() / 1e6  # Polars DataFrame size in MB

Polars usually consumes less memory, especially with strings or categorical data.
Missing Features (Where Pandas Wins)
While Polars is fast, Pandas is still better at:

Interoperability with other Python libraries (e.g., matplotlib, scikit-learn)
Rich ecosystem (e.g., pandas-profiling, pandasql)
More mature and flexible time series tools

Summary
When to Use Polars vs. Pandas

Scenario | Use Polars ✅ | Use Pandas ✅
Large datasets and file sizes | ✅ Yes | ⚠️ May struggle with memory
Fast CSV/Parquet I/O | ✅ Much faster | 🚫 Slower for large files
Multi-core performance | ✅ Built-in multi-threading | 🚫 Single-threaded by default
Lazy evaluation for optimized pipelines | ✅ Supported | 🚫 Not available
Low memory consumption | ✅ Apache Arrow backend | 🚫 Higher memory overhead
Integration with scikit-learn, statsmodels, etc. | 🚫 Limited | ✅ Excellent ecosystem integration
Flexible time series handling | 🚫 Basic support | ✅ Advanced and robust
Small-to-medium datasets | ⚠️ May be overkill | ✅ Simple and effective

For scaling beyond single-machine processing, see our PySpark Pandas API guide.
Final Thoughts
Polars is not here to replace Pandas, but it’s an excellent choice when performance matters. If your workflows are slow or memory-bound, Polars can drastically reduce execution time without compromising expressiveness.
Tip: You can even mix both libraries. Load and process large files with Polars, then convert to Pandas when needed:
df_pd = df_pl.to_pandas()
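Going the other direction is just as simple; pl.from_pandas builds a Polars DataFrame from an existing pandas one:
df_pl = pl.from_pandas(df_pd)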

For modern Python development best practices that complement Polars’ performance focus, see our UV package manager guide.

📚 Want to go deeper? Learning new techniques is the easy part. Knowing how to structure, test, and deploy them is what separates side projects from real work. My book shows you how to build data science projects that actually make it to production. Get the book →


Polars vs. Pandas: A Fast, Multi-Core Alternative for DataFrames Read More »

DuckDB: Simplify DataFrame Analysis with Serverless SQL

Using SQL with pandas empowers data scientists to leverage SQL’s powerful querying capabilities alongside the data manipulation functionalities of pandas.

However, traditional database systems often demand the management of a separate DBMS server, introducing additional complexity to the workflow.

With DuckDB, you can efficiently run SQL operations on pandas DataFrames without the need to manage a separate DBMS server.
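As a quick illustration of the idea (the DataFrame here is made up, not from the linked post), DuckDB can query a pandas DataFrame that is in scope simply by referencing its variable name:
import duckdb
import pandas as pd

df = pd.DataFrame({"category": ["A", "B", "A"], "value": [10, 20, 30]})

# DuckDB's replacement scans resolve `df` to the local DataFrame
result = duckdb.sql("SELECT category, AVG(value) AS avg_value FROM df GROUP BY category").df()
print(result)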

DuckDB: Simplify DataFrame Analysis with Serverless SQL Read More »

Pandera: Data Validation Made Simple for Python DataFrames

Poor data quality can lead to incorrect conclusions and bad model performance. Thus, it is important to check data for consistency and reliability before using it.

pandera makes it easy to perform data validation on dataframe-like objects. If the dataframe does not pass validation checks, pandera provides useful error messages.
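A minimal sketch of the idea (the schema and column names below are illustrative, not taken from the linked post):
import pandas as pd
import pandera as pa

# Declare expectations for each column
schema = pa.DataFrameSchema({
    "fare_amount": pa.Column(float, pa.Check.ge(0)),
    "payment_type": pa.Column(int, pa.Check.isin([1, 2])),
})

df = pd.DataFrame({"fare_amount": [12.5, 20.0], "payment_type": [1, 2]})
schema.validate(df)  # raises a SchemaError with a descriptive message if a check fails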

Pandera: Data Validation Made Simple for Python DataFrames Read More »

DuckDB + PyArrow: 2900x Faster Than pandas for Large Dataset Processing

DuckDB optimizes query execution with multiple optimizations, while PyArrow efficiently manages in-memory data processing and storage. Combining DuckDB and PyArrow allows you to efficiently process datasets larger than memory on a single machine.

In the full post, we convert a Delta Lake table with over 6 million rows to a pandas DataFrame and a PyArrow dataset, which DuckDB then queries.

Running DuckDB on a PyArrow dataset is approximately 2906 times faster than running DuckDB on a pandas DataFrame.
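To sketch the pattern (the dataset path and query below are hypothetical, not the post's benchmark), DuckDB can scan a PyArrow dataset referenced by a local variable without materializing it as a DataFrame first:
import duckdb
import pyarrow.dataset as ds

# Point PyArrow at a directory of Parquet files; nothing is loaded into memory yet
arrow_dataset = ds.dataset("data/parquet_dir/", format="parquet")

# DuckDB streams over the Arrow dataset during query execution
print(duckdb.sql("SELECT COUNT(*) FROM arrow_dataset").fetchone())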

DuckDB + PyArrow: 2900x Faster Than pandas for Large Dataset Processing Read More »
