Portable DataFrames in Python: When to Use Ibis, Narwhals, or Fugue

Table of Contents

Introduction
Setup
Ibis: Compile Once, Run Anywhere
Narwhals: The Zero-Dependency Compatibility Layer
Fugue: Keep Your Code, Swap the Engine
Summary

Introduction
The backend you start with is not always the backend you finish with. Teams commonly prototype in pandas, scale in Spark, or transition from DuckDB to a warehouse environment. Maintaining separate implementations of the same pipeline across backends can quickly become costly and error-prone.
Rather than reimplementing the same pipeline, you can define the logic once and execute it on different backends with one of these tools:

Ibis: Uses its own expression API and compiles it to backend-native SQL. Best for data scientists working across SQL systems.
Narwhals: Exposes a Polars-like API on top of the user’s existing dataframe library. Best for library authors building dataframe-agnostic tools.
Fugue: Runs Python functions and FugueSQL across distributed engines. Best for data engineers scaling pandas workflows to Spark, Dask, or Ray.

In this article, we’ll walk through these three tools side by side so you can choose the portability approach that best fits your workflow.

💻 Get the Code: The complete source code and Jupyter notebook for this tutorial are available on GitHub. Clone it to follow along!

Setup
All examples in this article use the NYC Yellow Taxi dataset (January 2024, ~3M rows). Download the Parquet file before getting started:
curl -O https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet

Ibis: Compile Once, Run Anywhere
Ibis provides a declarative Python API that compiles to SQL:

What it does: Translates Python expressions into backend-native SQL for DuckDB, PostgreSQL, BigQuery, Snowflake, and 22+ other backends
How it works: Compiles expressions to SQL and delegates execution to the backend engine
Who it’s for: Data scientists working across SQL systems who want one API for all backends

In other words, Ibis compiles your code to SQL for the backend, then lets you collect results as pandas, Polars, or PyArrow.

To get started, install Ibis with the DuckDB backend since DuckDB is fast and needs no server setup:
pip install 'ibis-framework[duckdb]'

This article uses ibis-framework v12.0.0.


Expression API and Lazy Execution
Ibis uses lazy evaluation. Nothing executes until you explicitly request results. This allows the backend’s query optimizer to plan the most efficient execution.
First, connect to DuckDB as the execution backend. The read_parquet call registers the file with DuckDB rather than loading it into memory, keeping the workflow lazy from the start:
import ibis

con = ibis.duckdb.connect()
t = con.read_parquet("yellow_tripdata_2024-01.parquet")

Next, define the analysis using Ibis’s expression API. Since Ibis is lazy, this only builds an expression tree without touching the data:
result = (
    t.group_by("payment_type")
    .aggregate(
        total_fare=t.fare_amount.sum(),
        avg_fare=t.fare_amount.mean(),
        trip_count=t.count(),
    )
    .order_by(ibis.desc("trip_count"))
)

Finally, call .to_pandas() to trigger execution. DuckDB runs the query and returns the result as a pandas DataFrame:
df = result.to_pandas()
print(df)

payment_type total_fare avg_fare trip_count
0 1 43035538.92 18.557432 2319046
1 2 7846602.79 17.866037 439191
2 0 2805509.77 20.016194 140162
3 4 62243.19 1.334889 46628
4 3 132330.09 6.752569 19597

Inspecting Generated SQL
Because each backend executes native SQL (not Python), Ibis lets you inspect the compiled query with ibis.to_sql(). This is useful for debugging performance or auditing the exact SQL sent to your database:
print(ibis.to_sql(result))

SELECT
  *
FROM (
  SELECT
    "t0"."payment_type",
    SUM("t0"."fare_amount") AS "total_fare",
    AVG("t0"."fare_amount") AS "avg_fare",
    COUNT(*) AS "trip_count"
  FROM "ibis_read_parquet_hdy7njbsxfhbjcet43die5ahvu" AS "t0"
  GROUP BY
    1
) AS "t1"
ORDER BY
  "t1"."trip_count" DESC

Backend Switching
To run the same logic on PostgreSQL, you only change the connection. The expression code stays identical:
# Switch to PostgreSQL: only the connection changes
con = ibis.postgres.connect(host="db.example.com", database="taxi")
t = con.table("yellow_tripdata")

# The same expression code works without any changes
result = (
    t.group_by("payment_type")
    .aggregate(
        total_fare=t.fare_amount.sum(),
        avg_fare=t.fare_amount.mean(),
        trip_count=t.count(),
    )
    .order_by(ibis.desc("trip_count"))
)

Ibis supports 22+ backends including DuckDB, PostgreSQL, MySQL, SQLite, BigQuery, Snowflake, Databricks, ClickHouse, Trino, PySpark, Polars, DataFusion, and more. Each backend receives SQL optimized for its specific dialect.

For a comprehensive guide to PySpark SQL, see The Complete PySpark SQL Guide.

Filtering and Chaining
Ibis expressions chain naturally. The syntax is close to Polars, with methods like .filter(), .select(), and .order_by():
high_fare_trips = (
    t.filter(t.fare_amount > 20)
    .filter(t.trip_distance > 5)
    .select("payment_type", "fare_amount", "trip_distance", "tip_amount")
    .order_by(ibis.desc("fare_amount"))
    .limit(5)
)

print(high_fare_trips.to_pandas())

payment_type fare_amount trip_distance tip_amount
0 2 2221.3 31.95 0.0
1 2 1616.5 233.25 0.0
2 2 912.3 142.62 0.0
3 2 899.0 157.25 0.0
4 2 761.1 109.75 0.0

Narwhals: The Zero-Dependency Compatibility Layer
While Ibis compiles to SQL, Narwhals works directly with Python dataframe libraries:

What it does: Wraps existing dataframe libraries (pandas, Polars, PyArrow) with a thin, Polars-like API
How it works: Translates calls directly to the underlying library instead of compiling to SQL
Who it’s for: Library authors who want their package to accept any dataframe type without adding dependencies

If you’re building a library, you don’t control which dataframe library your users prefer. Without Narwhals, you’d maintain separate implementations for each library. With Narwhals, one code path replaces all of them, so every bug fix and feature update applies to all dataframe types at once.

To install Narwhals, run:
pip install narwhals

This article uses narwhals v2.16.0.

For more Narwhals examples across pandas, Polars, and PySpark, see Narwhals: Unified DataFrame Functions for pandas, Polars, and PySpark.

The from_native / to_native Pattern
The core pattern has three steps: convert the incoming dataframe to Narwhals, do your work, and convert back.
First, load a pandas DataFrame and wrap it with nw.from_native(). This gives you a Narwhals DataFrame with a Polars-like API:
import narwhals as nw
import pandas as pd

df_pd = pd.read_parquet("yellow_tripdata_2024-01.parquet")
df = nw.from_native(df_pd)

Next, use Narwhals’ API to define the analysis. The syntax mirrors Polars with nw.col(), .agg(), and .sort():
result = (
    df.group_by("payment_type")
    .agg(
        nw.col("fare_amount").sum().alias("total_fare"),
        nw.col("fare_amount").mean().alias("avg_fare"),
        nw.col("fare_amount").count().alias("trip_count"),
    )
    .sort("trip_count", descending=True)
)

Finally, call .to_native() to convert back to the original library. Since we started with pandas, we get a pandas DataFrame back:
print(result.to_native())

payment_type total_fare avg_fare trip_count
1 1 43035538.92 18.557432 2319046
0 2 7846602.79 17.866037 439191
4 0 2805509.77 20.016194 140162
2 4 62243.19 1.334889 46628
3 3 132330.09 6.752569 19597

To see the real benefit, wrap this logic in a reusable function. It accepts any supported dataframe type, and Narwhals handles the rest:
def fare_summary(df_native):
    df = nw.from_native(df_native)
    return (
        df.group_by("payment_type")
        .agg(
            nw.col("fare_amount").sum().alias("total_fare"),
            nw.col("fare_amount").mean().alias("avg_fare"),
            nw.col("fare_amount").count().alias("trip_count"),
        )
        .sort("trip_count", descending=True)
        .to_native()
    )

Now the same function works with pandas, Polars, and DuckDB:
print(fare_summary(df_pd))

payment_type total_fare avg_fare trip_count
0 1 3.704733e+07 16.689498 2219230
1 2 7.352498e+06 15.411498 477083
2 0 1.396918e+06 19.569349 71382
3 4 1.280650e+05 15.671294 8173
4 3 1.108880e+04 12.906526 859

import polars as pl

df_pl = pl.read_parquet("yellow_tripdata_2024-01.parquet")
print(fare_summary(df_pl))

shape: (5, 4)
┌──────────────┬────────────┬───────────┬────────────┐
│ payment_type ┆ total_fare ┆ avg_fare ┆ trip_count │
│ ---          ┆ ---        ┆ ---       ┆ ---        │
│ i64 ┆ f64 ┆ f64 ┆ u32 │
╞══════════════╪════════════╪═══════════╪════════════╡
│ 1 ┆ 4.3036e7 ┆ 18.557432 ┆ 2319046 │
│ 2 ┆ 7.8466e6 ┆ 17.866037 ┆ 439191 │
│ 0 ┆ 2.8055e6 ┆ 20.016194 ┆ 140162 │
│ 4 ┆ 62243.19 ┆ 1.334889 ┆ 46628 │
│ 3 ┆ 132330.09 ┆ 6.752569 ┆ 19597 │
└──────────────┴────────────┴───────────┴────────────┘

import duckdb

df_duck = duckdb.sql("SELECT * FROM 'yellow_tripdata_2024-01.parquet'")
print(fare_summary(df_duck))

┌──────────────┬────────────────────┬────────────────────┬────────────┐
│ payment_type │ total_fare │ avg_fare │ trip_count │
│ int64 │ double │ double │ int64 │
├──────────────┼────────────────────┼────────────────────┼────────────┤
│ 1 │ 43035538.92000025 │ 18.557432202724847 │ 2319046 │
│ 2 │ 7846602.7900001 │ 17.86603730495411 │ 439191 │
│ 0 │ 2805509.7700004894 │ 20.016193904200065 │ 140162 │
│ 4 │ 62243.19000000006 │ 1.3348886934888922 │ 46628 │
│ 3 │ 132330.08999999985 │ 6.752568760524563 │ 19597 │
└──────────────┴────────────────────┴────────────────────┴────────────┘

Notice that the output type always matches the input. This is what makes Narwhals practical for library authors: users keep working with their preferred dataframe library, and your code stays the same.

For a detailed comparison of pandas, Polars, and DuckDB themselves, see pandas vs Polars vs DuckDB: A Data Scientist’s Guide.

The @narwhalify Decorator
For simpler cases, the @nw.narwhalify decorator handles the from_native/to_native boilerplate:
@nw.narwhalify
def high_fare_filter(df, threshold: float = 20.0):
    return (
        df.filter(nw.col("fare_amount") > threshold)
        .select("payment_type", "fare_amount", "trip_distance", "tip_amount")
        .sort("fare_amount", descending=True)
        .head(5)
    )

print(high_fare_filter(df_pd, threshold=50.0))

payment_type fare_amount trip_distance tip_amount
0 2 401200.0 42.39 0.00
1 1 398.0 39.49 50.00
2 1 397.5 42.83 50.00
3 1 384.5 45.33 77.60
4 1 363.0 32.43 73.30

Real-World Adoption
Narwhals has seen wide adoption across the Python ecosystem. Over 25 libraries use it, including:

Visualization: Altair, Plotly, Bokeh
ML: scikit-lego, fairlearn
Interactive: marimo
Forecasting: darts, hierarchicalforecast

These libraries accept any dataframe type from users because Narwhals handles the compatibility layer with zero additional dependencies.
Fugue: Keep Your Code, Swap the Engine
Fugue focuses on scaling existing code to distributed engines:

What it does: Ships your pandas functions to Spark, Dask, or Ray without rewriting them
How it works: Uses type annotations to infer input/output schemas, then translates execution to the target engine
Who it’s for: Data engineers who already have pandas pipelines and need to scale them

In other words, your existing pandas code runs as-is on distributed engines.

To install Fugue, run:
pip install fugue

This article uses fugue v0.9.6.
The transform() Pattern
With Fugue, you begin with a regular pandas function. No Fugue imports or special decorators needed. The only requirement is type annotations, which tell Fugue how to handle the data when it runs on a different engine:
import pandas as pd

def fare_summary(df: pd.DataFrame) -> pd.DataFrame:
    return (
        df.groupby("payment_type")
        .agg(
            total_fare=("fare_amount", "sum"),
            avg_fare=("fare_amount", "mean"),
            trip_count=("fare_amount", "count"),
        )
        .reset_index()
        .sort_values("trip_count", ascending=False)
    )

Since this is plain pandas, you can test it locally as you normally would:
input_df = pd.read_parquet("yellow_tripdata_2024-01.parquet")
result = fare_summary(input_df)
print(result)

payment_type total_fare avg_fare trip_count
1 1 43035538.92 18.557432 2319046
2 2 7846602.79 17.866037 439191
0 0 2805509.77 20.016194 140162
4 4 62243.19 1.334889 46628
3 3 132330.09 6.752569 19597

To scale this to Spark, pass the function through Fugue’s transform() with a different engine:
from fugue import transform

# Scale to Spark: same function, different engine
result_spark = transform(
    input_df,
    fare_summary,
    schema="payment_type:int,total_fare:double,avg_fare:double,trip_count:long",
    engine="spark",
)

result_spark.show()

+------------+----------+--------+----------+
|payment_type|total_fare|avg_fare|trip_count|
+------------+----------+--------+----------+
|           1|  43035539|   18.56|   2319046|
|           2|   7846603|   17.87|    439191|
|           0|   2805510|   20.02|    140162|
|           4|     62243|    1.33|     46628|
|           3|    132330|    6.75|     19597|
+------------+----------+--------+----------+

The schema parameter is required for distributed engines because frameworks like Spark need to know column types before execution. This is a constraint of distributed computing, not Fugue itself.
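The schema string itself is a simple comma-separated `name:type` grammar. As a rough illustration only (this is not Fugue's actual parser, which lives in the triad library and also handles nested types), the mapping from string to column/type pairs looks like:

```python
# Hypothetical mini-parser for a Fugue-style schema string, for illustration;
# Fugue's real parser (in the triad library) also handles nested/struct types.
def parse_schema(schema: str) -> list[tuple[str, str]]:
    pairs = []
    for field in schema.split(","):
        name, dtype = field.split(":")
        pairs.append((name.strip(), dtype.strip()))
    return pairs

fields = parse_schema("payment_type:int,total_fare:double,avg_fare:double,trip_count:long")
print(fields[0])  # → ('payment_type', 'int')
```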
Scaling to DuckDB
For local speedups without distributed infrastructure, Fugue also supports DuckDB as an engine:
result_duck = transform(
    input_df,
    fare_summary,
    schema="payment_type:int,total_fare:double,avg_fare:double,trip_count:long",
    engine="duckdb",
)
print(result_duck)

payment_type total_fare avg_fare trip_count
0 1 43035538.92 18.557432 2319046
1 2 7846602.79 17.866037 439191
2 0 2805509.77 20.016194 140162
3 4 62243.19 1.334889 46628
4 3 132330.09 6.752569 19597

Notice that the function never changes. Fugue handles the conversion between pandas and each engine automatically.
FugueSQL
Fugue also includes FugueSQL, which lets you write SQL that calls Python functions. To use it, install Fugue with the SQL extra:
pip install 'fugue[sql]'

This is useful for teams that prefer SQL but need custom transformations:
from fugue.api import fugue_sql

result = fugue_sql("""
    SELECT payment_type, fare_amount, trip_distance
    FROM input_df
    WHERE fare_amount > 50
    ORDER BY fare_amount DESC
    LIMIT 5
""")
print(result)

payment_type fare_amount trip_distance
1714869 3 5000.0 0.0
1714870 3 5000.0 0.0
1714871 3 2500.0 0.0
1714873 1 2500.0 0.0
2084560 1 2500.0 0.0

Summary

| Tool | Choose when you need to… | Strengths |
|------|--------------------------|-----------|
| Ibis | Query SQL databases, deploy from local to cloud, or want one API across 22+ backends | Compiles to native SQL; DuckDB locally, BigQuery/Snowflake in production |
| Narwhals | Build a library that accepts any dataframe type | Zero dependencies, negligible overhead, battle-tested in 25+ libraries |
| Fugue | Scale existing pandas code to Spark, Dask, or Ray, or mix SQL with Python | Keep your functions unchanged, just swap the engine |

The key distinction is the type of backend each tool targets: SQL databases (Ibis), dataframe libraries (Narwhals), or distributed engines (Fugue). Once you know which backend type you need, the choice becomes straightforward.
Related Tutorials

Cloud Scaling: Coiled: Scale Python Data Pipeline to the Cloud in Minutes for deploying Dask workloads to the cloud

📚 Want to go deeper? My book shows you how to build data science projects that actually make it to production. Get the book →



What’s New in pandas 3.0: Expressions, Copy-on-Write, and Faster Strings

Table of Contents

Introduction
Setup
Cleaner Column Operations with pd.col
Copy-on-Write Is Now the Default
A Dedicated String Dtype
Final Thoughts

Introduction
pandas 3.0 brings some of the most significant changes to the library in years. This article covers:

pd.col expressions: Cleaner column operations without lambdas
Copy-on-Write: Predictable copy behavior by default
PyArrow-backed strings: Faster operations and better type safety

💻 Get the Code: The complete source code and Jupyter notebook for this tutorial are available on GitHub. Clone it to follow along!


Setup
pandas 3.0 requires Python 3.11 or higher. Install it with:
pip install --upgrade pandas

To test these features before upgrading, enable them in pandas 2.3:
pd.options.future.infer_string = True
pd.options.mode.copy_on_write = True

Cleaner Column Operations with pd.col
The Traditional Approaches
If you’ve ever had to modify an existing column or create a new one, you may be used to one of these approaches.
Square-bracket notation is the most common way to add a column. You reference the new column name and assign the result:
import pandas as pd

df = pd.DataFrame({"temp_c": [0, 20, 30, 100]})
df['temp_f'] = df['temp_c'] * 9/5 + 32
df

   temp_c  temp_f
0       0    32.0
1      20    68.0
2      30    86.0
3     100   212.0

This overwrites your original DataFrame, which means you can’t compare before and after without first making a copy.
df_original = pd.DataFrame({"temp_c": [0, 20, 30]})
df_original['temp_f'] = df_original['temp_c'] * 9/5 + 32
# df_original is now modified – no way to see the original state
df_original

   temp_c  temp_f
0       0    32.0
1      20    68.0
2      30    86.0
It also doesn’t return anything, so you can’t chain it with other operations. Method-chaining lets you write df.assign(…).query(…).sort_values(…) in one expression instead of multiple separate statements.
df = pd.DataFrame({"temp_c": [0, 20, 30]})

# This doesn't work – square-bracket assignment returns None
# df['temp_f'] = df['temp_c'] * 9/5 + 32.query('temp_f > 50')

# You need separate statements instead
df['temp_f'] = df['temp_c'] * 9/5 + 32
df = df.query('temp_f > 50')
df

   temp_c  temp_f
1      20    68.0
2      30    86.0

Using assign solves the chaining problem by returning a new DataFrame instead of modifying in-place:
df = pd.DataFrame({"temp_c": [0, 20, 30, 100]})
df = (
    df.assign(temp_f=lambda x: x['temp_c'] * 9/5 + 32)
    .query('temp_f > 50')
)
df

   temp_c  temp_f
1      20    68.0
2      30    86.0
3     100   212.0

This works for chaining but relies on lambda functions. Lambda functions capture variables by reference, not by value, which can cause bugs:
df = pd.DataFrame({"x": [1, 2, 3]})
results = {}
for factor in [10, 20, 30]:
    results[f'x_times_{factor}'] = lambda df: df['x'] * factor

df = df.assign(**results)
df

   x  x_times_10  x_times_20  x_times_30
0  1          30          30          30
1  2          60          60          60
2  3          90          90          90

What went wrong: We expected x_times_10 to multiply by 10, x_times_20 by 20, and x_times_30 by 30. Instead, all three columns multiply by 30.
Why: Lambdas don’t save values, they save variable names. All three lambdas point to the same variable factor. After the loop ends, factor = 30. When assign() executes the lambdas, they all read factor and get 30.
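This is the classic Python closure pitfall rather than anything pandas-specific. The long-standing plain-Python fix is to freeze the value at definition time with a default argument:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})
results = {}
for factor in [10, 20, 30]:
    # f=factor binds the *current* value, so each lambda keeps its own copy
    results[f"x_times_{factor}"] = lambda df, f=factor: df["x"] * f

df = df.assign(**results)
print(df["x_times_10"].tolist())  # → [10, 20, 30]
```

This works, but the `f=factor` idiom is easy to forget, which is part of the motivation for a lambda-free alternative.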
The pandas 3.0 Solution: pd.col
pandas 3.0 introduces pd.col, which lets you reference columns without lambda functions. The syntax is borrowed from PySpark and Polars.
Here’s the temp_f conversion rewritten with pd.col:
df = pd.DataFrame({"temp_c": [0, 20, 30, 100]})
df = df.assign(temp_f=pd.col('temp_c') * 9/5 + 32)
df

   temp_c  temp_f
0       0    32.0
1      20    68.0
2      30    86.0
3     100   212.0

Unlike square-bracket notation, pd.col supports method-chaining. Unlike lambdas, it doesn’t capture variables by reference, so you avoid the scoping bugs shown earlier.
Remember the lambda scoping bug? With pd.col, each multiplier is captured correctly:
df = pd.DataFrame({"x": [1, 2, 3]})
results = {}
for factor in [10, 20, 30]:
    results[f'x_times_{factor}'] = pd.col('x') * factor

df = df.assign(**results)
df

   x  x_times_10  x_times_20  x_times_30
0  1          10          20          30
1  2          20          40          60
2  3          30          60          90

Filtering with Expressions
Traditional filtering repeats df twice:
df = pd.DataFrame({"temp_c": [-10, 0, 15, 25, 30]})
df = df.loc[df['temp_c'] >= 0] # df appears twice
df

   temp_c
1       0
2      15
3      25
4      30

With pd.col, you reference the column directly:
df = pd.DataFrame({"temp_c": [-10, 0, 15, 25, 30]})
df = df.loc[pd.col('temp_c') >= 0] # cleaner
df

   temp_c
1       0
2      15
3      25
4      30

Combining Multiple Columns
With lambdas, you need to repeat lambda x: x[…] for every column:
df = pd.DataFrame({
    "price": [100, 200, 150],
    "quantity": [2, 3, 4]
})

df = df.assign(
    total=lambda x: x["price"] * x["quantity"],
    discounted=lambda x: x["price"] * x["quantity"] * 0.9
)
df

   price  quantity  total  discounted
0    100         2    200       180.0
1    200         3    600       540.0
2    150         4    600       540.0

With pd.col, the same logic is more readable:
df = pd.DataFrame({
    "price": [100, 200, 150],
    "quantity": [2, 3, 4]
})

df = df.assign(
    total=pd.col("price") * pd.col("quantity"),
    discounted=pd.col("price") * pd.col("quantity") * 0.9
)
df

   price  quantity  total  discounted
0    100         2    200       180.0
1    200         3    600       540.0
2    150         4    600       540.0

Note that, unlike Polars and PySpark, pd.col cannot yet be used in groupby operations:
# This works in Polars: df.group_by("category").agg(pl.col("value").mean())
# But this doesn't work in pandas 3.0:
df.groupby("category").agg(pd.col("value").mean()) # Not supported yet

This limitation may be removed in future versions.
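In the meantime, named aggregation (a long-standing pandas feature) covers the groupby case without lambdas. The `category`/`value` columns below are illustrative example data, not from the taxi dataset:

```python
import pandas as pd

# Illustrative example data
df = pd.DataFrame({"category": ["a", "a", "b"], "value": [1.0, 3.0, 5.0]})

# Named aggregation takes (column, function) tuples instead of expressions
summary = df.groupby("category").agg(value_mean=("value", "mean"))
print(summary["value_mean"].tolist())  # → [2.0, 5.0]
```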
Copy-on-Write Is Now the Default
If you’ve used pandas, you’ve probably seen the SettingWithCopyWarning at some point. It appears when pandas can’t tell if you’re modifying a view or a copy of your data:
# This pattern caused confusion in pandas < 3.0
df2 = df[df["value"] > 10]
df2["status"] = "high" # SettingWithCopyWarning!

Did this modify df or just df2? The answer depends on whether df2 is a view or a copy, and pandas can’t always predict which one it created. That’s what the warning is telling you.
pandas 3.0 makes the answer simple: filtering with df[…] always returns a copy. Modifying df2 never affects df.
This is called Copy-on-Write (CoW). If you just read df2, pandas shares memory with df. Only when you change df2 does pandas create a separate copy.
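The idea can be sketched with a toy container that shares its backing list for reads and copies only on write (purely illustrative; pandas tracks references between its internal blocks rather than doing this literally):

```python
class CowList:
    """Toy copy-on-write list: views share storage until they are written to."""

    def __init__(self, data):
        self._data = data              # possibly shared with other views

    def view(self):
        return CowList(self._data)     # reads share memory with the parent

    def __getitem__(self, i):
        return self._data[i]           # reading never copies

    def __setitem__(self, i, value):
        self._data = list(self._data)  # on write: detach with a private copy
        self._data[i] = value

parent = CowList([1, 2, 3])
child = parent.view()
child[0] = 99
print(parent[0], child[0])  # → 1 99  (the parent is untouched)
```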
Now when you filter and modify, there’s no warning and no uncertainty:
df = pd.DataFrame({"value": [5, 15, 25], "status": ["low", "low", "low"]})

# pandas 3.0: just works, no warning
df2 = df[df["value"] > 10]
df2["status"] = "high" # Modifies df2 only, not df

df2

   value status
1     15   high
2     25   high

df

   value status
0      5    low
1     15    low
2     25    low

We can see that df is unchanged and no warning was raised.
Breaking Change: Chained Assignment
One pattern that breaks is chained assignment. With CoW, df["foo"] is a copy, so assigning to it only modifies the copy and doesn’t modify the original:
# This NO LONGER modifies df in pandas 3.0:
df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 6, 8]})

df["foo"][df["bar"] > 5] = 100
df

   foo  bar
0    1    4
1    2    6
2    3    8

Notice foo still contains [1, 2, 3]. This is because the value 100 was assigned to a copy that was immediately discarded.
Use .loc instead to modify the original DataFrame:
df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 6, 8]})
df.loc[df["bar"] > 5, "foo"] = 100
df

   foo  bar
0    1    4
1  100    6
2  100    8

A Dedicated String Dtype
pandas 2.x stores strings as object dtype, which is both slow and ambiguous. You can’t tell from the dtype alone whether a column is purely strings:
pd.options.future.infer_string = False # pandas 2.x behavior

text = pd.Series(["hello", "world"])
messy = pd.Series(["hello", 42, {"key": "value"}])

print(f"text dtype: {text.dtype}")
print(f"messy dtype: {messy.dtype}")

text dtype: object
messy dtype: object

pandas 3.0 introduces a dedicated str dtype that only holds strings, making the type immediately clear:
pd.options.future.infer_string = True # pandas 3.0 behavior

ser = pd.Series(["a", "b", "c"])
print(f"dtype: {ser.dtype}")

dtype: str

Performance Gains
The new string dtype is backed by PyArrow (if installed), which provides significant performance improvements:

String operations run 5-10x faster because PyArrow processes data in contiguous memory blocks instead of individual Python objects
Memory usage reduced by up to 50% since strings are stored in a compact binary format rather than as Python objects with overhead
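The memory figure is easy to sanity-check with the standard library. This back-of-the-envelope comparison ignores Arrow's validity bitmap and the pointer array that object dtype also carries, but it shows the per-string object overhead:

```python
import sys

words = [f"trip_{i}" for i in range(1_000)]  # 1,000 distinct short strings

# Roughly what Arrow stores: raw UTF-8 bytes plus a 4-byte offset per string
arrow_like = sum(len(w.encode("utf-8")) for w in words) + 4 * len(words)

# What object dtype stores: a full Python str object per element
object_like = sum(sys.getsizeof(w) for w in words)

print(f"Arrow-style: ~{arrow_like:,} bytes, object dtype: ~{object_like:,} bytes")
```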

Arrow Ecosystem Interoperability
DataFrames can be passed to Arrow-based tools like Polars and DuckDB without copying or converting data:
import polars as pl

pandas_df = pd.DataFrame({"name": ["alice", "bob", "charlie"]})
polars_df = pl.from_pandas(pandas_df) # Zero-copy – data already in Arrow format
polars_df

shape: (3, 1)
┌─────────┐
│ name    │
│ ---     │
│ str     │
╞═════════╡
│ alice   │
│ bob     │
│ charlie │
└─────────┘

Final Thoughts
pandas 3.0 brings meaningful improvements to your daily workflow:

Write cleaner code with pd.col expressions instead of lambdas
Avoid SettingWithCopyWarning confusion with Copy-on-Write as the default
Get 5-10x faster string operations with the new PyArrow-backed str dtype
Pass DataFrames to Polars and DuckDB without data conversion

Related Resources
For more on DataFrame tools and performance optimization:

Polars vs. Pandas: A Fast, Multi-Core Alternative for DataFrames – Compare pandas with Polars for performance-critical workflows
Scaling Pandas Workflows with PySpark’s Pandas API – Use familiar pandas syntax on distributed data
pandas vs Polars vs DuckDB: A Data Scientist’s Guide – Choose the right tool for your data analysis needs

💡 The expressions section was inspired by a blog post contributed by Marco Gorelli, Senior Software Engineer at Quansight Labs.

What’s New in pandas 3.0: Expressions, Copy-on-Write, and Faster Strings

pandas vs Polars vs DuckDB: A Data Scientist’s Guide to Choosing the Right Tool

Table of Contents

Introduction
Tool Strengths at a Glance
Setup
Syntax Comparison
Data Loading Performance
Query Optimization
GroupBy Performance
Memory Efficiency
Join Operations
Interoperability
Decision Matrix
Final Thoughts

Introduction
pandas has been the standard tool for working with tabular data in Python for over a decade. But as datasets grow larger and performance requirements increase, two modern alternatives have emerged: Polars, a DataFrame library written in Rust, and DuckDB, an embedded SQL database optimized for analytics.
Each tool excels in different scenarios:

| Tool | Backend | Execution Model | Best For |
|---|---|---|---|
| pandas | C/Python | Eager, single-threaded | Small datasets, prototyping, ML integration |
| Polars | Rust | Lazy/Eager, multi-threaded | Large-scale analytics, data pipelines |
| DuckDB | C++ | SQL-first, multi-threaded | SQL workflows, embedded analytics, file queries |

This guide compares all three tools with practical examples, helping you choose the right one for your workflow.

💻 Get the Code: The complete source code and Jupyter notebook for this tutorial are available on GitHub. Clone it to follow along!


Tool Strengths at a Glance
pandas
pandas is the original DataFrame library for Python that excels at interactive data exploration and integrates seamlessly with the ML ecosystem. Key capabilities include:

Direct compatibility with scikit-learn, statsmodels, and visualization libraries
Rich ecosystem of extensions (pandas-profiling, pandasql, etc.)
Mature time series functionality
Familiar syntax that most data scientists already know

Polars
Polars is a Rust-powered DataFrame library designed for speed that brings multi-threaded execution and query optimization to Python. Key capabilities include:

Speeds up operations by using all available CPU cores by default
Builds a query plan first, then executes only what’s needed
Streaming mode for processing datasets larger than RAM
Expressive method chaining with a pandas-like API

DuckDB
DuckDB is an embedded SQL database optimized for analytics that brings database-level query optimization to local files. Key capabilities include:

Native SQL syntax with full analytical query support
Queries CSV, Parquet, and JSON files directly without loading
Uses disk storage automatically when data exceeds available memory
Zero-configuration embedded database requiring no server setup

Setup
Install all three libraries:
pip install pandas polars duckdb

Generate sample data for benchmarking:
import pandas as pd
import numpy as np

np.random.seed(42)
n_rows = 5_000_000

data = {
    "category": np.random.choice(["Electronics", "Clothing", "Food", "Books"], size=n_rows),
    "region": np.random.choice(["North", "South", "East", "West"], size=n_rows),
    "amount": np.random.rand(n_rows) * 1000,
    "quantity": np.random.randint(1, 100, size=n_rows),
}

df_pandas = pd.DataFrame(data)
df_pandas.to_csv("sales_data.csv", index=False)
print(f"Created sales_data.csv with {n_rows:,} rows")

Created sales_data.csv with 5,000,000 rows

Syntax Comparison
All three tools can perform the same operations with different syntax. Here’s a side-by-side comparison of common tasks.
Filtering Rows
pandas:
Uses bracket notation with boolean conditions, which requires repeating the DataFrame name for each condition:
import pandas as pd

df_pd = pd.read_csv("sales_data.csv")
result_bracket = df_pd[(df_pd["amount"] > 500) & (df_pd["category"] == "Electronics")]
result_bracket.head()

| | category | region | amount | quantity |
|---|---|---|---|---|
| 7 | Electronics | West | 662.803066 | 80 |
| 15 | Electronics | North | 826.004963 | 25 |
| 30 | Electronics | North | 766.081832 | 7 |
| 31 | Electronics | West | 772.084261 | 36 |
| 37 | Electronics | East | 527.967145 | 35 |

Alternatively, you can use the query() method, which provides cleaner SQL-like syntax:
result_query = df_pd.query("amount > 500 and category == 'Electronics'")

However, since query() is string-based, there’s no IDE autocomplete. Complex operations like string methods still require brackets:
result_str = df_pd[df_pd["category"].str.startswith("Elec")]
result_str.head()

| | category | region | amount | quantity |
|---|---|---|---|---|
| 2 | Electronics | North | 450.941022 | 93 |
| 6 | Electronics | West | 475.843957 | 61 |
| 7 | Electronics | West | 662.803066 | 80 |
| 15 | Electronics | North | 826.004963 | 25 |
| 21 | Electronics | South | 292.399383 | 13 |
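One mitigation: query() strings can reference local Python variables with the @ prefix, which keeps thresholds out of hard-coded strings. A small sketch on toy data (df and threshold are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({
    "amount": [100.0, 600.0, 800.0],
    "category": ["Food", "Electronics", "Books"],
})

threshold = 500
# @threshold pulls the local Python variable into the query string
result = df.query("amount > @threshold and category == 'Electronics'")
print(len(result))
```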

Polars:
Unlike pandas, Polars uses one syntax for all filters. The pl.col() expressions are type-safe with IDE autocomplete, and handle both simple comparisons and complex operations like string methods:
import polars as pl

df_pl = pl.read_csv("sales_data.csv")
result_pl = df_pl.filter(
    (pl.col("amount") > 500) & (pl.col("category") == "Electronics")
)
result_pl.head()

| category (str) | region (str) | amount (f64) | quantity (i64) |
|---|---|---|---|
| "Electronics" | "West" | 662.803066 | 80 |
| "Electronics" | "North" | 826.004963 | 25 |
| "Electronics" | "North" | 766.081832 | 7 |
| "Electronics" | "West" | 772.084261 | 36 |
| "Electronics" | "East" | 527.967145 | 35 |

DuckDB:
Uses standard SQL with a WHERE clause, which reads naturally for anyone who knows SQL.
import duckdb

result_duckdb = duckdb.sql("""
    SELECT * FROM 'sales_data.csv'
    WHERE amount > 500 AND category = 'Electronics'
""").df()
result_duckdb.head()

| | category | region | amount | quantity |
|---|---|---|---|---|
| 0 | Electronics | West | 662.803066 | 80 |
| 1 | Electronics | North | 826.004963 | 25 |
| 2 | Electronics | North | 766.081832 | 7 |
| 3 | Electronics | West | 772.084261 | 36 |
| 4 | Electronics | East | 527.967145 | 35 |

Selecting Columns
pandas:
Double brackets return a DataFrame with selected columns.
result_pd = df_pd[["category", "amount"]]
result_pd.head()

| | category | amount |
|---|---|---|
| 0 | Food | 516.653322 |
| 1 | Books | 937.337226 |
| 2 | Electronics | 450.941022 |
| 3 | Food | 674.488081 |
| 4 | Food | 188.847906 |

Polars:
The select() method clearly communicates column selection intent.
result_pl = df_pl.select(["category", "amount"])
result_pl.head()

| category (str) | amount (f64) |
|---|---|
| "Food" | 516.653322 |
| "Books" | 937.337226 |
| "Electronics" | 450.941022 |
| "Food" | 674.488081 |
| "Food" | 188.847906 |

DuckDB:
SQL’s SELECT clause makes column selection intuitive for SQL users.
result_duckdb = duckdb.sql("""
    SELECT category, amount FROM 'sales_data.csv'
""").df()
result_duckdb.head()

| | category | amount |
|---|---|---|
| 0 | Food | 516.653322 |
| 1 | Books | 937.337226 |
| 2 | Electronics | 450.941022 |
| 3 | Food | 674.488081 |
| 4 | Food | 188.847906 |

GroupBy Aggregation
pandas:
Uses a dictionary to specify aggregations, but returns multi-level column headers that often require flattening before further use.
result_pd = df_pd.groupby("category").agg({
    "amount": ["sum", "mean"],
    "quantity": "sum"
})
result_pd.head()

| category | amount (sum) | amount (mean) | quantity (sum) |
|---|---|---|---|
| Books | 6.247506e+08 | 499.998897 | 62463285 |
| Clothing | 6.253924e+08 | 500.139837 | 62505224 |
| Electronics | 6.244453e+08 | 499.938189 | 62484265 |
| Food | 6.254034e+08 | 499.916417 | 62577943 |
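pandas can also avoid the multi-level header with named aggregation (keyword-argument agg), which yields flat column names directly. A sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["Books", "Books", "Food"],
    "amount": [10.0, 20.0, 30.0],
    "quantity": [1, 2, 3],
})

# Named aggregation: new_column=(source_column, agg_func)
result = df.groupby("category").agg(
    amount_sum=("amount", "sum"),
    amount_mean=("amount", "mean"),
    quantity_sum=("quantity", "sum"),
).reset_index()
print(result.columns.tolist())  # ['category', 'amount_sum', 'amount_mean', 'quantity_sum']
```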

Polars:
Uses explicit alias() calls for each aggregation, producing flat column names directly without post-processing.
result_pl = df_pl.group_by("category").agg([
    pl.col("amount").sum().alias("amount_sum"),
    pl.col("amount").mean().alias("amount_mean"),
    pl.col("quantity").sum().alias("quantity_sum"),
])
result_pl.head()

| category (str) | amount_sum (f64) | amount_mean (f64) | quantity_sum (i64) |
|---|---|---|---|
| "Clothing" | 6.2539e8 | 500.139837 | 62505224 |
| "Books" | 6.2475e8 | 499.998897 | 62463285 |
| "Electronics" | 6.2445e8 | 499.938189 | 62484265 |
| "Food" | 6.2540e8 | 499.916417 | 62577943 |

DuckDB:
Standard SQL aggregation with column aliases produces clean, flat output ready for downstream use.
result_duckdb = duckdb.sql("""
    SELECT
        category,
        SUM(amount) as amount_sum,
        AVG(amount) as amount_mean,
        SUM(quantity) as quantity_sum
    FROM 'sales_data.csv'
    GROUP BY category
""").df()
result_duckdb.head()

| | category | amount_sum | amount_mean | quantity_sum |
|---|---|---|---|---|
| 0 | Food | 6.254034e+08 | 499.916417 | 62577943.0 |
| 1 | Electronics | 6.244453e+08 | 499.938189 | 62484265.0 |
| 2 | Clothing | 6.253924e+08 | 500.139837 | 62505224.0 |
| 3 | Books | 6.247506e+08 | 499.998897 | 62463285.0 |

Adding Columns
pandas:
The assign() method creates new columns with repeated DataFrame references like df_pd["amount"].
result_pd = df_pd.assign(
    amount_with_tax=df_pd["amount"] * 1.1,
    high_value=df_pd["amount"] > 500
)
result_pd.head()

| | category | region | amount | quantity | amount_with_tax | high_value |
|---|---|---|---|---|---|---|
| 0 | Food | South | 516.653322 | 40 | 568.318654 | True |
| 1 | Books | East | 937.337226 | 45 | 1031.070948 | True |
| 2 | Electronics | North | 450.941022 | 93 | 496.035124 | False |
| 3 | Food | East | 674.488081 | 46 | 741.936889 | True |
| 4 | Food | East | 188.847906 | 98 | 207.732697 | False |

Polars:
The with_columns() method uses composable expressions that chain naturally without repeating the DataFrame name.
result_pl = df_pl.with_columns([
    (pl.col("amount") * 1.1).alias("amount_with_tax"),
    (pl.col("amount") > 500).alias("high_value")
])
result_pl.head()

| category (str) | region (str) | amount (f64) | quantity (i64) | amount_with_tax (f64) | high_value (bool) |
|---|---|---|---|---|---|
| "Food" | "South" | 516.653322 | 40 | 568.318654 | true |
| "Books" | "East" | 937.337226 | 45 | 1031.070948 | true |
| "Electronics" | "North" | 450.941022 | 93 | 496.035124 | false |
| "Food" | "East" | 674.488081 | 46 | 741.936889 | true |
| "Food" | "East" | 188.847906 | 98 | 207.732697 | false |

DuckDB:
SQL’s SELECT clause defines new columns directly in the query, keeping transformations readable.
result_duckdb = duckdb.sql("""
    SELECT *,
        amount * 1.1 as amount_with_tax,
        amount > 500 as high_value
    FROM df_pd
""").df()
result_duckdb.head()

| | category | region | amount | quantity | amount_with_tax | high_value |
|---|---|---|---|---|---|---|
| 0 | Food | South | 516.653322 | 40 | 568.318654 | True |
| 1 | Books | East | 937.337226 | 45 | 1031.070948 | True |
| 2 | Electronics | North | 450.941022 | 93 | 496.035124 | False |
| 3 | Food | East | 674.488081 | 46 | 741.936889 | True |
| 4 | Food | East | 188.847906 | 98 | 207.732697 | False |

Conditional Logic
pandas:
Each additional condition in np.where() adds another nesting level. With three tiers, the final value is buried two levels deep:
import numpy as np

# Hard to read: "low" is nested inside two np.where() calls
result_np = df_pd.assign(
    value_tier=np.where(
        df_pd["amount"] > 700, "high",
        np.where(df_pd["amount"] > 300, "medium", "low")
    )
)
result_np[["category", "amount", "value_tier"]].head()

| | category | amount | value_tier |
|---|---|---|---|
| 0 | Food | 516.653322 | medium |
| 1 | Books | 937.337226 | high |
| 2 | Electronics | 450.941022 | medium |
| 3 | Food | 674.488081 | medium |
| 4 | Food | 188.847906 | low |

For numeric binning, pd.cut() is cleaner:
result_pd = df_pd.assign(
    value_tier=pd.cut(
        df_pd["amount"],
        bins=[-np.inf, 300, 700, np.inf],
        labels=["low", "medium", "high"]
    )
)
result_pd[["category", "amount", "value_tier"]].head()

| | category | amount | value_tier |
|---|---|---|---|
| 0 | Food | 516.653322 | medium |
| 1 | Books | 937.337226 | high |
| 2 | Electronics | 450.941022 | medium |
| 3 | Food | 674.488081 | medium |
| 4 | Food | 188.847906 | low |

However, pd.cut() has drawbacks:

Only works for numeric ranges
Requires thinking in boundaries ([-inf, 300, 700, inf]) instead of conditions (amount > 700)
Needs numpy for open-ended bins

For non-numeric or mixed conditions, you’re back to np.where():
# "premium" if Electronics AND amount > 500 – pd.cut() can't do this
result = df_pd.assign(
    tier=np.where(
        (df_pd["category"] == "Electronics") & (df_pd["amount"] > 500),
        "premium", "standard"
    )
)
result.head()

| | category | region | amount | quantity | tier |
|---|---|---|---|---|---|
| 0 | Food | South | 516.653322 | 40 | standard |
| 1 | Books | East | 937.337226 | 45 | standard |
| 2 | Electronics | North | 450.941022 | 93 | standard |
| 3 | Food | East | 674.488081 | 46 | standard |
| 4 | Food | East | 188.847906 | 98 | standard |
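A middle ground in pandas is np.select, which takes parallel lists of conditions and choices, so it handles multi-tier and mixed conditions without nesting. A sketch on toy data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"amount": [516.65, 937.34, 188.85]})

# Conditions are checked in order; the first match wins
conditions = [df["amount"] > 700, df["amount"] > 300]
choices = ["high", "medium"]
df["value_tier"] = np.select(conditions, choices, default="low")
print(df["value_tier"].tolist())  # ['medium', 'high', 'low']
```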

Polars:
The when().then().otherwise() chain solves both pandas problems: no nesting like np.where(), and works for any condition (not just numeric ranges like pd.cut()). The same syntax handles simple binning and complex mixed conditions:
result_pl = df_pl.with_columns(
    pl.when(pl.col("amount") > 700).then(pl.lit("high"))
    .when(pl.col("amount") > 300).then(pl.lit("medium"))
    .otherwise(pl.lit("low"))
    .alias("value_tier")
)
result_pl.select(["category", "amount", "value_tier"]).head()

| category (str) | amount (f64) | value_tier (str) |
|---|---|---|
| "Food" | 516.653322 | "medium" |
| "Books" | 937.337226 | "high" |
| "Electronics" | 450.941022 | "medium" |
| "Food" | 674.488081 | "medium" |
| "Food" | 188.847906 | "low" |

DuckDB:
Standard SQL CASE WHEN syntax reads naturally for anyone who knows SQL.
result_duckdb = duckdb.sql("""
    SELECT category, amount,
        CASE
            WHEN amount > 700 THEN 'high'
            WHEN amount > 300 THEN 'medium'
            ELSE 'low'
        END as value_tier
    FROM df_pd
""").df()
result_duckdb.head()

| | category | amount | value_tier |
|---|---|---|---|
| 0 | Food | 516.653322 | medium |
| 1 | Books | 937.337226 | high |
| 2 | Electronics | 450.941022 | medium |
| 3 | Food | 674.488081 | medium |
| 4 | Food | 188.847906 | low |

Window Functions
pandas:
Uses groupby().transform() which requires repeating the groupby clause for each calculation.
result_pd = df_pd.assign(
    category_avg=df_pd.groupby("category")["amount"].transform("mean"),
    category_rank=df_pd.groupby("category")["amount"].rank(ascending=False)
)
result_pd[["category", "amount", "category_avg", "category_rank"]].head()

| | category | amount | category_avg | category_rank |
|---|---|---|---|---|
| 0 | Food | 516.653322 | 499.916417 | 604342.0 |
| 1 | Books | 937.337226 | 499.998897 | 78423.0 |
| 2 | Electronics | 450.941022 | 499.938189 | 685881.0 |
| 3 | Food | 674.488081 | 499.916417 | 407088.0 |
| 4 | Food | 188.847906 | 499.916417 | 1015211.0 |
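One way to reduce the repetition in pandas is to bind the grouped column once and reuse it for each window calculation. A sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["A", "A", "B"],
    "amount": [10.0, 20.0, 30.0],
})

# Bind the grouper once, then reuse it for each window calc
g = df.groupby("category")["amount"]
df = df.assign(
    category_avg=g.transform("mean"),
    category_rank=g.rank(ascending=False),
)
print(df["category_avg"].tolist())  # [15.0, 15.0, 30.0]
```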

Polars:
The over() expression appends the partition to any calculation, avoiding repeated group definitions.
result_pl = df_pl.with_columns([
    pl.col("amount").mean().over("category").alias("category_avg"),
    pl.col("amount").rank(descending=True).over("category").alias("category_rank")
])
result_pl.select(["category", "amount", "category_avg", "category_rank"]).head()

| category (str) | amount (f64) | category_avg (f64) | category_rank (f64) |
|---|---|---|---|
| "Food" | 516.653322 | 499.916417 | 604342.0 |
| "Books" | 937.337226 | 499.998897 | 78423.0 |
| "Electronics" | 450.941022 | 499.938189 | 685881.0 |
| "Food" | 674.488081 | 499.916417 | 407088.0 |
| "Food" | 188.847906 | 499.916417 | 1015211.0 |

DuckDB:
SQL window functions with OVER (PARTITION BY …) are the industry standard for this type of calculation.
result_duckdb = duckdb.sql("""
    SELECT category, amount,
        AVG(amount) OVER (PARTITION BY category) as category_avg,
        RANK() OVER (PARTITION BY category ORDER BY amount DESC) as category_rank
    FROM df_pd
""").df()
result_duckdb.head()

| | category | amount | category_avg | category_rank |
|---|---|---|---|---|
| 0 | Clothing | 513.807166 | 500.139837 | 608257 |
| 1 | Clothing | 513.806596 | 500.139837 | 608258 |
| 2 | Clothing | 513.806515 | 500.139837 | 608259 |
| 3 | Clothing | 513.806063 | 500.139837 | 608260 |
| 4 | Clothing | 513.806056 | 500.139837 | 608261 |

Data Loading Performance
pandas reads CSV files on a single CPU core. Polars and DuckDB use multi-threaded execution, distributing the work across all available cores to read different parts of the file simultaneously.
pandas
Single-threaded CSV parsing loads data sequentially.
┌─────────────────────────────────────────────┐
│ CPU Core 1 │
│ ┌─────────────────────────────────────────┐ │
│ │ Chunk 1 → Chunk 2 → Chunk 3 → … → End │ │
│ └─────────────────────────────────────────┘ │
│ CPU Core 2 [idle] │
│ CPU Core 3 [idle] │
│ CPU Core 4 [idle] │
└─────────────────────────────────────────────┘

pandas_time = %timeit -o pd.read_csv("sales_data.csv")

1.05 s ± 26.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Polars
Multi-threaded parsing distributes file reading across all available cores.
┌─────────────────────────────────────────────┐
│ CPU Core 1 ┌────────────────┐ │
│ │ ████████████ │ │
│ CPU Core 2 ┌────────────────┐ │
│ │ ████████████ │ │
│ CPU Core 3 ┌────────────────┐ │
│ │ ████████████ │ │
│ CPU Core 4 ┌────────────────┐ │
│ │ ████████████ │ │
└─────────────────────────────────────────────┘

polars_time = %timeit -o pl.read_csv("sales_data.csv")

137 ms ± 34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

DuckDB
Similar to Polars, file reading is distributed across all available cores.
┌─────────────────────────────────────────────┐
│ CPU Core 1 ┌────────────────┐ │
│ │ ████████████ │ │
│ CPU Core 2 ┌────────────────┐ │
│ │ ████████████ │ │
│ CPU Core 3 ┌────────────────┐ │
│ │ ████████████ │ │
│ CPU Core 4 ┌────────────────┐ │
│ │ ████████████ │ │
└─────────────────────────────────────────────┘

duckdb_time = %timeit -o duckdb.sql("SELECT * FROM 'sales_data.csv'").df()

762 ms ± 77.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

print(f"Polars is {pandas_time.average / polars_time.average:.1f}× faster than pandas")
print(f"DuckDB is {pandas_time.average / duckdb_time.average:.1f}× faster than pandas")

Polars is 7.7× faster than pandas
DuckDB is 1.4× faster than pandas

While Polars leads with a 7.7× speedup in CSV reading, DuckDB’s 1.4× improvement shows parsing isn’t its focus. DuckDB shines when querying files directly or running complex analytical queries.
Query Optimization
pandas: No Optimization
pandas executes operations eagerly, creating intermediate DataFrames at each step. This wastes memory and prevents optimization.
┌─────────────────────────────────────────────────────────────┐
│ Step 1: Load ALL rows → 5M rows in memory │
│ Step 2: Filter (amount > 100) → ~4.5M rows in memory │
│ Step 3: GroupBy → New DataFrame │
│ Step 4: Mean → Final result │
└─────────────────────────────────────────────────────────────┘
Memory: ████████████████████████████████ (high – stores all intermediates)

def pandas_query():
    return (
        pd.read_csv("sales_data.csv")
        .query('amount > 100')
        .groupby('category')['amount']
        .mean()
    )

pandas_opt_time = %timeit -o pandas_query()

1.46 s ± 88.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This approach has three problems:

Full CSV load: All rows are read before filtering
No predicate pushdown: Rows are filtered after loading the entire file into memory
No projection pushdown: All columns are loaded, even unused ones

You can manually add usecols to load fewer columns:
def pandas_query_optimized():
    return (
        pd.read_csv("sales_data.csv", usecols=["category", "amount"])
        .query('amount > 100')
        .groupby('category')['amount']
        .mean()
    )

pandas_usecols_time = %timeit -o pandas_query_optimized()

1.06 s ± 48.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This is faster, but has two drawbacks:

Manual tracking: You must specify columns yourself; change the query, update usecols
No row filtering: All rows still load before the filter applies

Polars and DuckDB handle both automatically by analyzing your query before execution.
Polars: Lazy Evaluation
Polars supports lazy evaluation, which builds a query plan and optimizes it before execution:
┌─────────────────────────────────────────────────────────────┐
│ Query Plan Built: │
│ scan_csv → filter → group_by → agg │
│ │
│ Optimizations Applied: │
│ • Predicate pushdown (filter during scan) │
│ • Projection pushdown (read only needed columns) │
│ • Multi-threaded execution (parallel across CPU cores) │
└─────────────────────────────────────────────────────────────┘
Memory: ████████ (low – no intermediate DataFrames)

query_pl = (
    pl.scan_csv("sales_data.csv")
    .filter(pl.col("amount") > 100)
    .group_by("category")
    .agg(pl.col("amount").mean().alias("avg_amount"))
)

# View the optimized query plan
print(query_pl.explain())

AGGREGATE[maintain_order: false]
[col("amount").mean().alias("avg_amount")] BY [col("category")]
FROM
Csv SCAN [sales_data.csv] [id: 4687118704]
PROJECT 2/4 COLUMNS
SELECTION: [(col("amount")) > (100.0)]

The query plan shows these optimizations:

Predicate pushdown: SELECTION filters during scan, not after loading
Projection pushdown: PROJECT 2/4 COLUMNS reads only what’s needed
Operation reordering: Aggregate runs on filtered data, not the full dataset

Execute the optimized query:
def polars_query():
    return (
        pl.scan_csv("sales_data.csv")
        .filter(pl.col("amount") > 100)
        .group_by("category")
        .agg(pl.col("amount").mean().alias("avg_amount"))
        .collect()
    )

polars_opt_time = %timeit -o polars_query()

148 ms ± 32.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

DuckDB: SQL Optimizer
DuckDB’s SQL optimizer applies similar optimizations automatically:
┌─────────────────────────────────────────────────────────────┐
│ Query Plan Built: │
│ SQL → Parser → Optimizer → Execution Plan │
│ │
│ Optimizations Applied: │
│ • Predicate pushdown (WHERE during scan) │
│ • Projection pushdown (SELECT only needed columns) │
│ • Vectorized execution (process 1024 rows per batch) │
└─────────────────────────────────────────────────────────────┘
Memory: ████████ (low – streaming execution)

def duckdb_query():
    return duckdb.sql("""
        SELECT category, AVG(amount) as avg_amount
        FROM 'sales_data.csv'
        WHERE amount > 100
        GROUP BY category
    """).df()

duckdb_opt_time = %timeit -o duckdb_query()

245 ms ± 12.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Let’s compare the performance of the optimized queries:
print(f"Polars is {pandas_opt_time.average / polars_opt_time.average:.1f}× faster than pandas")
print(f"DuckDB is {pandas_opt_time.average / duckdb_opt_time.average:.1f}× faster than pandas")

Polars is 9.9× faster than pandas
DuckDB is 6.0× faster than pandas

Polars outperforms DuckDB (9.9× vs 6.0×) in this benchmark because its Rust-based engine handles the filter-then-aggregate pattern efficiently. DuckDB’s strength lies in complex SQL queries with joins and subqueries.
GroupBy Performance
Computing aggregates requires scanning every row, a workload that scales linearly with CPU cores. This makes groupby operations the clearest test of parallel execution.
Let’s load the data for the groupby benchmarks:
# Load data for fair comparison
df_pd = pd.read_csv("sales_data.csv")
df_pl = pl.read_csv("sales_data.csv")

pandas: Single-Threaded
pandas processes groupby operations on a single CPU core, which becomes a bottleneck on large datasets.
┌─────────────────────────────────────────────────────────────┐
│ CPU Core 1 │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Group A → Group B → Group C → Group D → … → Aggregate │ │
│ └─────────────────────────────────────────────────────────┘ │
│ CPU Core 2 [idle] │
│ CPU Core 3 [idle] │
│ CPU Core 4 [idle] │
└─────────────────────────────────────────────────────────────┘

def pandas_groupby():
    return df_pd.groupby("category")["amount"].mean()

pandas_groupby_time = %timeit -o pandas_groupby()

271 ms ± 135 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Polars: Multi-Threaded
Polars splits data across cores, computes partial aggregates in parallel, then merges the results.
┌─────────────────────────────────────────────────────────────┐
│ CPU Core 1 ┌──────────────┐ │
│ │ ████████████ │ → Partial Aggregate │
│ CPU Core 2 ┌──────────────┐ │
│ │ ████████████ │ → Partial Aggregate │
│ CPU Core 3 ┌──────────────┐ │
│ │ ████████████ │ → Partial Aggregate │
│ CPU Core 4 ┌──────────────┐ │
│ │ ████████████ │ → Partial Aggregate │
│ ↓ │
│ Final Merge → Result │
└─────────────────────────────────────────────────────────────┘

def polars_groupby():
    return df_pl.group_by("category").agg(pl.col("amount").mean())

polars_groupby_time = %timeit -o polars_groupby()

31.1 ms ± 3.65 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

DuckDB: Multi-Threaded
Similar to Polars, DuckDB splits data across cores, computes partial aggregates in parallel, then merges the results.
┌─────────────────────────────────────────────────────────────┐
│ CPU Core 1 ┌──────────────┐ │
│ │ ████████████ │ → Partial Aggregate │
│ CPU Core 2 ┌──────────────┐ │
│ │ ████████████ │ → Partial Aggregate │
│ CPU Core 3 ┌──────────────┐ │
│ │ ████████████ │ → Partial Aggregate │
│ CPU Core 4 ┌──────────────┐ │
│ │ ████████████ │ → Partial Aggregate │
│ ↓ │
│ Final Merge → Result │
└─────────────────────────────────────────────────────────────┘

def duckdb_groupby():
    return duckdb.sql("""
        SELECT category, AVG(amount)
        FROM df_pd
        GROUP BY category
    """).df()

duckdb_groupby_time = %timeit -o duckdb_groupby()

29 ms ± 3.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

print(f"Polars is {pandas_groupby_time.average / polars_groupby_time.average:.1f}× faster than pandas")
print(f"DuckDB is {pandas_groupby_time.average / duckdb_groupby_time.average:.1f}× faster than pandas")

Polars is 8.7× faster than pandas
DuckDB is 9.4× faster than pandas

DuckDB and Polars perform similarly (9.4× vs 8.7×), both leveraging parallel execution. DuckDB’s slight edge comes from late materialization and vector-at-a-time pipelined execution, which avoids creating intermediate results that Polars may still materialize for some operations.
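The split → partial aggregate → merge pattern in the diagrams above can be illustrated in pure Python (a toy sketch of the idea, not how either engine is actually implemented):

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

rows = [("Books", 10.0), ("Food", 20.0), ("Books", 30.0), ("Food", 40.0)]

def partial_agg(chunk):
    # Each worker computes (sum, count) per key for its chunk
    acc = defaultdict(lambda: [0.0, 0])
    for key, value in chunk:
        acc[key][0] += value
        acc[key][1] += 1
    return acc

# Split the rows, aggregate each chunk in parallel, then merge
chunks = [rows[:2], rows[2:]]
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(partial_agg, chunks))

merged = defaultdict(lambda: [0.0, 0])
for part in partials:
    for key, (s, c) in part.items():
        merged[key][0] += s
        merged[key][1] += c

means = {key: s / c for key, (s, c) in merged.items()}
print(means)  # {'Books': 20.0, 'Food': 30.0}
```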
Memory Efficiency
pandas: Full Memory Load
pandas loads the entire dataset into RAM:
┌─────────────────────────────────────────────────────────────┐
│ RAM │
│ ┌────────────────────────────────────────────────────────┐ │
│ │████████████████████████████████████████████████████████│ │
│ │██████████████████ ALL 5M ROWS █████████████████████████│ │
│ │████████████████████████████████████████████████████████│ │
│ └────────────────────────────────────────────────────────┘ │
│ Usage: 707,495 KB (entire dataset in memory) │
└─────────────────────────────────────────────────────────────┘

df_pd_mem = pd.read_csv("sales_data.csv")
pandas_mem = df_pd_mem.memory_usage(deep=True).sum() / 1e3
print(f"pandas memory usage: {pandas_mem:,.0f} KB")

pandas memory usage: 707,495 KB

For larger-than-RAM datasets, pandas raises a MemoryError.
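A manual workaround is chunked reading with read_csv(chunksize=...), keeping only running partial aggregates in memory; this hand-rolls what Polars and DuckDB do automatically. A sketch using a tiny in-memory CSV:

```python
import io
import pandas as pd

csv_file = io.StringIO("category,amount\nA,10\nB,20\nA,30\n")

# Keep only running (sum, count) per category, never the full file
totals, counts = {}, {}
for chunk in pd.read_csv(csv_file, chunksize=2):
    agg = chunk.groupby("category")["amount"].agg(["sum", "count"])
    for cat, row in agg.iterrows():
        totals[cat] = totals.get(cat, 0) + row["sum"]
        counts[cat] = counts.get(cat, 0) + row["count"]

means = {cat: totals[cat] / counts[cat] for cat in totals}
print(means)  # {'A': 20.0, 'B': 20.0}
```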
Polars: Streaming Mode
Polars can process data in streaming mode, handling chunks without loading everything:
┌─────────────────────────────────────────────────────────────┐
│ RAM │
│ ┌────────────────────────────────────────────────────────┐ │
│ │█ │ │
│ │ (result only) │ │
│ │ │ │
│ └────────────────────────────────────────────────────────┘ │
│ Usage: 0.06 KB (streams chunks, keeps only result) │
└─────────────────────────────────────────────────────────────┘

result_pl_stream = (
    pl.scan_csv("sales_data.csv")
    .group_by("category")
    .agg(pl.col("amount").mean())
    .collect(streaming=True)
)

polars_mem = result_pl_stream.estimated_size() / 1e3
print(f"Polars result memory: {polars_mem:.2f} KB")

Polars result memory: 0.06 KB

For larger-than-RAM files, use sink_parquet instead of collect(). It writes results directly to disk as chunks are processed, never holding the full dataset in memory:
(
    pl.scan_csv("sales_data.csv")
    .filter(pl.col("amount") > 500)
    .sink_parquet("filtered_sales.parquet")
)

DuckDB: Automatic Spill-to-Disk
DuckDB automatically writes intermediate results to temporary files when data exceeds available RAM:
┌─────────────────────────────────────────────────────────────┐
│ RAM Disk (if needed) │
│ ┌──────────────────────────┐ ┌──────────────────────┐ │
│ │█ │ │░░░░░░░░░░░░░░░░░░░░░░│ │
│ │ (up to 500MB) │ → │ (overflow here) │ │
│ │ │ │ │ │
│ └──────────────────────────┘ └──────────────────────┘ │
│ Usage: 0.42 KB (spills to disk when RAM full) │
└─────────────────────────────────────────────────────────────┘

# Configure memory limit and temp directory
duckdb.sql("SET memory_limit = '500MB'")
duckdb.sql("SET temp_directory = '/tmp/duckdb_temp'")

# DuckDB handles larger-than-RAM automatically
result_duckdb_mem = duckdb.sql("""
    SELECT category, AVG(amount) as avg_amount
    FROM 'sales_data.csv'
    GROUP BY category
""").df()

duckdb_mem = result_duckdb_mem.memory_usage(deep=True).sum() / 1e3
print(f"DuckDB result memory: {duckdb_mem:.2f} KB")

DuckDB result memory: 0.42 KB

DuckDB’s out-of-core processing makes it ideal for embedded analytics where memory is limited.
print(f"pandas: {pandas_mem:,.0f} KB (full dataset)")
print(f"Polars: {polars_mem:.2f} KB (result only)")
print(f"DuckDB: {duckdb_mem:.2f} KB (result only)")
print(f"\nPolars uses {pandas_mem / polars_mem:,.0f}× less memory than pandas")
print(f"DuckDB uses {pandas_mem / duckdb_mem:,.0f}× less memory than pandas")

pandas: 707,495 KB (full dataset)
Polars: 0.06 KB (result only)
DuckDB: 0.42 KB (result only)

Polars uses 11,791,583× less memory than pandas
DuckDB uses 1,684,512× less memory than pandas

The million-fold reduction comes from streaming: Polars and DuckDB process data in chunks and only keep the 4-row result in memory, while pandas must hold all 10 million rows to compute the same aggregation.
Join Operations
Joining tables is one of the most common operations in data analysis. Let’s compare how each tool handles a left join between 1 million orders and 100K customers.
Let’s create two tables for join benchmarking:
# Create orders table (1M rows)
orders_pd = pd.DataFrame({
"order_id": range(1_000_000),
"customer_id": np.random.randint(1, 100_000, size=1_000_000),
"amount": np.random.rand(1_000_000) * 500
})

# Create customers table (100K rows)
customers_pd = pd.DataFrame({
"customer_id": range(100_000),
"region": np.random.choice(["North", "South", "East", "West"], size=100_000)
})

# Convert to Polars
orders_pl = pl.from_pandas(orders_pd)
customers_pl = pl.from_pandas(customers_pd)

pandas: Single-Threaded
pandas processes the join on a single CPU core.
┌─────────────────────────────────────────────┐
│ CPU Core 1 │
│ ┌─────────────────────────────────────────┐ │
│ │ Row 1 → Row 2 → Row 3 → … → Row 1M │ │
│ └─────────────────────────────────────────┘ │
│ CPU Core 2 [idle] │
│ CPU Core 3 [idle] │
│ CPU Core 4 [idle] │
└─────────────────────────────────────────────┘

def pandas_join():
return orders_pd.merge(customers_pd, on="customer_id", how="left")

pandas_join_time = %timeit -o pandas_join()

60.4 ms ± 6.98 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Polars: Multi-Threaded
Polars distributes the join across all available CPU cores.
┌─────────────────────────────────────────────┐
│ CPU Core 1 ┌────────────────┐ │
│ │ ████████████ │ │
│ CPU Core 2 ┌────────────────┐ │
│ │ ████████████ │ │
│ CPU Core 3 ┌────────────────┐ │
│ │ ████████████ │ │
│ CPU Core 4 ┌────────────────┐ │
│ │ ████████████ │ │
└─────────────────────────────────────────────┘

def polars_join():
return orders_pl.join(customers_pl, on="customer_id", how="left")

polars_join_time = %timeit -o polars_join()

11.8 ms ± 6.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

DuckDB: Multi-Threaded
Similar to Polars, DuckDB distributes the join across all available CPU cores.
┌─────────────────────────────────────────────┐
│ CPU Core 1 ┌────────────────┐ │
│ │ ████████████ │ │
│ CPU Core 2 ┌────────────────┐ │
│ │ ████████████ │ │
│ CPU Core 3 ┌────────────────┐ │
│ │ ████████████ │ │
│ CPU Core 4 ┌────────────────┐ │
│ │ ████████████ │ │
└─────────────────────────────────────────────┘

def duckdb_join():
return duckdb.sql("""
SELECT o.*, c.region
FROM orders_pd o
LEFT JOIN customers_pd c ON o.customer_id = c.customer_id
""").df()

duckdb_join_time = %timeit -o duckdb_join()

55.7 ms ± 1.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Let’s compare the performance of the joins:
print(f"Polars is {pandas_join_time.average / polars_join_time.average:.1f}× faster than pandas")
print(f"DuckDB is {pandas_join_time.average / duckdb_join_time.average:.1f}× faster than pandas")

Polars is 5.1× faster than pandas
DuckDB is 1.1× faster than pandas

Polars delivers a 5.1× speedup while DuckDB shows only 1.1× improvement. Both tools use multi-threading, but Polars’ join algorithm and native DataFrame output avoid the conversion overhead that DuckDB incurs when returning results via .df().
Interoperability
All three tools work together seamlessly. Use each tool for what it does best in a single pipeline.
pandas DataFrame to DuckDB
Query pandas DataFrames directly with SQL:
df = pd.DataFrame({
"product": ["A", "B", "C"],
"sales": [100, 200, 150]
})

# DuckDB queries pandas DataFrames by variable name
result = duckdb.sql("SELECT * FROM df WHERE sales > 120").df()
print(result)

  product  sales
0       B    200
1       C    150

Polars to pandas
Convert Polars DataFrames when ML libraries require pandas:
df_polars = pl.DataFrame({
"feature1": [1, 2, 3],
"feature2": [4, 5, 6],
"target": [0, 1, 0]
})

# Convert to pandas for scikit-learn
df_pandas = df_polars.to_pandas()
print(type(df_pandas))

<class 'pandas.core.frame.DataFrame'>

DuckDB to Polars
Get query results as Polars DataFrames:
result = duckdb.sql("""
SELECT category, SUM(amount) as total
FROM 'sales_data.csv'
GROUP BY category
""").pl()

print(type(result))
print(result)

<class 'polars.dataframe.frame.DataFrame'>
shape: (4, 2)
┌─────────────┬──────────┐
│ category ┆ total │
│ --- ┆ --- │
│ str ┆ f64 │
╞═════════════╪══════════╡
│ Electronics ┆ 6.2445e8 │
│ Food ┆ 6.2540e8 │
│ Clothing ┆ 6.2539e8 │
│ Books ┆ 6.2475e8 │
└─────────────┴──────────┘

Combined Pipeline Example
Each tool has a distinct strength: DuckDB optimizes SQL queries, Polars parallelizes transformations, and pandas integrates with ML libraries. Combine them in a single pipeline to leverage all three:
# Step 1: DuckDB for initial SQL query
aggregated = duckdb.sql("""
SELECT category, region,
SUM(amount) as total_amount,
COUNT(*) as order_count
FROM 'sales_data.csv'
GROUP BY category, region
""").pl()

# Step 2: Polars for additional transformations
enriched = (
aggregated
.with_columns([
(pl.col("total_amount") / pl.col("order_count")).alias("avg_order_value"),
pl.col("category").str.to_uppercase().alias("category_upper")
])
.filter(pl.col("order_count") > 100000)
)

# Step 3: Convert to pandas for visualization or ML
final_df = enriched.to_pandas()
print(final_df.head())

category region total_amount order_count avg_order_value category_upper
0 Food East 1.563586e+08 312918 499.679004 FOOD
1 Food North 1.563859e+08 312637 500.215456 FOOD
2 Clothing North 1.560532e+08 311891 500.345286 CLOTHING
3 Clothing East 1.565054e+08 312832 500.285907 CLOTHING
4 Food West 1.560994e+08 312662 499.259318 FOOD

📖 Related: For writing functions that work across pandas, Polars, and PySpark without conversion, see Unified DataFrame Functions.

Decision Matrix
No single tool wins in every scenario. Use these tables to choose the right tool for your workflow.
Performance Summary
Benchmark results from 10 million rows on a single machine:

Operation
pandas
Polars
DuckDB

CSV Read (10M rows)
1.05s
137ms
762ms

GroupBy
271ms
31ms
29ms

Join (1M rows)
60ms
12ms
56ms

Memory Usage
707 MB
0.06 KB (streaming)
0.42 KB (spill-to-disk)

Polars leads in CSV reading (7.7× faster than pandas) and joins (5× faster). DuckDB matches Polars in groupby performance and uses the least memory with automatic spill-to-disk.
Feature Comparison
Each tool makes different trade-offs between speed, memory, and ecosystem integration:

Feature
pandas
Polars
DuckDB

Multi-threading
No
Yes
Yes

Lazy evaluation
No
Yes
N/A (SQL)

Query optimization
No
Yes
Yes

Larger-than-RAM
No
Streaming
Spill-to-disk

SQL interface
No
Limited
Native

ML integration
Excellent
Good
Limited

pandas lacks the performance features that make Polars and DuckDB fast, but remains essential for ML workflows. Choose between Polars and DuckDB based on whether you prefer DataFrame chaining or SQL syntax.
Recommendations
The best tool depends on your data size, workflow preferences, and constraints:

Small data (<1M rows): Use pandas for simplicity
Large data (1M-100M rows): Use Polars or DuckDB for 5-10× speedup
SQL-preferred workflow: Use DuckDB
DataFrame-preferred workflow: Use Polars
Memory-constrained: Use Polars (streaming) or DuckDB (spill-to-disk)
ML pipeline integration: Use pandas (convert from Polars/DuckDB as needed)
Production data pipelines: Use Polars (DataFrame) or DuckDB (SQL) based on team preference

Final Thoughts
If your codebase is written in pandas, you don’t need to rewrite everything. Migrate where it matters:

Profile first: Find which pandas operations are slow
Replace with Polars: CSV reads, groupbys, and joins see the biggest gains
Add DuckDB: When SQL is cleaner than chained DataFrame operations

Keep pandas for final ML steps. Convert with df.to_pandas() when needed.
Related Tutorials

Polars vs. Pandas: A Fast, Multi-Core Alternative for DataFrames: Deep comparison of Polars and pandas performance and API differences
A Deep Dive into DuckDB for Data Scientists: Explore DuckDB’s SQL-native analytics with Python integration
Scaling Pandas Workflows with PySpark’s Pandas API: Scale existing pandas code to distributed processing with PySpark
Delta Lake: Transform pandas Prototypes into Production: Add ACID transactions and versioning to your pandas data pipelines

📚 Want to go deeper? Learning new techniques is the easy part. Knowing how to structure, test, and deploy them is what separates side projects from real work. My book shows you how to build data science projects that actually make it to production. Get the book →


pandas vs Polars vs DuckDB: A Data Scientist’s Guide to Choosing the Right Tool

The Complete PySpark SQL Guide: DataFrames, Aggregations, Window Functions, and Pandas UDFs

Table of Contents

Introduction
Getting Started
Creating DataFrames
Understanding Lazy Evaluation
Data Exploration
Selection & Filtering
Column Operations
Aggregation Functions
String Functions
Date/Time Functions
Working with Time Series
Window Analytics
Join Operations
SQL Integration
Custom Functions
SQL Expressions
Conclusion

Introduction
pandas works great for prototyping but fails when datasets grow beyond memory capacity. While PySpark offers distributed computing to handle massive datasets, mastering its new syntax and rewriting existing code creates a steep barrier to adoption.
PySpark SQL bridges this gap by offering SQL-style DataFrame operations and query strings, eliminating the need to learn PySpark’s lower-level RDD programming model and functional transformations.
In this comprehensive guide, you’ll learn PySpark SQL from the ground up:

Load, explore, and manipulate DataFrames with selection and filtering operations
Aggregate data and work with strings, dates, and time series
Use window functions for rankings, running totals, and moving averages
Execute SQL queries alongside DataFrame operations
Create custom functions with pandas UDFs for vectorized performance

💻 Get the Code: The complete source code and Jupyter notebook for this tutorial are available on GitHub. Clone it to follow along!


Getting Started
First, install PySpark:
pip install pyspark

Create a SparkSession to start working with PySpark:
from pyspark.sql import SparkSession

spark = (
SparkSession.builder
.appName("PySpark SQL Guide")
.getOrCreate()
)

The SparkSession is your entry point to all PySpark functionality.
Creating DataFrames
PySpark supports creating DataFrames from multiple sources including Python objects, pandas DataFrames, files, and databases.
Create from Python dictionaries:
data = {
"customer_id": [1, 2, 3, 4, 5],
"name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
"region": ["North", "South", "North", "East", "West"],
"amount": [100, 150, 200, 120, 180]
}

# Create DataFrame by zipping the column lists into rows
df = spark.createDataFrame(
    list(zip(
        data["customer_id"],
        data["name"],
        data["region"],
        data["amount"],
    )),
    ["customer_id", "name", "region", "amount"]
)

Convert from pandas:
import pandas as pd

pandas_df = pd.DataFrame(data)
# Convert pandas DataFrame to PySpark DataFrame
df = spark.createDataFrame(pandas_df)
df.show()

Load from CSV files:
# Load CSV file with automatic schema inference
df = spark.read.csv("data.csv", header=True, inferSchema=True)

Read with schema specification:
from pyspark.sql.types import StructType, StructField, StringType, LongType

schema = StructType([
StructField("customer_id", LongType(), True),
StructField("name", StringType(), True),
StructField("region", StringType(), True),
StructField("amount", LongType(), True)
])

# Load CSV file with explicit schema definition
df = spark.read.csv("data.csv", header=True, schema=schema)

Understanding Lazy Evaluation
PySpark’s execution model differs fundamentally from pandas. Operations are divided into two types.
Transformations are lazy operations that build execution plans without running:
# Transformations return immediately
filtered = df.filter(df.amount > 100)
print(f"Filtered: {filtered}")

selected = filtered.select("name", "amount")
print(f"Selected: {selected}")

Output:
Filtered: DataFrame[customer_id: bigint, name: string, region: string, amount: bigint]
Selected: DataFrame[name: string, amount: bigint]

Common transformations include select(), filter(), withColumn(), and groupBy(). They return instantly because they only build an execution plan and can be chained without performance cost.
Actions trigger execution and return actual results:
selected.show()

Output:
+-------+------+
|   name|amount|
+-------+------+
|    Bob|   150|
|Charlie|   200|
|  Diana|   120|
|    Eve|   180|
+-------+------+

Common actions include show(), collect(), count(), and describe(). They execute the entire chain of transformations and return actual results.
This lazy evaluation enables Spark’s Catalyst optimizer to analyze your complete workflow and apply optimizations like predicate pushdown and column pruning before execution.
Data Exploration
Data exploration in PySpark works similarly to pandas, but with methods designed for distributed computing. Instead of pandas’ df.info() and df.head(), PySpark uses printSchema() and show() to inspect schemas and preview records across the cluster.
View the schema:
df.printSchema()

Output:
root
 |-- customer_id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- region: string (nullable = true)
 |-- amount: long (nullable = true)

Preview the first few rows:
# Display the first 5 rows of the DataFrame
df.show(5)

Output:
+-----------+-------+------+------+
|customer_id|   name|region|amount|
+-----------+-------+------+------+
|          1|  Alice| North|   100|
|          2|    Bob| South|   150|
|          3|Charlie| North|   200|
|          4|  Diana|  East|   120|
|          5|    Eve|  West|   180|
+-----------+-------+------+------+

Get summary statistics:
# Generate summary statistics for numeric columns
df.describe().show()

Output:
+-------+-----------+-----+------+------------------+
|summary|customer_id| name|region|            amount|
+-------+-----------+-----+------+------------------+
|  count|          5|    5|     5|                 5|
|   mean|        3.0| null|  null|             150.0|
| stddev|       1.58| null|  null|40.311288741492746|
|    min|          1|Alice|  East|               100|
|    max|          5|  Eve|  West|               200|
+-------+-----------+-----+------+------------------+

Count total rows:
df.count()

Output:
5

List column names:
df.columns

Output:
['customer_id', 'name', 'region', 'amount']

Get distinct values in a column:
# Get all unique values in the region column
df.select("region").distinct().show()

Output:
+------+
|region|
+------+
| South|
|  East|
|  West|
| North|
+------+

Sample random rows:
# Randomly sample 60% of the rows
df.sample(fraction=0.6, seed=42).show()

Output:
+-----------+-----+------+------+
|customer_id| name|region|amount|
+-----------+-----+------+------+
|          2|  Bob| South|   150|
|          4|Diana|  East|   120|
+-----------+-----+------+------+

Selection & Filtering
When selecting and filtering data, PySpark uses explicit methods like select() and filter() that build distributed execution plans.
Select specific columns:
# Select columns name and amount
df.select("name", "amount").show()

Output:
+——-+——+
| name|amount|
+——-+——+
| Alice| 100|
| Bob| 150|
|Charlie| 200|
| Diana| 120|
| Eve| 180|
+——-+——+

Filter rows with conditions:
# Filter rows where amount is greater than 150
df.filter(df.amount > 150).show()

Output:
+———–+——-+——+——+
|customer_id| name|region|amount|
+———–+——-+——+——+
| 3|Charlie| North| 200|
| 5| Eve| West| 180|
+———–+——-+——+——+

Chain multiple filters:
# Get rows where amount is greater than 100 and region is North
(
df.filter(df.amount > 100)
.filter(df.region == "North")
.show()
)

Output:
+———–+——-+——+——+
|customer_id| name|region|amount|
+———–+——-+——+——+
| 3|Charlie| North| 200|
+———–+——-+——+——+

Drop columns:
# Drop the customer_id column
df.drop("customer_id").show()

Output:
+——-+——+——+
| name|region|amount|
+——-+——+——+
| Alice| North| 100|
| Bob| South| 150|
|Charlie| North| 200|
| Diana| East| 120|
| Eve| West| 180|
+——-+——+——+

Column Operations
Unlike pandas’ mutable operations where df['new_col'] modifies the DataFrame in place, PySpark’s withColumn() and withColumnRenamed() return new DataFrames, maintaining the distributed computing model.
The withColumn() method takes two arguments: the new column name and an expression defining its values:
from pyspark.sql.functions import col

# Add a new column with the amount with tax
df.withColumn("amount_with_tax", col("amount") * 1.1).show()

Output:
+———–+——-+——+——+——————+
|customer_id| name|region|amount| amount_with_tax|
+———–+——-+——+——+——————+
| 1| Alice| North| 100|110.00000000000001|
| 2| Bob| South| 150| 165.0|
| 3|Charlie| North| 200|220.00000000000003|
| 4| Diana| East| 120| 132.0|
| 5| Eve| West| 180|198.00000000000003|
+———–+——-+——+——+——————+

Add constant value columns with lit():
from pyspark.sql.functions import lit

# Add a column with the same value for all rows
df.withColumn("country", lit("USA")).select("name", "amount", "country").show()

Output:
+——-+——+——-+
| name|amount|country|
+——-+——+——-+
| Alice| 100| USA|
| Bob| 150| USA|
|Charlie| 200| USA|
| Diana| 120| USA|
| Eve| 180| USA|
+——-+——+——-+

Use withColumnRenamed() to rename a column by specifying the old name and new name:
# Rename the amount column to revenue
df.withColumnRenamed("amount", "revenue").show()

Output:
+———–+——-+——+——-+
|customer_id| name|region|revenue|
+———–+——-+——+——-+
| 1| Alice| North| 100|
| 2| Bob| South| 150|
| 3|Charlie| North| 200|
| 4| Diana| East| 120|
| 5| Eve| West| 180|
+———–+——-+——+——-+

Use the cast() method to convert a column to a different data type:
# Cast the amount column to a string
df.withColumn("amount_str", col("amount").cast("string")).printSchema()

Output:
root
 |-- customer_id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- region: string (nullable = true)
 |-- amount: long (nullable = true)
 |-- amount_str: string (nullable = true)

Aggregation Functions
Unlike pandas’ in-memory aggregations, PySpark’s groupBy() and aggregation functions distribute calculations across cluster nodes, using the same conceptual model as pandas but with lazy evaluation.
Apply aggregation functions directly in select() to compute values across all rows without grouping:
from pyspark.sql.functions import sum, avg, count, max, min

# Calculate total revenue, average revenue, and count across all rows
df.select(
sum("amount").alias("total_revenue"),
avg("amount").alias("avg_revenue"),
count("*").alias("order_count")
).show()

Output:
+-------------+-----------+-----------+
|total_revenue|avg_revenue|order_count|
+-------------+-----------+-----------+
|          750|      150.0|          5|
+-------------+-----------+-----------+

Combine groupBy() to create groups and agg() to compute multiple aggregations per group:
# Calculate total revenue and customer count by region
(
df.groupBy("region")
.agg(
sum("amount").alias("total_revenue"),
count("*").alias("customer_count")
)
.show()
)

Output:
+------+-------------+--------------+
|region|total_revenue|customer_count|
+------+-------------+--------------+
| North|          300|             2|
|  East|          120|             1|
| South|          150|             1|
|  West|          180|             1|
+------+-------------+--------------+

Combine groupBy() with collect_list() to create arrays of values for each group:
from pyspark.sql.functions import collect_list

# Collect customer names into an array for each region
(
df.groupBy("region")
.agg(collect_list("name").alias("customers"))
.show(truncate=False)
)

Output:
+------+----------------+
|region|customers       |
+------+----------------+
|South |[Bob]           |
|East  |[Diana]         |
|West  |[Eve]           |
|North |[Alice, Charlie]|
+------+----------------+

String Functions
Unlike pandas’ vectorized string methods accessed via .str, PySpark provides importable functions like concat(), split(), and regexp_replace() that transform entire columns across distributed partitions.
Use concat() to combine multiple columns and literal strings, wrapping constant values with lit():
from pyspark.sql.functions import concat, lit

# Concatenate customer name and region with a separator
df.withColumn("full_info", concat(col("name"), lit(" - "), col("region"))).show()

Output:
+-----------+-------+------+------+---------------+
|customer_id|   name|region|amount|      full_info|
+-----------+-------+------+------+---------------+
|          1|  Alice| North|   100|  Alice - North|
|          2|    Bob| South|   150|    Bob - South|
|          3|Charlie| North|   200|Charlie - North|
|          4|  Diana|  East|   120|   Diana - East|
|          5|    Eve|  West|   180|     Eve - West|
+-----------+-------+------+------+---------------+

Use split() to divide a string column into an array based on a delimiter pattern:
from pyspark.sql.functions import split

# Create sample data with email addresses
email_data = spark.createDataFrame(
[("alice@company.com",), ("bob@startup.io",), ("charlie@corp.net",)],
["email"]
)

# Split email into username and domain
(
email_data.withColumn("email_parts", split(col("email"), "@"))
.select("email", "email_parts")
.show(truncate=False)
)

Output:
+-----------------+--------------------+
|email            |email_parts         |
+-----------------+--------------------+
|alice@company.com|[alice, company.com]|
|bob@startup.io   |[bob, startup.io]   |
|charlie@corp.net |[charlie, corp.net] |
+-----------------+--------------------+

Use regexp_replace() to find and replace text patterns using regular expressions:
from pyspark.sql.functions import regexp_replace

# Create sample data with phone numbers
phone_data = spark.createDataFrame(
[("Alice", "123-456-7890"), ("Bob", "987-654-3210"), ("Charlie", "555-123-4567")],
["name", "phone"]
)

# Mask phone numbers, keeping only last 4 digits
(
phone_data.withColumn("masked_phone", regexp_replace(col("phone"), r"\d{3}-\d{3}-(\d{4})", "XXX-XXX-$1"))
.select("name", "phone", "masked_phone")
.show()
)

Output:
+-------+------------+------------+
|   name|       phone|masked_phone|
+-------+------------+------------+
|  Alice|123-456-7890|XXX-XXX-7890|
|    Bob|987-654-3210|XXX-XXX-3210|
|Charlie|555-123-4567|XXX-XXX-4567|
+-------+------------+------------+

Date/Time Functions
Working with dates and timestamps is essential for time-based analysis. PySpark offers comprehensive functions to extract date components, format timestamps, and perform temporal operations.
Create sample data with dates:
from datetime import datetime, timedelta

date_data = [
(1, datetime(2024, 1, 15), 100),
(2, datetime(2024, 2, 20), 150),
(3, datetime(2024, 3, 10), 200),
(4, datetime(2024, 4, 5), 120),
(5, datetime(2024, 5, 25), 180)
]

# Create sample DataFrame with datetime values
df_dates = spark.createDataFrame(date_data, ["id", "order_date", "amount"])

Use functions like year(), month(), and dayofmonth() to extract individual date components from timestamp columns:
from pyspark.sql.functions import year, month, dayofmonth

# Extract year, month, and day components from order_date
(
df_dates.withColumn("year", year("order_date"))
.withColumn("month", month("order_date"))
.withColumn("day", dayofmonth("order_date"))
.show()
)

Output:
+---+-------------------+------+----+-----+---+
| id|         order_date|amount|year|month|day|
+---+-------------------+------+----+-----+---+
|  1|2024-01-15 00:00:00|   100|2024|    1| 15|
|  2|2024-02-20 00:00:00|   150|2024|    2| 20|
|  3|2024-03-10 00:00:00|   200|2024|    3| 10|
|  4|2024-04-05 00:00:00|   120|2024|    4|  5|
|  5|2024-05-25 00:00:00|   180|2024|    5| 25|
+---+-------------------+------+----+-----+---+

Use date_format() to convert dates to custom string formats:
from pyspark.sql.functions import date_format

# Format timestamps as YYYY-MM-DD strings
(
df_dates.withColumn("formatted_date", date_format("order_date", "yyyy-MM-dd"))
.select("order_date", "formatted_date")
.show()
)

Output:
+-------------------+--------------+
|         order_date|formatted_date|
+-------------------+--------------+
|2024-01-15 00:00:00|    2024-01-15|
|2024-02-20 00:00:00|    2024-02-20|
|2024-03-10 00:00:00|    2024-03-10|
|2024-04-05 00:00:00|    2024-04-05|
|2024-05-25 00:00:00|    2024-05-25|
+-------------------+--------------+

Use to_timestamp() to convert string columns to timestamp objects by specifying the date format pattern:
from pyspark.sql.functions import to_timestamp

string_dates = spark.createDataFrame(
[("2024-01-15",), ("2024-02-20",)],
["date_string"]
)

# Convert date strings to timestamp objects
string_dates.withColumn(
"timestamp",
to_timestamp("date_string", "yyyy-MM-dd")
).show()

Output:
+-----------+-------------------+
|date_string|          timestamp|
+-----------+-------------------+
| 2024-01-15|2024-01-15 00:00:00|
| 2024-02-20|2024-02-20 00:00:00|
+-----------+-------------------+

Working with Time Series
Time series analysis often requires comparing values across different time periods. PySpark’s window functions with lag and lead operations enable calculations of changes and trends over time.
Create sample time series data:
ts_data = [
(1, datetime(2024, 1, 1), 100),
(1, datetime(2024, 1, 2), 120),
(1, datetime(2024, 1, 3), 110),
(2, datetime(2024, 1, 1), 200),
(2, datetime(2024, 1, 2), 220),
(2, datetime(2024, 1, 3), 210)
]

# Create time series data with multiple dates per customer
df_ts = spark.createDataFrame(ts_data, ["customer_id", "date", "amount"])

Calculate the previous row’s value within each customer group using lag():
from pyspark.sql.window import Window
from pyspark.sql.functions import lag

# Create a window: group by customer_id, order by date
window_spec = Window.partitionBy("customer_id").orderBy("date")

# Get the previous amount for each customer
df_ts.withColumn("prev_amount", lag("amount").over(window_spec)).show()

Output:
+-----------+-------------------+------+-----------+
|customer_id|               date|amount|prev_amount|
+-----------+-------------------+------+-----------+
|          1|2024-01-01 00:00:00|   100|       null|
|          1|2024-01-02 00:00:00|   120|        100|
|          1|2024-01-03 00:00:00|   110|        120|
|          2|2024-01-01 00:00:00|   200|       null|
|          2|2024-01-02 00:00:00|   220|        200|
|          2|2024-01-03 00:00:00|   210|        220|
+-----------+-------------------+------+-----------+

The first row in each customer group has null for prev_amount because there’s no previous value.
Calculate day-over-day change by combining lag() to get the previous value and subtracting it from the current value:
# Calculate the day-over-day change in amount
(
df_ts.withColumn("prev_amount", lag("amount", 1).over(window_spec))
.withColumn("daily_change", col("amount") - col("prev_amount"))
.show()
)

Output:
+-----------+-------------------+------+-----------+------------+
|customer_id|               date|amount|prev_amount|daily_change|
+-----------+-------------------+------+-----------+------------+
|          1|2024-01-01 00:00:00|   100|       null|        null|
|          1|2024-01-02 00:00:00|   120|        100|          20|
|          1|2024-01-03 00:00:00|   110|        120|         -10|
|          2|2024-01-01 00:00:00|   200|       null|        null|
|          2|2024-01-02 00:00:00|   220|        200|          20|
|          2|2024-01-03 00:00:00|   210|        220|         -10|
+-----------+-------------------+------+-----------+------------+

Window Analytics
Complex analytics operations like rankings, running totals, and moving averages require window functions that operate within data partitions. These functions enable sophisticated analytical queries without self-joins.
Apply ranking functions within partitioned groups:

Combine Window.partitionBy() and Window.orderBy() to rank within groups
rank() handles ties by giving them the same rank with gaps (e.g., 1, 2, 2, 4)
row_number() always assigns unique sequential numbers (e.g., 1, 2, 3, 4)
dense_rank() gives ties the same rank without gaps (e.g., 1, 2, 2, 3)

from pyspark.sql.functions import rank, row_number, dense_rank

# Create sample data with categories to show ranking within groups
ranking_data = spark.createDataFrame(
[("Math", "Alice", 100), ("Math", "Bob", 150), ("Math", "Charlie", 150),
("Science", "Diana", 200), ("Science", "Eve", 100)],
["subject", "name", "score"]
)

# Define window partitioned by subject, ordered by score descending
window_spec = Window.partitionBy("subject").orderBy(col("score").desc())

# Calculate different ranking methods within each subject
(
ranking_data.withColumn("rank", rank().over(window_spec))
.withColumn("row_number", row_number().over(window_spec))
.withColumn("dense_rank", dense_rank().over(window_spec))
.show()
)

Output:
+-------+-------+-----+----+----------+----------+
|subject|   name|score|rank|row_number|dense_rank|
+-------+-------+-----+----+----------+----------+
|   Math|    Bob|  150|   1|         1|         1|
|   Math|Charlie|  150|   1|         2|         1|
|   Math|  Alice|  100|   3|         3|         2|
|Science|  Diana|  200|   1|         1|         1|
|Science|    Eve|  100|   2|         2|         2|
+-------+-------+-----+----+----------+----------+

Calculate running totals using rowsBetween() to define a window range:

Window.unboundedPreceding starts the window at the first row of the partition
Window.currentRow ends the window at the current row being processed
This creates an expanding window that includes all rows from the start up to the current position
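Conceptually, the expanding window is just a cumulative sum per partition. Here is the same idea in plain standard-library Python (an intuition sketch, not Spark code), using store A's sales from the example below:

```python
from itertools import accumulate

# Store A's daily sales
store_a_sales = [50, 75, 100]

# Each element sums everything from the start of the partition
# through the current row -- the expanding window
running_total = list(accumulate(store_a_sales))
print(running_total)  # [50, 125, 225]
```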

from pyspark.sql.functions import sum as _sum

# Create daily sales data with store identifier
daily_sales = spark.createDataFrame(
    [("A", 1, 50), ("A", 2, 75), ("A", 3, 100),
     ("B", 1, 25), ("B", 2, 150), ("B", 3, 80)],
    ["store", "day", "sales"]
)

# Define window partitioned by store, from beginning to current row
window_spec = (
    Window.partitionBy("store")
    .orderBy("day")
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)

# Calculate running total of sales per store
daily_sales.withColumn("running_total", _sum("sales").over(window_spec)).show()

Output:
+-----+---+-----+-------------+
|store|day|sales|running_total|
+-----+---+-----+-------------+
|    A|  1|   50|           50|
|    A|  2|   75|          125|
|    A|  3|  100|          225|
|    B|  1|   25|           25|
|    B|  2|  150|          175|
|    B|  3|   80|          255|
+-----+---+-----+-------------+

Use rowsBetween(-1, 1) to create a 3-row sliding window that includes the previous row, current row, and next row:
# Define window for 3-row moving average (previous, current, next)
window_spec = (
    Window.partitionBy("customer_id")
    .orderBy("date")
    .rowsBetween(-1, 1)
)

# Calculate moving average of amount over the window
df_ts.withColumn("moving_avg", avg("amount").over(window_spec)).show()

Output:
+-----------+-------------------+------+----------+
|customer_id|               date|amount|moving_avg|
+-----------+-------------------+------+----------+
|          1|2024-01-01 00:00:00|   100|     110.0|
|          1|2024-01-02 00:00:00|   120|     110.0|
|          1|2024-01-03 00:00:00|   110|     115.0|
|          2|2024-01-01 00:00:00|   200|     210.0|
|          2|2024-01-02 00:00:00|   220|     210.0|
|          2|2024-01-03 00:00:00|   210|     215.0|
+-----------+-------------------+------+----------+
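Note how the window is clipped at partition edges: the first and last rows average only two values. That behavior can be mimicked with a hypothetical plain-Python helper (for intuition only, not part of PySpark):

```python
def centered_moving_avg(values, before=1, after=1):
    """Average over [i - before, i + after], clipped at the list edges."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - before): i + after + 1]
        out.append(sum(window) / len(window))
    return out

# Customer 1's amounts from the example above
print(centered_moving_avg([100, 120, 110]))  # [110.0, 110.0, 115.0]
```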

Join Operations
Combining data from multiple tables is a core operation in data analysis. PySpark supports various join types including inner, left, and broadcast joins, with automatic optimization for performance.
Create sample tables for joining:
# Create sample customers and orders tables for joining
customers = spark.createDataFrame(
    [(1, "Alice", "US"), (2, "Bob", "UK"), (3, "Charlie", "US")],
    ["customer_id", "name", "country"]
)

orders = spark.createDataFrame(
    [(101, 1, 100), (102, 2, 150), (103, 1, 200), (104, 3, 120)],
    ["order_id", "customer_id", "amount"]
)

Use join() to perform an inner join, which returns only rows with matching keys in both DataFrames:
# Join customers and orders on customer_id
customers.join(orders, "customer_id").show()

Output:
+-----------+-------+-------+--------+------+
|customer_id|   name|country|order_id|amount|
+-----------+-------+-------+--------+------+
|          1|  Alice|     US|     101|   100|
|          1|  Alice|     US|     103|   200|
|          2|    Bob|     UK|     102|   150|
|          3|Charlie|     US|     104|   120|
+-----------+-------+-------+--------+------+

Perform a left join by specifying "left" as the third argument, which retains all left table rows regardless of matches:
# Create extended customers table including Diana
customers_extended = spark.createDataFrame(
    [(1, "Alice", "US"), (2, "Bob", "UK"), (3, "Charlie", "US"), (4, "Diana", "CA")],
    ["customer_id", "name", "country"]
)

# Left join to keep all customers even without orders
customers_extended.join(orders, "customer_id", "left").show()

Output:
+-----------+-------+-------+--------+------+
|customer_id|   name|country|order_id|amount|
+-----------+-------+-------+--------+------+
|          1|  Alice|     US|     103|   200|
|          1|  Alice|     US|     101|   100|
|          2|    Bob|     UK|     102|   150|
|          3|Charlie|     US|     104|   120|
|          4|  Diana|     CA|    NULL|  NULL|
+-----------+-------+-------+--------+------+

Chain multiple join() calls together to combine three or more DataFrames in sequence:
# Create products and order_items tables
products = spark.createDataFrame(
    [(1, "Widget"), (2, "Gadget")],
    ["product_id", "product_name"]
)

order_items = spark.createDataFrame(
    [(101, 1), (102, 2), (103, 1), (104, 2)],
    ["order_id", "product_id"]
)

# Chain multiple joins to combine orders, items, and products
(
    orders.join(order_items, "order_id")
    .join(products, "product_id")
    .select("order_id", "customer_id", "product_name", "amount")
    .show()
)

Output:
+--------+-----------+------------+------+
|order_id|customer_id|product_name|amount|
+--------+-----------+------------+------+
|     103|          1|      Widget|   200|
|     101|          1|      Widget|   100|
|     104|          3|      Gadget|   120|
|     102|          2|      Gadget|   150|
+--------+-----------+------------+------+

SQL Integration
PySpark supports standard SQL syntax for querying data. You can write SQL queries using familiar SELECT, JOIN, and WHERE clauses alongside PySpark operations.
Use createOrReplaceTempView() to register a DataFrame as a temporary SQL table, allowing it to be queried multiple times with SQL syntax:
# Register DataFrame as a temporary SQL view named customers
df.createOrReplaceTempView("customers")

Execute SQL queries on DataFrames registered with createOrReplaceTempView() using spark.sql():
# Execute SQL query to aggregate revenue by region
spark.sql("""
SELECT region, SUM(amount) as total_revenue
FROM customers
GROUP BY region
ORDER BY total_revenue DESC
""").show()

Output:
+------+-------------+
|region|total_revenue|
+------+-------------+
| North|          300|
|  West|          180|
| South|          150|
|  East|          120|
+------+-------------+

Chain SQL queries with DataFrame operations by storing spark.sql() results and applying methods like filter() and orderBy():
# Query customers with amount greater than 100
result = spark.sql("""
    SELECT customer_id, name, region, amount
    FROM customers
    WHERE amount > 100
""")

# Filter SQL results using DataFrame API and sort by amount
(
    result.filter(col("region") != "South")
    .orderBy("amount", ascending=False)
    .drop("region")
    .show()
)

Output:
+-----------+-------+------+
|customer_id|   name|amount|
+-----------+-------+------+
|          3|Charlie|   200|
|          5|    Eve|   180|
|          4|  Diana|   120|
+-----------+-------+------+

Use SQL syntax within spark.sql() for complex joins and aggregations when SQL is more readable than DataFrame API:
# Register orders and customers as temporary SQL views
orders.createOrReplaceTempView("orders")
customers.createOrReplaceTempView("customers_table")

# Join and aggregate using SQL syntax
spark.sql("""
SELECT
c.name,
COUNT(o.order_id) as order_count,
SUM(o.amount) as total_spent
FROM customers_table c
JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.name
ORDER BY total_spent DESC
""").show()

Output:
+-------+-----------+-----------+
|   name|order_count|total_spent|
+-------+-----------+-----------+
|  Alice|          2|        300|
|    Bob|          1|        150|
|Charlie|          1|        120|
+-------+-----------+-----------+

Custom Functions
When built-in functions aren’t sufficient, custom logic can be implemented using pandas UDFs. These user-defined functions provide vectorized performance through Apache Arrow and support both scalar operations and grouped transformations.

📚 For taking your data science projects from prototype to production, check out Production-Ready Data Science.

Create a scalar pandas UDF with the @pandas_udf decorator to apply custom Python functions to columns with vectorized performance:
from pyspark.sql.functions import pandas_udf
import pandas as pd

# Define pandas UDF to calculate discount based on price and quantity
@pandas_udf("double")
def calculate_discount(amount: pd.Series, quantity: pd.Series) -> pd.Series:
    return amount * quantity * 0.1

# Create sample order data with price and quantity
order_data = spark.createDataFrame(
    [(1, 100.0, 2), (2, 150.0, 3), (3, 200.0, 1)],
    ["order_id", "price", "quantity"]
)

# Apply discount calculation UDF to create discount column
order_data.withColumn(
    "discount",
    calculate_discount(col("price"), col("quantity"))
).show()

Output:
+--------+-----+--------+--------+
|order_id|price|quantity|discount|
+--------+-----+--------+--------+
|       1|100.0|       2|    20.0|
|       2|150.0|       3|    45.0|
|       3|200.0|       1|    20.0|
+--------+-----+--------+--------+

Apply custom pandas functions to grouped data with groupBy().applyInPandas():

Define a function that transforms each group as a pandas DataFrame
Specify output schema to tell PySpark the resulting column names and types

# Define function to normalize amounts within each group
def normalize_by_group(pdf: pd.DataFrame) -> pd.DataFrame:
    pdf["normalized"] = (pdf["amount"] - pdf["amount"].mean()) / pdf["amount"].std()
    return pdf

schema = "customer_id long, date timestamp, amount long, normalized double"

# Apply normalization function to each customer_id group
df_ts.groupBy("customer_id").applyInPandas(normalize_by_group, schema).show()

Output:
+-----------+-------------------+------+----------+
|customer_id|               date|amount|normalized|
+-----------+-------------------+------+----------+
|          1|2024-01-01 00:00:00|   100|      -1.0|
|          1|2024-01-02 00:00:00|   120|       1.0|
|          1|2024-01-03 00:00:00|   110|       0.0|
|          2|2024-01-01 00:00:00|   200|      -1.0|
|          2|2024-01-02 00:00:00|   220|       1.0|
|          2|2024-01-03 00:00:00|   210|       0.0|
+-----------+-------------------+------+----------+

SQL Expressions
SQL expressions can be embedded directly within DataFrame operations for complex transformations. The expr() and selectExpr() functions allow SQL syntax to be used alongside DataFrame methods, providing flexibility in query construction.
Use the expr() function to embed SQL syntax within DataFrame operations, allowing SQL-style calculations in withColumn():
from pyspark.sql.functions import expr

# Add tax and total columns using SQL expressions
(
    df.withColumn("tax", expr("amount * 0.1"))
    .withColumn("total", expr("amount + (amount * 0.1)"))
    .show()
)

Output:
+-----------+-------+------+------+----+-----+
|customer_id|   name|region|amount| tax|total|
+-----------+-------+------+------+----+-----+
|          1|  Alice| North|   100|10.0|110.0|
|          2|    Bob| South|   150|15.0|165.0|
|          3|Charlie| North|   200|20.0|220.0|
|          4|  Diana|  East|   120|12.0|132.0|
|          5|    Eve|  West|   180|18.0|198.0|
+-----------+-------+------+------+----+-----+

Unlike select() which uses column objects and method chaining, selectExpr() accepts SQL strings and is preferred for complex expressions that are simpler to write as SQL:
# Select columns with calculations and CASE statement using SQL syntax
df.selectExpr(
    "customer_id",
    "name",
    "amount * 1.1 AS amount_with_tax",
    "CASE WHEN amount > 150 THEN 'high' ELSE 'normal' END as category"
).show()

Output:
+-----------+-------+---------------+--------+
|customer_id|   name|amount_with_tax|category|
+-----------+-------+---------------+--------+
|          1|  Alice|          110.0|  normal|
|          2|    Bob|          165.0|  normal|
|          3|Charlie|          220.0|    high|
|          4|  Diana|          132.0|  normal|
|          5|    Eve|          198.0|    high|
+-----------+-------+---------------+--------+

Conclusion
You’ve learned PySpark SQL from fundamentals to advanced analytics. Here’s a quick reference:

Category
Methods
Description

DataFrame Creation
createDataFrame(), read.csv(), read.parquet()
Create DataFrames from Python data, CSV, or Parquet files

Data Exploration
show(), describe(), count(), columns, distinct(), sample()
Inspect and explore DataFrame contents

Selection & Filtering
select(), filter(), where(), drop()
Choose columns and filter rows

Column Operations
withColumn(), withColumnRenamed(), cast(), col(), lit()
Add, modify, rename columns and change types

Aggregations
groupBy(), agg(), sum(), avg(), count(), max(), min(), collect_list()
Group data and compute statistics

String Functions
concat(), split(), regexp_replace()
Manipulate text data

Date/Time Functions
year(), month(), dayofmonth(), date_format(), to_timestamp()
Extract and format date components

Window Functions
Window.partitionBy(), rank(), row_number(), dense_rank(), lag(), rowsBetween()
Rankings, time series, and running calculations

Joins
join()
Combine DataFrames with inner, left, right joins

SQL Integration
createOrReplaceTempView(), spark.sql()
Execute SQL queries on DataFrames

Custom Functions
@pandas_udf, applyInPandas()
Vectorized custom functions and group operations

SQL Expressions
expr(), selectExpr()
Embed SQL syntax in DataFrame operations

PySpark SQL gives you the power to scale from prototyping to production without rewriting your workflows, making it an essential tool for modern data science.
Next Steps:
Expand your PySpark expertise with these related tutorials:

What’s New in PySpark 4.0 for Arrow UDFs, native visualization, and dynamic UDTFs
Scaling Pandas Workflows with PySpark’s Pandas API for pandas-compatible distributed computing
Natural-Language Queries for Spark: Using LangChain for AI-powered SQL generation
Writing Safer PySpark Queries with Parameters for secure, reusable queries
Delta Lake: Transform pandas Prototypes into Production for production-grade data lakes

📚 Want to go deeper? Learning new techniques is the easy part. Knowing how to structure, test, and deploy them is what separates side projects from real work. My book shows you how to build data science projects that actually make it to production. Get the book →


The Complete PySpark SQL Guide: DataFrames, Aggregations, Window Functions, and Pandas UDFs

Narwhals: Unified DataFrame Functions for pandas, Polars, and PySpark


Table of Contents

Introduction
Bad solution: convert all user input to pandas
Unmaintainable solution: write separate code for all input libraries
Best solution: express your logic once using Narwhals
What happens when libraries evolve?
Conclusion

Introduction
Have you ever needed to convert a function to work with pandas, Polars, DuckDB, or PySpark DataFrames? If you were a data scientist in 2022, you might have just used pandas and called it a day.
from datetime import datetime

import pandas as pd

df = pd.DataFrame({
    "date": [datetime(2020, 1, 1), datetime(2020, 1, 8), datetime(2020, 2, 3)],
    "price": [1, 4, 3],
})

def monthly_aggregate_pandas(user_df):
    return user_df.resample("MS", on="date")[["price"]].mean()

monthly_aggregate_pandas(df)

But it’s 2025 now — if you try doing that today, you’ll quickly run into complaints:

Another team prefers using Polars.
Your lead data engineer wants to deploy using PySpark.
Another data engineer argues that DuckDB is all you need.
Your colleague would prefer using PyArrow due to its great interoperability.

Indeed, choosing a dataframe library is a common pain point. With all the dataframe libraries out there, each with its own API, how do you make a good, future-proof decision and avoid lock-in?
In particular, how do you write reusable and maintainable functions that can work with any major dataframe library?
This article will walk through the limitations of naive conversion, the complexity of maintaining separate logic for each DataFrame library, and how Narwhals offers a clean, unified way to express your DataFrame logic once and run it anywhere.
💻 Get the Code: The complete source code and Jupyter notebook for this tutorial are available on GitHub. Clone it to follow along!
Bad solution: convert all user input to pandas
You could make your tool appear dataframe-agnostic by just converting the user input to pandas.
def monthly_aggregate_bad(user_df):
    if isinstance(user_df, pd.DataFrame):
        df = user_df
    elif isinstance(user_df, pl.DataFrame):
        df = user_df.to_pandas()
    elif isinstance(user_df, duckdb.DuckDBPyRelation):
        df = user_df.df()
    elif isinstance(user_df, pa.Table):
        df = user_df.to_pandas()
    elif isinstance(user_df, pyspark.sql.dataframe.DataFrame):
        df = user_df.toPandas()
    else:
        raise TypeError("Unsupported DataFrame type: cannot convert to pandas")
    return df.resample("MS", on="date")[["price"]].mean()

Use the monthly_aggregate_bad function for different types of DataFrames:
data = {
    "date": [datetime(2020, 1, 1), datetime(2020, 1, 8), datetime(2020, 2, 3)],
    "price": [1, 4, 3],
}

# pandas
pandas_df = pd.DataFrame(data)
monthly_aggregate_bad(pandas_df)

# polars
polars_df = pl.DataFrame(data)
monthly_aggregate_bad(polars_df)

# duckdb
duckdb_df = duckdb.from_df(pandas_df)
monthly_aggregate_bad(duckdb_df)

# pyspark
spark = SparkSession.builder.getOrCreate()
spark_df = spark.createDataFrame(pandas_df)
monthly_aggregate_bad(spark_df)

# pyarrow
arrow_table = pa.table(data)
monthly_aggregate_bad(arrow_table)

However, this is a missed opportunity, with several drawbacks:

You lose lazy API optimisations.
You lose GPU acceleration.
You lose multithreading.
pandas becomes a required dependency for everyone.

If you want to appease your data engineers, you’ll need to support modern data tools natively.
Unmaintainable solution: write separate code for all input libraries
Having decided that you need to support pandas, Polars, PySpark, DuckDB, and PyArrow natively, you may decide to write a separate branch for each input kind:
def monthly_aggregate_unmaintainable(user_df):
    if isinstance(user_df, pd.DataFrame):
        result = user_df.resample("MS", on="date")[["price"]].mean()
    elif isinstance(user_df, pl.DataFrame):
        result = (
            user_df.group_by(pl.col("date").dt.truncate("1mo"))
            .agg(pl.col("price").mean())
            .sort("date")
        )
    elif isinstance(user_df, pyspark.sql.dataframe.DataFrame):
        result = (
            user_df.withColumn("date_month", F.date_trunc("month", F.col("date")))
            .groupBy("date_month")
            .agg(F.mean("price").alias("price_mean"))
            .orderBy("date_month")
        )
    # TODO: more branches for DuckDB, PyArrow, Dask, etc…
    return result

Then use the monthly_aggregate_unmaintainable function for different types of DataFrames:
# pandas
monthly_aggregate_unmaintainable(pandas_df)

# polars
monthly_aggregate_unmaintainable(polars_df)

# pyspark
monthly_aggregate_unmaintainable(spark_df)

Maintaining separate code for each DataFrame library quickly becomes unmanageable. Every new library introduces more branching logic, more surface area for bugs, and more overhead when requirements change. Surely, there’s a better way?
Best solution: express your logic once using Narwhals
Narwhals is an extremely lightweight compatibility layer between dataframes and is used by Plotly, Marimo, Altair, Bokeh, and more. It allows you to express dataframe logic just once, with a unified API. Using Narwhals, the complicated code above becomes:
import narwhals as nw
from narwhals.typing import IntoFrameT

def monthly_aggregate(user_df: IntoFrameT) -> IntoFrameT:
    return (
        nw.from_native(user_df)
        .group_by(nw.col("date").dt.truncate("1mo"))
        .agg(nw.col("price").mean())
        .sort("date")
        .to_native()
    )

Use the monthly_aggregate function for different types of DataFrames:
# pandas
monthly_aggregate(pandas_df)

# polars
monthly_aggregate(polars_df)

# duckdb
monthly_aggregate(duckdb_df)

# pyarrow
monthly_aggregate(arrow_table)

# pyspark
monthly_aggregate(spark_df)

Much simpler! Code written like this can accept inputs from all major dataframe libraries, without any extra required dependencies! The user brings their own dataframe and gets their result. It also addresses other pain points faced by data science tool builders:

Full static typing.
Strong backwards-compatibility promises.
Minimal overhead.

Careful readers may have noticed that this looks a lot like the Polars solution. Indeed, the Narwhals API is a subset of the Polars API. Check the Narwhals documentation for more examples and tutorials.
What happens when libraries evolve?
Library APIs change over time — functions get deprecated, method signatures shift, or behavior becomes inconsistent across versions. Narwhals is built to absorb that churn by staying compatible with older and newer versions of the libraries it wraps. This means if you write a function with Narwhals today, it’s far more likely to continue working tomorrow, and no rewrites are required.
If you want to go further and guard against changes in Narwhals itself, you can use its stable API, which, like Rust’s Editions, is intended to remain indefinitely backwards compatible.
Conclusion
We’ve looked at how to write reusable and maintainable data science functions that support all major dataframe libraries. Keeping code maintainable in the face of all the DataFrame libraries is a common pain point for data scientists. Rather than just converting everything to pandas, a better solution is to use Narwhals as a unified dataframe interface. Next time you write a data science function and want to avoid dataframe library lock-in, Narwhals is your friend!
This blog post was contributed by Marco Gorelli, Senior Software Engineer at Quansight Labs.


Polars vs. Pandas: A Fast, Multi-Core Alternative for DataFrames

Table of Contents

Introduction
Why Consider Polars?
Setup
Multi-Core Performance
Timing Decorator
Plotting Setup
Reading CSV Files
Groupby Mean
Filter Rows
Sort by Column
Observations

Lazy Evaluation (Only in Polars)
Syntax Comparison
Filtering rows
Selecting columns
Chained operations

Memory Efficiency
Missing Features (Where Pandas Wins)
Summary
When to Use Polars vs. Pandas

Final Thoughts

Introduction
Pandas has been the standard tool for working with tabular data in Python for many years. But as datasets grow larger and performance becomes more important, Polars has emerged as an increasingly popular, faster alternative that uses multiple CPU cores.
This guide compares Polars and Pandas, highlighting where Polars shines and showing how to get started with it.

💻 Get the Code: The complete source code and Jupyter notebook for this tutorial are available on GitHub. Clone it to follow along!


Key Takeaways
Here’s what you’ll learn:

Achieve 9× faster CSV reading with Polars’ multi-threaded architecture
Reduce memory usage by 30-50% using Apache Arrow backend optimization
Enable lazy evaluation for automatic query optimization and predicate pushdown
Implement seamless migration from Pandas with similar API patterns
Scale data processing without changing code using built-in parallel execution

Why Consider Polars?
Polars is a DataFrame library written in Rust, designed for high performance. Key advantages:

Multi-threaded by default
Lazy execution for optimization
Low memory usage
Strong support for Arrow and Parquet

Let’s explore how Polars improves upon Pandas with practical examples.
Setup
Install both libraries and generate a sample dataset for testing:
pip install pandas polars

To follow along with the examples, you can generate a large CSV file like this:
import pandas as pd
import numpy as np

# Create a large dataset and write it to CSV
n_rows = 10_000_000
data = {
    "category": np.random.choice(["A", "B", "C", "D"], size=n_rows),
    "value": np.random.rand(n_rows) * 1000,
}
df = pd.DataFrame(data)
df.to_csv("large_file.csv", index=False)

This will create a large_file.csv with 10 million rows for performance testing.
Multi-Core Performance
Polars uses multiple CPU cores for data processing, while Pandas relies on a single core. Thus, Polars manages larger datasets more efficiently and often performs operations faster than Pandas.

Let’s benchmark a few common operations using Pandas and Polars: reading CSVs, groupby, filter, and sort.
Timing Decorator
We define a reusable timing decorator to simplify benchmarking in this section:
import time
from functools import wraps

def timeit(operation_name):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start_time = time.time()
            result = func(*args, **kwargs)
            end_time = time.time()
            execution_time = end_time - start_time
            print(f"{operation_name} took {execution_time:.2f} seconds")
            return result, execution_time
        return wrapper
    return decorator

Plotting Setup
Let’s define a helper function to simplify chart creation for each operation.
import matplotlib.pyplot as plt
import seaborn as sns

def create_comparison_plot(pandas_time, polars_time, title):
    sns.barplot(
        hue=["Pandas", "Polars"],
        y=[pandas_time, polars_time],
        palette=["#E583B6", "#72BEFA"],
    )
    plt.title(f"{title} (seconds)")
    plt.ylabel("Time (s)")
    plt.show()

Reading CSV Files
Pandas:
@timeit("Pandas read_csv")
def read_pandas():
    return pd.read_csv("large_file.csv")

pandas_df, pandas_read_time = read_pandas()

Output:
Pandas read_csv took 1.38 seconds

Polars:
import polars as pl

@timeit("Polars read_csv")
def read_polars():
    return pl.read_csv("large_file.csv")

polars_df, polars_read_time = read_polars()

Output:
Polars read_csv took 0.15 seconds

create_comparison_plot(pandas_read_time, polars_read_time, "CSV Read Time")

Groupby Mean
Pandas:
@timeit("Pandas groupby")
def pandas_groupby(df):
    return df.groupby("category")["value"].mean()

pandas_result, pandas_groupby_time = pandas_groupby(pandas_df)

Output:
Pandas groupby took 0.53 seconds

Polars:
@timeit("Polars groupby")
def polars_groupby(df):
    return df.group_by("category").agg(pl.col("value").mean())

polars_result, polars_groupby_time = polars_groupby(polars_df)

Output:
Polars groupby took 0.20 seconds

create_comparison_plot(pandas_groupby_time, polars_groupby_time, "Groupby Mean Time")

Filter Rows
Pandas:
@timeit("Pandas filter")
def pandas_filter(df):
    return df[df["value"] > 500]

pandas_filtered, pandas_filter_time = pandas_filter(pandas_df)

Output:
Pandas filter took 0.15 seconds

Polars:
@timeit("Polars filter")
def polars_filter(df):
    return df.filter(pl.col("value") > 500)

polars_filtered, polars_filter_time = polars_filter(polars_df)

Output:
Polars filter took 0.03 seconds

create_comparison_plot(pandas_filter_time, polars_filter_time, "Filter Rows Time")

Sort by Column
Pandas:
@timeit("Pandas sort")
def pandas_sort(df):
    return df.sort_values("value")

pandas_sorted, pandas_sort_time = pandas_sort(pandas_df)

Output:
Pandas sort took 2.74 seconds

Polars:
@timeit("Polars sort")
def polars_sort(df):
    return df.sort("value")

polars_sorted, polars_sort_time = polars_sort(polars_df)

Output:
Polars sort took 0.49 seconds

create_comparison_plot(pandas_sort_time, polars_sort_time, "Sort Time")

Observations
Polars consistently outperformed Pandas across all operations:

CSV Reading: 9× faster
Groupby: Over 2.5× faster
Filtering: 5× faster
Sorting: 5.5× faster

This performance advantage is largely due to Polars’ native Rust engine and multi-threaded execution. It’s especially valuable for large-scale data processing pipelines.
For SQL-based analytics that complement Polars’ performance advantages, see our DuckDB deep dive guide.
Lazy Evaluation (Only in Polars)
Pandas executes operations eagerly, which can be inefficient for large pipelines. For example:
(
    pd.read_csv("large_file.csv")
    .query("value > 100")
    .groupby("category", as_index=False)["value"]
    .mean()
    .rename(columns={"value": "avg_value"})
)

This code is inefficient because:

Full CSV Load: Pandas loads the entire file into memory, even rows or columns we won’t use.
No Predicate Pushdown: The .query() filter is applied after reading all rows, not during the read.
No Projection Pushdown: All columns are read from disk, even if we only use value and category.

Polars supports lazy evaluation, which plans and optimizes the entire computation graph before execution:
query = (
    pl.scan_csv("large_file.csv")
    .filter(pl.col("value") > 100)
    .group_by("category")
    .agg(pl.col("value").mean().alias("avg_value"))
)

print(query.explain())

Output:
AGGREGATE
    [col("value").mean().alias("avg_value")] BY [col("category")]
  FROM
  Csv SCAN large_file.csv
  PROJECT 2/2 COLUMNS
  SELECTION: [(col("value")) > (100.0)]

Optimizations used by Polars:

Lazy Execution: The computation is deferred and optimized before execution.
Predicate Pushdown: Filters like value > 100 are applied during the scan, skipping irrelevant rows.
Projection Pushdown: Only necessary columns ( category and value) are read.
Single-Pass Execution: Operations can be executed in a streaming manner, improving performance.
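Predicate pushdown, in particular, can be illustrated with a conceptual standard-library sketch (an illustration of the idea only, not Polars internals): by filtering while scanning, rejected rows are never materialized for downstream steps.

```python
rows = [
    {"category": "A", "value": 50},
    {"category": "A", "value": 150},
    {"category": "B", "value": 300},
    {"category": "B", "value": 80},
]

def scan(rows, predicate):
    # The filter runs during the scan itself, so rows failing the
    # predicate are skipped before any later operation sees them
    for row in rows:
        if predicate(row):
            yield row

kept = list(scan(rows, lambda r: r["value"] > 100))
print([r["value"] for r in kept])  # [150, 300]
```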

The pipeline is only executed when .collect() is called:
result = query.collect()
result.head()

Output:
shape: (4, 2)
┌──────────┬────────────┐
│ category ┆ avg_value  │
│ ---      ┆ ---        │
│ str      ┆ f64        │
╞══════════╪════════════╡
│ A        ┆ 550.245739 │
│ C        ┆ 549.985808 │
│ D        ┆ 549.649933 │
│ B        ┆ 550.055504 │
└──────────┴────────────┘

Syntax Comparison
Both libraries offer expressive APIs, but Polars’ syntax is more functional and chainable, which makes it easier to write pipelines with multiple operations in a single expression.
Filtering rows
Pandas:
df_pd.query("value > 1000")

Polars:
df_pl.filter(pl.col("value") > 1000)

Selecting columns
Pandas:
df_pd[["category", "value"]]

Polars:
df_pl.select(["category", "value"])

Polars feels more like writing SQL pipelines, especially in lazy mode.
Chained operations
Pandas:
(
    df_pd.query("value > 1000")
    .groupby("category")["value"]
    .mean()
    .reset_index()
)

Polars:
(
    df_pl.filter(pl.col("value") > 1000)
    .group_by("category")
    .agg(pl.col("value").mean().alias("avg_value"))
)

Polars lets you chain operations together, which often makes code more readable and efficient.
Memory Efficiency
Polars stores data in a compact, columnar format based on Apache Arrow. This design avoids the per-object Python overhead that Pandas incurs, particularly for string columns.
df_pd.memory_usage(deep=True).sum() / 1e6 # MB
df_pl.estimated_size() / 1e6 # MB

Polars usually consumes less memory, especially with strings or categorical data.
Missing Features (Where Pandas Wins)
While Polars is fast, Pandas is still better at:

Interoperability with other Python libraries (e.g., matplotlib, scikit-learn)
Rich ecosystem (e.g., pandas-profiling, pandasql)
More mature and flexible time series tools

Summary
When to Use Polars vs. Pandas

| Scenario | Use Polars ✅ | Use Pandas ✅ |
| --- | --- | --- |
| Large datasets and file sizes | ✅ Yes | ⚠️ May struggle with memory |
| Fast CSV/Parquet I/O | ✅ Much faster | 🚫 Slower for large files |
| Multi-core performance | ✅ Built-in multi-threading | 🚫 Single-threaded by default |
| Lazy evaluation for optimized pipelines | ✅ Supported | 🚫 Not available |
| Low memory consumption | ✅ Apache Arrow backend | 🚫 Higher memory overhead |
| Integration with scikit-learn, statsmodels, etc. | 🚫 Limited | ✅ Excellent ecosystem integration |
| Flexible time series handling | 🚫 Basic support | ✅ Advanced and robust |
| Small-to-medium datasets | ⚠️ May be overkill | ✅ Simple and effective |

For scaling beyond single-machine processing, see our PySpark Pandas API guide.
Final Thoughts
Polars is not here to replace Pandas, but it’s an excellent choice when performance matters. If your workflows are slow or memory-bound, Polars can drastically reduce execution time without compromising expressiveness.
Tip: You can even mix both libraries. Load and process large files with Polars, then convert to Pandas when needed:
df_pd = df_pl.to_pandas()

For modern Python development best practices that complement Polars’ performance focus, see our UV package manager guide.


