pandas vs Polars vs DuckDB: A Data Scientist’s Guide to Choosing the Right Tool

Introduction

pandas has been the standard tool for working with tabular data in Python for over a decade. But as datasets grow larger and performance requirements increase, two modern alternatives have emerged: Polars, a DataFrame library written in Rust, and DuckDB, an embedded SQL database optimized for analytics.

Each tool excels in different scenarios:

| Tool   | Backend  | Execution Model            | Best For                                         |
|--------|----------|----------------------------|--------------------------------------------------|
| pandas | C/Python | Eager, single-threaded     | Small datasets, prototyping, ML integration      |
| Polars | Rust     | Lazy/Eager, multi-threaded | Large-scale analytics, data pipelines            |
| DuckDB | C++      | SQL-first, streaming       | SQL workflows, embedded analytics, file queries  |

This guide compares all three tools with practical examples, helping you choose the right one for your workflow.

💻 Get the Code: The complete source code and Jupyter notebook for this tutorial are available on GitHub. Clone it to follow along!

Tool Strengths at a Glance

pandas

pandas is the original DataFrame library for Python that excels at interactive data exploration and integrates seamlessly with the ML ecosystem. Key capabilities include:

  • Direct compatibility with scikit-learn, statsmodels, and visualization libraries
  • Rich ecosystem of extensions (pandas-profiling, pandasql, etc.)
  • Mature time series functionality
  • Familiar syntax that most data scientists already know
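
For example, a pandas DataFrame drops straight into scikit-learn with no conversion step. Here is a minimal sketch (the feature and target columns are made up for illustration):

import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical features and target; any numeric DataFrame works the same way
df = pd.DataFrame({
    "feature1": [1.0, 2.0, 3.0, 4.0],
    "feature2": [4.0, 5.0, 7.0, 6.0],
    "target": [1.5, 2.9, 4.1, 5.2],
})

# scikit-learn estimators accept DataFrames directly
model = LinearRegression().fit(df[["feature1", "feature2"]], df["target"])
print(model.coef_)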

Polars

Polars is a Rust-powered DataFrame library designed for speed that brings multi-threaded execution and query optimization to Python. Key capabilities include:

  • Speeds up operations by using all available CPU cores by default
  • Builds a query plan first, then executes only what’s needed
  • Streaming mode for processing datasets larger than RAM
  • Expressive method chaining with a pandas-like API
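
Both of the first two points are easy to see in a few lines: check the thread pool, then build a lazy query that only executes on collect() (toy data for illustration):

import polars as pl

# Polars spreads work across this many threads by default
print(pl.thread_pool_size())

# Lazy mode: this only builds a plan; nothing runs until .collect()
lazy = (
    pl.LazyFrame({"x": [1, 2, 3], "y": [10, 20, 30]})
    .filter(pl.col("x") > 1)
    .select(pl.col("y").sum())
)
print(lazy.collect())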

DuckDB

DuckDB is an embedded SQL database optimized for analytics that brings database-level query optimization to local files. Key capabilities include:

  • Native SQL syntax with full analytical query support
  • Queries CSV, Parquet, and JSON files directly without loading
  • Uses disk storage automatically when data exceeds available memory
  • Zero-configuration embedded database requiring no server setup
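
For instance, DuckDB can aggregate a Parquet file or scan JSON in place, with no explicit load step (the file names below are hypothetical):

import duckdb

# DuckDB scans the files directly, reading only what the query needs
print(duckdb.sql("SELECT COUNT(*) FROM 'events.parquet'"))
print(duckdb.sql("SELECT * FROM read_json_auto('logs.json') LIMIT 5"))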

Setup

Install all three libraries:

pip install pandas polars duckdb

Generate sample data for benchmarking:

import pandas as pd
import numpy as np

np.random.seed(42)
n_rows = 5_000_000

data = {
    "category": np.random.choice(["Electronics", "Clothing", "Food", "Books"], size=n_rows),
    "region": np.random.choice(["North", "South", "East", "West"], size=n_rows),
    "amount": np.random.rand(n_rows) * 1000,
    "quantity": np.random.randint(1, 100, size=n_rows),
}

df_pandas = pd.DataFrame(data)
df_pandas.to_csv("sales_data.csv", index=False)
print(f"Created sales_data.csv with {n_rows:,} rows")
Created sales_data.csv with 5,000,000 rows

Syntax Comparison

All three tools can perform the same operations with different syntax. Here’s a side-by-side comparison of common tasks.

Filtering Rows

pandas:
Uses bracket notation with boolean conditions, which is concise but can become hard to read with complex conditions.

import pandas as pd

df_pd = pd.read_csv("sales_data.csv")
result_pd = df_pd[(df_pd["amount"] > 500) & (df_pd["category"] == "Electronics")]
result_pd.head()
category region amount quantity
7 Electronics West 662.803066 80
15 Electronics North 826.004963 25
30 Electronics North 766.081832 7
31 Electronics West 772.084261 36
37 Electronics East 527.967145 35

Polars:
Uses method chaining with pl.col() expressions, avoiding the repeated df["column"] references required by pandas.

import polars as pl

df_pl = pl.read_csv("sales_data.csv")
result_pl = df_pl.filter(
    (pl.col("amount") > 500) & (pl.col("category") == "Electronics")
)
result_pl.head()
category region amount quantity
str str f64 i64
"Electronics" "West" 662.803066 80
"Electronics" "North" 826.004963 25
"Electronics" "North" 766.081832 7
"Electronics" "West" 772.084261 36
"Electronics" "East" 527.967145 35

DuckDB:
Uses standard SQL with a WHERE clause, which reads naturally for anyone who knows SQL.

import duckdb

result_duckdb = duckdb.sql("""
    SELECT * FROM 'sales_data.csv'
    WHERE amount > 500 AND category = 'Electronics'
""").df()
result_duckdb.head()
category region amount quantity
0 Electronics West 662.803066 80
1 Electronics North 826.004963 25
2 Electronics North 766.081832 7
3 Electronics West 772.084261 36
4 Electronics East 527.967145 35

Selecting Columns

pandas:
Double brackets return a DataFrame with selected columns.

result_pd = df_pd[["category", "amount"]]
result_pd.head()
category amount
0 Food 516.653322
1 Books 937.337226
2 Electronics 450.941022
3 Food 674.488081
4 Food 188.847906

Polars:
The select() method clearly communicates column selection intent.

result_pl = df_pl.select(["category", "amount"])
result_pl.head()
category amount
str f64
"Food" 516.653322
"Books" 937.337226
"Electronics" 450.941022
"Food" 674.488081
"Food" 188.847906

DuckDB:
SQL’s SELECT clause makes column selection intuitive for SQL users.

result_duckdb = duckdb.sql("""
    SELECT category, amount FROM 'sales_data.csv'
""").df()
result_duckdb.head()
category amount
0 Food 516.653322
1 Books 937.337226
2 Electronics 450.941022
3 Food 674.488081
4 Food 188.847906

GroupBy Aggregation

pandas:
Uses a dictionary to specify aggregations, but returns multi-level column headers that often require flattening before further use.

result_pd = df_pd.groupby("category").agg({
    "amount": ["sum", "mean"],
    "quantity": "sum"
})
result_pd.head()
amount quantity
sum mean sum
Books 6.247506e+08 499.998897 62463285
Clothing 6.253924e+08 500.139837 62505224
Electronics 6.244453e+08 499.938189 62484265
Food 6.254034e+08 499.916417 62577943

Polars:
Uses explicit alias() calls for each aggregation, producing flat column names directly without post-processing.

result_pl = df_pl.group_by("category").agg([
    pl.col("amount").sum().alias("amount_sum"),
    pl.col("amount").mean().alias("amount_mean"),
    pl.col("quantity").sum().alias("quantity_sum"),
])
result_pl.head()
category amount_sum amount_mean quantity_sum
str f64 f64 i64
"Clothing" 6.2539e8 500.139837 62505224
"Books" 6.2475e8 499.998897 62463285
"Electronics" 6.2445e8 499.938189 62484265
"Food" 6.2540e8 499.916417 62577943

DuckDB:
Standard SQL aggregation with column aliases produces clean, flat output ready for downstream use.

result_duckdb = duckdb.sql("""
    SELECT
        category,
        SUM(amount) as amount_sum,
        AVG(amount) as amount_mean,
        SUM(quantity) as quantity_sum
    FROM 'sales_data.csv'
    GROUP BY category
""").df()
result_duckdb.head()
category amount_sum amount_mean quantity_sum
0 Food 6.254034e+08 499.916417 62577943.0
1 Electronics 6.244453e+08 499.938189 62484265.0
2 Clothing 6.253924e+08 500.139837 62505224.0
3 Books 6.247506e+08 499.998897 62463285.0

Adding Columns

pandas:
The assign() method creates new columns, but each expression must repeat the DataFrame reference (e.g., df_pd["amount"]).

result_pd = df_pd.assign(
    amount_with_tax=df_pd["amount"] * 1.1,
    high_value=df_pd["amount"] > 500
)
result_pd.head()
category region amount quantity amount_with_tax high_value
0 Food South 516.653322 40 568.318654 True
1 Books East 937.337226 45 1031.070948 True
2 Electronics North 450.941022 93 496.035124 False
3 Food East 674.488081 46 741.936889 True
4 Food East 188.847906 98 207.732697 False

Polars:
The with_columns() method uses composable expressions that chain naturally without repeating the DataFrame name.

result_pl = df_pl.with_columns([
    (pl.col("amount") * 1.1).alias("amount_with_tax"),
    (pl.col("amount") > 500).alias("high_value")
])
result_pl.head()
category region amount quantity amount_with_tax high_value
str str f64 i64 f64 bool
"Food" "South" 516.653322 40 568.318654 true
"Books" "East" 937.337226 45 1031.070948 true
"Electronics" "North" 450.941022 93 496.035124 false
"Food" "East" 674.488081 46 741.936889 true
"Food" "East" 188.847906 98 207.732697 false

DuckDB:
SQL’s SELECT clause defines new columns directly in the query, keeping transformations readable.

result_duckdb = duckdb.sql("""
    SELECT *,
        amount * 1.1 as amount_with_tax,
        amount > 500 as high_value
    FROM df_pd
""").df()
result_duckdb.head()
category region amount quantity amount_with_tax high_value
0 Food South 516.653322 40 568.318654 True
1 Books East 937.337226 45 1031.070948 True
2 Electronics North 450.941022 93 496.035124 False
3 Food East 674.488081 46 741.936889 True
4 Food East 188.847906 98 207.732697 False

Conditional Logic

pandas:
Requires np.where() for simple conditions or slow apply() for complex logic, which breaks method chaining.

import numpy as np

result_pd = df_pd.assign(
    value_tier=np.where(
        df_pd["amount"] > 700, "high",
        np.where(df_pd["amount"] > 300, "medium", "low")
    )
)
result_pd[["category", "amount", "value_tier"]].head()
category amount value_tier
0 Food 516.653322 medium
1 Books 937.337226 high
2 Electronics 450.941022 medium
3 Food 674.488081 medium
4 Food 188.847906 low

Polars:
The when().then().otherwise() chain is readable and integrates naturally with method chaining.

result_pl = df_pl.with_columns(
    pl.when(pl.col("amount") > 700).then(pl.lit("high"))
      .when(pl.col("amount") > 300).then(pl.lit("medium"))
      .otherwise(pl.lit("low"))
      .alias("value_tier")
)
result_pl.select(["category", "amount", "value_tier"]).head()
category amount value_tier
str f64 str
"Food" 516.653322 "medium"
"Books" 937.337226 "high"
"Electronics" 450.941022 "medium"
"Food" 674.488081 "medium"
"Food" 188.847906 "low"

DuckDB:
Standard SQL CASE WHEN syntax reads naturally for anyone who knows SQL.

result_duckdb = duckdb.sql("""
    SELECT category, amount,
        CASE
            WHEN amount > 700 THEN 'high'
            WHEN amount > 300 THEN 'medium'
            ELSE 'low'
        END as value_tier
    FROM df_pd
""").df()
result_duckdb.head()
category amount value_tier
0 Food 516.653322 medium
1 Books 937.337226 high
2 Electronics 450.941022 medium
3 Food 674.488081 medium
4 Food 188.847906 low

Window Functions

pandas:
Uses groupby().transform(), which requires repeating the groupby clause for each calculation.

result_pd = df_pd.assign(
    category_avg=df_pd.groupby("category")["amount"].transform("mean"),
    category_rank=df_pd.groupby("category")["amount"].rank(ascending=False)
)
result_pd[["category", "amount", "category_avg", "category_rank"]].head()
category amount category_avg category_rank
0 Food 516.653322 499.916417 604342.0
1 Books 937.337226 499.998897 78423.0
2 Electronics 450.941022 499.938189 685881.0
3 Food 674.488081 499.916417 407088.0
4 Food 188.847906 499.916417 1015211.0

Polars:
The over() expression attaches a partition to any calculation, avoiding repeated group definitions.

result_pl = df_pl.with_columns([
    pl.col("amount").mean().over("category").alias("category_avg"),
    pl.col("amount").rank(descending=True).over("category").alias("category_rank")
])
result_pl.select(["category", "amount", "category_avg", "category_rank"]).head()
category amount category_avg category_rank
str f64 f64 f64
"Food" 516.653322 499.916417 604342.0
"Books" 937.337226 499.998897 78423.0
"Electronics" 450.941022 499.938189 685881.0
"Food" 674.488081 499.916417 407088.0
"Food" 188.847906 499.916417 1015211.0

DuckDB:
SQL window functions with OVER (PARTITION BY ...) are the industry standard for this type of calculation.

result_duckdb = duckdb.sql("""
    SELECT category, amount,
        AVG(amount) OVER (PARTITION BY category) as category_avg,
        RANK() OVER (PARTITION BY category ORDER BY amount DESC) as category_rank
    FROM df_pd
""").df()
result_duckdb.head()
category amount category_avg category_rank
0 Clothing 513.807166 500.139837 608257
1 Clothing 513.806596 500.139837 608258
2 Clothing 513.806515 500.139837 608259
3 Clothing 513.806063 500.139837 608260
4 Clothing 513.806056 500.139837 608261

Data Loading Performance

pandas reads CSV files on a single CPU core. Polars and DuckDB use multi-threaded execution, distributing the work across all available cores to read different parts of the file simultaneously.

pandas

Single-threaded CSV parsing loads data sequentially.

┌─────────────────────────────────────────────┐
│ CPU Core 1                                  │
│ ┌─────────────────────────────────────────┐ │
│ │ Chunk 1 → Chunk 2 → Chunk 3 → ... → End │ │
│ └─────────────────────────────────────────┘ │
│ CPU Core 2  [idle]                          │
│ CPU Core 3  [idle]                          │
│ CPU Core 4  [idle]                          │
└─────────────────────────────────────────────┘
pandas_time = %timeit -o pd.read_csv("sales_data.csv")
1.05 s ± 26.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Polars

Multi-threaded parsing reads multiple chunks simultaneously.

┌─────────────────────────────────────────────┐
│ CPU Core 1  ┌────────────────┐              │
│             │ Chunk 1        │              │
│ CPU Core 2  ┌────────────────┐              │
│             │ Chunk 2        │              │
│ CPU Core 3  ┌────────────────┐              │
│             │ Chunk 3        │              │
│ CPU Core 4  ┌────────────────┐              │
│             │ Chunk 4        │              │
└─────────────────────────────────────────────┘
polars_time = %timeit -o pl.read_csv("sales_data.csv")
137 ms ± 34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

DuckDB

Vectorized execution processes data in batches across cores.

┌─────────────────────────────────────────────┐
│ CPU Core 1                                  │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐         │
│ │ Batch 1 │ │ Batch 2 │ │ Batch 3 │  ...    │
│ │  1024   │ │  1024   │ │  1024   │         │
│ │  rows   │ │  rows   │ │  rows   │         │
│ └─────────┘ └─────────┘ └─────────┘         │
└─────────────────────────────────────────────┘
duckdb_time = %timeit -o duckdb.sql("SELECT * FROM 'sales_data.csv'").df()
762 ms ± 77.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
print(f"Polars is {pandas_time.average / polars_time.average:.1f}× faster than pandas")
print(f"DuckDB is {pandas_time.average / duckdb_time.average:.1f}× faster than pandas")
Polars is 7.7× faster than pandas
DuckDB is 1.4× faster than pandas

While Polars leads with a 7.7× speedup in CSV reading, DuckDB’s 1.4× improvement shows parsing isn’t its focus. DuckDB shines when querying files directly or running complex analytical queries.

Query Optimization

pandas: No Optimization

pandas executes operations eagerly, creating intermediate DataFrames at each step. This wastes memory and prevents optimization.

┌─────────────────────────────────────────────────────────────┐
│ Step 1: Load ALL rows          → 5M rows in memory          │
│ Step 2: Filter (amount > 100)  → 4.5M rows in memory        │
│ Step 3: GroupBy                → New DataFrame              │
│ Step 4: Mean                   → Final result               │
└─────────────────────────────────────────────────────────────┘
Memory: ████████████████████████████████ (high - stores all intermediates)
def pandas_query():
    return (
        pd.read_csv("sales_data.csv")
        .query('amount > 100')
        .groupby('category')['amount']
        .mean()
    )

pandas_opt_time = %timeit -o pandas_query()
1.46 s ± 88.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This approach has three problems:

  • Full CSV load: All rows are read before filtering
  • No predicate pushdown: Rows are filtered after loading the entire file into memory
  • No projection pushdown: All columns are loaded, even unused ones

Polars: Lazy Evaluation

Polars supports lazy evaluation, which builds a query plan and optimizes it before execution:

┌─────────────────────────────────────────────────────────────┐
│ Query Plan Built:                                           │
│   scan_csv → filter → group_by → agg                        │
│                                                             │
│ Optimizations Applied:                                      │
│   • Predicate pushdown (filter during scan)                 │
│   • Projection pushdown (read only needed columns)          │
│   • Multi-threaded execution (parallel across CPU cores)    │
└─────────────────────────────────────────────────────────────┘
Memory: ████████ (low - no intermediate DataFrames)
query_pl = (
    pl.scan_csv("sales_data.csv")
    .filter(pl.col("amount") > 100)
    .group_by("category")
    .agg(pl.col("amount").mean().alias("avg_amount"))
)

# View the optimized query plan
print(query_pl.explain())
AGGREGATE[maintain_order: false]
  [col("amount").mean().alias("avg_amount")] BY [col("category")]
  FROM
  Csv SCAN [sales_data.csv] [id: 4687118704]
  PROJECT 2/4 COLUMNS
  SELECTION: [(col("amount")) > (100.0)]

The query plan shows these optimizations:

  • Predicate pushdown: SELECTION filters during scan, not after loading
  • Projection pushdown: PROJECT 2/4 COLUMNS reads only what’s needed
  • Operation reordering: Aggregate runs on filtered data, not the full dataset

Execute the optimized query:

def polars_query():
    return (
        pl.scan_csv("sales_data.csv")
        .filter(pl.col("amount") > 100)
        .group_by("category")
        .agg(pl.col("amount").mean().alias("avg_amount"))
        .collect()
    )

polars_opt_time = %timeit -o polars_query()
148 ms ± 32.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

DuckDB: SQL Optimizer

DuckDB’s SQL optimizer applies similar optimizations automatically:

┌─────────────────────────────────────────────────────────────┐
│ Query Plan Built:                                           │
│   SQL → Parser → Optimizer → Execution Plan                 │
│                                                             │
│ Optimizations Applied:                                      │
│   • Predicate pushdown (WHERE during scan)                  │
│   • Projection pushdown (SELECT only needed columns)        │
│   • Vectorized execution (process 1024 rows per batch)      │
└─────────────────────────────────────────────────────────────┘
Memory: ████████ (low - streaming execution)
def duckdb_query():
    return duckdb.sql("""
        SELECT category, AVG(amount) as avg_amount
        FROM 'sales_data.csv'
        WHERE amount > 100
        GROUP BY category
    """).df()

duckdb_opt_time = %timeit -o duckdb_query()
245 ms ± 12.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Let’s compare the performance of the optimized queries:

print(f"Polars is {pandas_opt_time.average / polars_opt_time.average:.1f}× faster than pandas")
print(f"DuckDB is {pandas_opt_time.average / duckdb_opt_time.average:.1f}× faster than pandas")
Polars is 9.9× faster than pandas
DuckDB is 6.0× faster than pandas

Polars outperforms DuckDB (9.9× vs 6.0×) in this benchmark because its Rust-based engine handles the filter-then-aggregate pattern efficiently. DuckDB’s strength lies in complex SQL queries with joins and subqueries.
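
As a sketch of that strength, here is a CTE feeding a window function, run directly against the same CSV — the kind of multi-step query that stays readable as SQL while DuckDB's optimizer plans the whole thing:

region_share = duckdb.sql("""
    WITH regional AS (
        SELECT category, region, SUM(amount) AS total
        FROM 'sales_data.csv'
        GROUP BY category, region
    )
    SELECT category, region, total,
           total / SUM(total) OVER (PARTITION BY category) AS region_share
    FROM regional
    ORDER BY category, region_share DESC
""").df()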

GroupBy Performance

Computing aggregates requires scanning every row, a workload that scales linearly with CPU cores. This makes groupby operations the clearest test of parallel execution.

Let’s load the data for the groupby benchmarks:

# Load data for fair comparison
df_pd = pd.read_csv("sales_data.csv")
df_pl = pl.read_csv("sales_data.csv")

pandas: Single-Threaded

pandas processes groupby operations on a single CPU core, which becomes a bottleneck on large datasets.

┌─────────────────────────────────────────────────────────────┐
│ CPU Core 1                                                  │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Group A → Group B → Group C → Group D → ... → Aggregate │ │
│ └─────────────────────────────────────────────────────────┘ │
│ CPU Core 2  [idle]                                          │
│ CPU Core 3  [idle]                                          │
│ CPU Core 4  [idle]                                          │
└─────────────────────────────────────────────────────────────┘
def pandas_groupby():
    return df_pd.groupby("category")["amount"].mean()

pandas_groupby_time = %timeit -o pandas_groupby()
271 ms ± 135 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Polars: Multi-Threaded

Polars partitions data across CPU cores, computes partial aggregates in parallel, then merges the results.

┌─────────────────────────────────────────────────────────────┐
│ CPU Core 1  ┌──────────────┐                                │
│             │ Group A, B   │ → Partial Aggregate            │
│ CPU Core 2  ┌──────────────┐                                │
│             │ Group C, D   │ → Partial Aggregate            │
│ CPU Core 3  ┌──────────────┐                                │
│             │ Group E, F   │ → Partial Aggregate            │
│ CPU Core 4  ┌──────────────┐                                │
│             │ Group G, H   │ → Partial Aggregate            │
│                      ↓                                      │
│             Final Merge → Result                            │
└─────────────────────────────────────────────────────────────┘
def polars_groupby():
    return df_pl.group_by("category").agg(pl.col("amount").mean())

polars_groupby_time = %timeit -o polars_groupby()
31.1 ms ± 3.65 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

DuckDB: Columnar Processing

DuckDB processes data in batches of 1024 rows, using vectorized execution to maximize throughput per CPU instruction. The diagram below shows the flow on a single core; in practice, DuckDB also spreads these batches across cores.

┌─────────────────────────────────────────────────────────────┐
│  Vectorized Aggregation (single core, 1024 rows per batch)  │
│                                                             │
│  CPU Core 1:                                                │
│  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐            │
│  │ Batch 1 │→│ Batch 2 │→│ Batch 3 │→│ Batch 4 │→ ...       │
│  │  1024   │ │  1024   │ │  1024   │ │  1024   │            │
│  │  rows   │ │  rows   │ │  rows   │ │  rows   │            │
│  └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘            │
│       └───────────┴───────────┴───────────┘                 │
│                        ↓                                    │
│              Hash Table Aggregation                         │
└─────────────────────────────────────────────────────────────┘
def duckdb_groupby():
    return duckdb.sql("""
        SELECT category, AVG(amount)
        FROM df_pd
        GROUP BY category
    """).df()

duckdb_groupby_time = %timeit -o duckdb_groupby()
29 ms ± 3.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
print(f"Polars is {pandas_groupby_time.average / polars_groupby_time.average:.1f}× faster than pandas")
print(f"DuckDB is {pandas_groupby_time.average / duckdb_groupby_time.average:.1f}× faster than pandas")
Polars is 8.7× faster than pandas
DuckDB is 9.4× faster than pandas

DuckDB and Polars perform similarly (9.4× vs 8.7×), both leveraging parallel execution. DuckDB’s slight edge comes from columnar storage, which groups all category values together for faster scanning.

Memory Efficiency

pandas: Full Memory Load

pandas loads the entire dataset into RAM:

┌─────────────────────────────────────────────────────────────┐
│  RAM                                                        │
│  ┌────────────────────────────────────────────────────────┐ │
│  │████████████████████████████████████████████████████████│ │
│  │██████████████████ ALL 10M ROWS ████████████████████████│ │
│  │████████████████████████████████████████████████████████│ │
│  └────────────────────────────────────────────────────────┘ │
│  Usage: 707,495 KB (entire dataset in memory)               │
└─────────────────────────────────────────────────────────────┘
df_pd_mem = pd.read_csv("sales_data.csv")
pandas_mem = df_pd_mem.memory_usage(deep=True).sum() / 1e3
print(f"pandas memory usage: {pandas_mem:,.0f} KB")
pandas memory usage: 707,495 KB

For larger-than-RAM datasets, pandas fails outright with an out-of-memory error.
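
The manual workaround is chunked reading, where you keep partial results yourself. It works for simple aggregates, but the bookkeeping is on you — a sketch for the per-category mean from the earlier benchmarks:

import pandas as pd

totals, counts = {}, {}
for chunk in pd.read_csv("sales_data.csv", chunksize=1_000_000):
    grouped = chunk.groupby("category")["amount"]
    # Accumulate partial sums and counts per category across chunks
    for category, total in grouped.sum().items():
        totals[category] = totals.get(category, 0.0) + total
    for category, count in grouped.count().items():
        counts[category] = counts.get(category, 0) + count

means = {category: totals[category] / counts[category] for category in totals}
print(means)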

Polars: Streaming Mode

Polars can process data in streaming mode, handling chunks without loading everything:

┌─────────────────────────────────────────────────────────────┐
│  RAM                                                        │
│  ┌────────────────────────────────────────────────────────┐ │
│  │█                                                       │ │
│  │                    (result only)                       │ │
│  │                                                        │ │
│  └────────────────────────────────────────────────────────┘ │
│  Usage: 0.06 KB (streams chunks, keeps only result)         │
└─────────────────────────────────────────────────────────────┘
result_pl_stream = (
    pl.scan_csv("sales_data.csv")
    .group_by("category")
    .agg(pl.col("amount").mean())
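    # streaming=True is deprecated in newer Polars releases, which use .collect(engine="streaming")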
    .collect(streaming=True)
)

polars_mem = result_pl_stream.estimated_size() / 1e3
print(f"Polars result memory: {polars_mem:.2f} KB")
Polars result memory: 0.06 KB

For larger-than-RAM files, use sink_parquet instead of collect(). It writes results directly to disk as chunks are processed, never holding the full dataset in memory:

(
    pl.scan_csv("sales_data.csv")
    .filter(pl.col("amount") > 500)
    .sink_parquet("filtered_sales.parquet")
)

DuckDB: Automatic Spill-to-Disk

DuckDB automatically writes intermediate results to temporary files when data exceeds available RAM:

┌─────────────────────────────────────────────────────────────┐
│  RAM                              Disk (if needed)          │
│  ┌──────────────────────────┐     ┌──────────────────────┐  │
│  │█                         │     │░░░░░░░░░░░░░░░░░░░░░░│  │
│  │     (up to 500MB)        │  →  │    (overflow here)   │  │
│  │                          │     │                      │  │
│  └──────────────────────────┘     └──────────────────────┘  │
│  Usage: 0.42 KB (spills to disk when RAM full)              │
└─────────────────────────────────────────────────────────────┘
# Configure memory limit and temp directory
duckdb.sql("SET memory_limit = '500MB'")
duckdb.sql("SET temp_directory = '/tmp/duckdb_temp'")

# DuckDB handles larger-than-RAM automatically
result_duckdb_mem = duckdb.sql("""
    SELECT category, AVG(amount) as avg_amount
    FROM 'sales_data.csv'
    GROUP BY category
""").df()

duckdb_mem = result_duckdb_mem.memory_usage(deep=True).sum() / 1e3
print(f"DuckDB result memory: {duckdb_mem:.2f} KB")
DuckDB result memory: 0.42 KB

DuckDB’s out-of-core processing makes it ideal for embedded analytics where memory is limited.

print(f"pandas: {pandas_mem:,.0f} KB (full dataset)")
print(f"Polars: {polars_mem:.2f} KB (result only)")
print(f"DuckDB: {duckdb_mem:.2f} KB (result only)")
print(f"\nPolars uses {pandas_mem / polars_mem:,.0f}× less memory than pandas")
print(f"DuckDB uses {pandas_mem / duckdb_mem:,.0f}× less memory than pandas")
pandas: 707,495 KB (full dataset)
Polars: 0.06 KB (result only)
DuckDB: 0.42 KB (result only)

Polars uses 11,791,583× less memory than pandas
DuckDB uses 1,684,512× less memory than pandas

The million-fold reduction comes from streaming: Polars and DuckDB process data in chunks and keep only the 4-row result in memory, while pandas must hold all 5 million rows to compute the same aggregation.

Join Operations

Joining tables is one of the most common operations in data analysis. Let’s compare how each tool handles a left join between 1 million orders and 100K customers.

Let’s create two tables for join benchmarking:

# Create orders table (1M rows)
orders_pd = pd.DataFrame({
    "order_id": range(1_000_000),
    "customer_id": np.random.randint(1, 100_000, size=1_000_000),
    "amount": np.random.rand(1_000_000) * 500
})

# Create customers table (100K rows)
customers_pd = pd.DataFrame({
    "customer_id": range(100_000),
    "region": np.random.choice(["North", "South", "East", "West"], size=100_000)
})

# Convert to Polars
orders_pl = pl.from_pandas(orders_pd)
customers_pl = pl.from_pandas(customers_pd)

pandas: Single-Threaded

pandas processes the join on a single CPU core.

┌─────────────────────────────────────────────┐
│ CPU Core 1                                  │
│ ┌─────────────────────────────────────────┐ │
│ │ Row 1 → Row 2 → Row 3 → ... → Row 1M    │ │
│ └─────────────────────────────────────────┘ │
│ CPU Core 2  [idle]                          │
│ CPU Core 3  [idle]                          │
│ CPU Core 4  [idle]                          │
└─────────────────────────────────────────────┘
def pandas_join():
    return orders_pd.merge(customers_pd, on="customer_id", how="left")

pandas_join_time = %timeit -o pandas_join()
60.4 ms ± 6.98 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Polars: Multi-Threaded

Polars distributes the join across all available CPU cores.

┌─────────────────────────────────────────────┐
│ CPU Core 1  ┌────────────────┐              │
│             │ Rows 1-250K    │              │
│ CPU Core 2  ┌────────────────┐              │
│             │ Rows 250K-500K │              │
│ CPU Core 3  ┌────────────────┐              │
│             │ Rows 500K-750K │              │
│ CPU Core 4  ┌────────────────┐              │
│             │ Rows 750K-1M   │              │
└─────────────────────────────────────────────┘
def polars_join():
    return orders_pl.join(customers_pl, on="customer_id", how="left")

polars_join_time = %timeit -o polars_join()
11.8 ms ± 6.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

DuckDB: Vectorized Execution

DuckDB processes rows in batches rather than one at a time.

┌─────────────────────────────────────────────┐
│ CPU Core 1                                  │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐         │
│ │ Batch 1 │ │ Batch 2 │ │ Batch 3 │  ...    │
│ │  1024   │ │  1024   │ │  1024   │         │
│ │  rows   │ │  rows   │ │  rows   │         │
│ └─────────┘ └─────────┘ └─────────┘         │
└─────────────────────────────────────────────┘
def duckdb_join():
    return duckdb.sql("""
        SELECT o.*, c.region
        FROM orders_pd o
        LEFT JOIN customers_pd c ON o.customer_id = c.customer_id
    """).df()

duckdb_join_time = %timeit -o duckdb_join()
55.7 ms ± 1.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Let’s compare the performance of the joins:

print(f"Polars is {pandas_join_time.average / polars_join_time.average:.1f}× faster than pandas")
print(f"DuckDB is {pandas_join_time.average / duckdb_join_time.average:.1f}× faster than pandas")
Polars is 5.1× faster than pandas
DuckDB is 1.1× faster than pandas

Polars’ multi-threaded join delivers a 5.1× speedup, significantly outperforming DuckDB’s 1.1× improvement. Parallelization matters more than vectorization for this join size.

Interoperability

All three tools work together seamlessly. Use each tool for what it does best in a single pipeline.

pandas DataFrame to DuckDB

Query pandas DataFrames directly with SQL:

df = pd.DataFrame({
    "product": ["A", "B", "C"],
    "sales": [100, 200, 150]
})

# DuckDB queries pandas DataFrames by variable name
result = duckdb.sql("SELECT * FROM df WHERE sales > 120").df()
print(result)
  product  sales
0       B    200
1       C    150

Polars to pandas

Convert Polars DataFrames when ML libraries require pandas:

df_polars = pl.DataFrame({
    "feature1": [1, 2, 3],
    "feature2": [4, 5, 6],
    "target": [0, 1, 0]
})

# Convert to pandas for scikit-learn
df_pandas = df_polars.to_pandas()
print(type(df_pandas))
<class 'pandas.core.frame.DataFrame'>

DuckDB to Polars

Get query results as Polars DataFrames:

result = duckdb.sql("""
    SELECT category, SUM(amount) as total
    FROM 'sales_data.csv'
    GROUP BY category
""").pl()

print(type(result))
print(result)
<class 'polars.dataframe.frame.DataFrame'>
shape: (4, 2)
┌─────────────┬──────────┐
│ category    ┆ total    │
│ ---         ┆ ---      │
│ str         ┆ f64      │
╞═════════════╪══════════╡
│ Electronics ┆ 6.2445e8 │
│ Food        ┆ 6.2540e8 │
│ Clothing    ┆ 6.2539e8 │
│ Books       ┆ 6.2475e8 │
└─────────────┴──────────┘
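
The reverse direction also works: DuckDB can query a Polars DataFrame in the local scope by its variable name, just as it does with pandas (this goes through Arrow, so pyarrow needs to be installed):

df_polars = pl.DataFrame({
    "product": ["A", "B", "C"],
    "sales": [100, 200, 150]
})

# DuckDB resolves df_polars from the surrounding Python scope
result = duckdb.sql("SELECT * FROM df_polars WHERE sales > 120").pl()
print(result)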

Combined Pipeline Example

Each tool has a distinct strength: DuckDB optimizes SQL queries, Polars parallelizes transformations, and pandas integrates with ML libraries. Combine them in a single pipeline to leverage all three:

# Step 1: DuckDB for initial SQL query
aggregated = duckdb.sql("""
    SELECT category, region,
           SUM(amount) as total_amount,
           COUNT(*) as order_count
    FROM 'sales_data.csv'
    GROUP BY category, region
""").pl()

# Step 2: Polars for additional transformations
enriched = (
    aggregated
    .with_columns([
        (pl.col("total_amount") / pl.col("order_count")).alias("avg_order_value"),
        pl.col("category").str.to_uppercase().alias("category_upper")
    ])
    .filter(pl.col("order_count") > 100000)
)

# Step 3: Convert to pandas for visualization or ML
final_df = enriched.to_pandas()
print(final_df.head())
category region  total_amount  order_count  avg_order_value category_upper
0      Food   East  1.563586e+08       312918       499.679004           FOOD
1      Food  North  1.563859e+08       312637       500.215456           FOOD
2  Clothing  North  1.560532e+08       311891       500.345286       CLOTHING
3  Clothing   East  1.565054e+08       312832       500.285907       CLOTHING
4      Food   West  1.560994e+08       312662       499.259318           FOOD

📖 Related: For writing functions that work across pandas, Polars, and PySpark without conversion, see Unified DataFrame Functions.

Decision Matrix

No single tool wins in every scenario. Use these tables to choose the right tool for your workflow.

Performance Summary

Benchmark results from 5 million rows on a single machine:

| Operation          | pandas | Polars              | DuckDB                  |
|--------------------|--------|---------------------|-------------------------|
| CSV Read (5M rows) | 1.05 s | 137 ms              | 762 ms                  |
| GroupBy            | 271 ms | 31 ms               | 29 ms                   |
| Join (1M rows)     | 60 ms  | 12 ms               | 56 ms                   |
| Memory Usage       | 707 MB | 0.06 KB (streaming) | 0.42 KB (spill-to-disk) |

Polars leads in CSV reading (7.7× faster than pandas) and joins (5× faster). DuckDB matches Polars in groupby performance and uses the least memory with automatic spill-to-disk.

Feature Comparison

Each tool makes different trade-offs between speed, memory, and ecosystem integration:

| Feature            | pandas    | Polars    | DuckDB        |
|--------------------|-----------|-----------|---------------|
| Multi-threading    | No        | Yes       | Yes           |
| Lazy evaluation    | No        | Yes       | N/A (SQL)     |
| Query optimization | No        | Yes       | Yes           |
| Larger-than-RAM    | No        | Streaming | Spill-to-disk |
| SQL interface      | No        | Limited   | Native        |
| ML integration     | Excellent | Good      | Limited       |

pandas lacks the performance features that make Polars and DuckDB fast, but remains essential for ML workflows. Choose between Polars and DuckDB based on whether you prefer DataFrame chaining or SQL syntax.

Recommendations

Match your use case to the right tool:

| Scenario                     | Recommendation                               |
|------------------------------|----------------------------------------------|
| Small data (<1M rows)        | pandas                                       |
| Large data (1M-100M rows)    | Polars or DuckDB                             |
| SQL-preferred workflow       | DuckDB                                       |
| DataFrame-preferred workflow | Polars                                       |
| Memory-constrained           | Polars (streaming) or DuckDB (spill-to-disk) |
| ML pipeline integration      | pandas (with Polars for preprocessing)       |
| Production data pipelines    | Polars                                       |

Data size is the primary decision factor. Under 1M rows, pandas’ simplicity wins. Above that, the 5-10× speedup from Polars or DuckDB justifies the switch.

Final Thoughts

If your code is written in pandas, you don’t need to rewrite everything. Migrate where it matters:

  • Profile first: Find which pandas operations are slow (see the timing sketch after this list)
  • Replace with Polars: CSV reads, groupbys, and joins see the biggest gains
  • Add DuckDB: When SQL is cleaner than chained DataFrame operations
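
A dependency-free way to do the profiling step: time each pandas stage with time.perf_counter() (cProfile or the %prun magic work too) and migrate only the slow ones. A sketch using this tutorial’s file:

import time
import pandas as pd

start = time.perf_counter()
df = pd.read_csv("sales_data.csv")
print(f"read_csv: {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
df.groupby("category")["amount"].mean()
print(f"groupby:  {time.perf_counter() - start:.2f}s")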

Keep pandas for final ML steps. Convert with df.to_pandas() when needed.
