What is DuckDB?
DuckDB is a fast, in-process SQL OLAP database optimized for analytics. Unlike traditional databases like PostgreSQL or MySQL that require server setup and maintenance, DuckDB runs directly in your Python process.
It’s perfect for data scientists because:
- Zero Configuration: No database server setup required
- Memory Efficiency: Out-of-core processing for datasets larger than RAM
- Familiar Interface: SQL syntax with shortcuts like
GROUP BY ALL - Performance: Columnar-vectorized engine faster than pandas
- Universal Access: Query files, cloud storage, and external databases
Installation
Install DuckDB with pip:
pip install duckdb
Let’s verify the installation:
Zero Configuration
SQL operations on DataFrames typically require setting up database servers. With pandas and PostgreSQL, you need to:
- Install and configure a database server
- Ensure the service is running
- Set up credentials and connections
- Write the DataFrame to a table first
DuckDB eliminates this overhead. Query DataFrames directly with SQL:
Notice how the query returns results instantly. There’s no connection string, no server startup time, and no authentication steps.
In the code above, how does DuckDB access the sales DataFrame?
Integrate Seamlessly with pandas and Polars
Have you ever wanted to leverage SQL’s power while working with your favorite data manipulation libraries such as pandas and Polars?
DuckDB makes it seamless to query pandas and Polars DataFrames via the duckdb.sql function.
DuckDB recognized both pd_df (pandas) and pl_df (Polars) as DataFrame variables and queried them directly with SQL. No imports or registration needed.
DuckDB’s integration with pandas and Polars lets you combine the strengths of each tool. For example, you can:
- Use pandas for data cleaning and feature engineering
- Use DuckDB for complex aggregations and complex queries
pandas makes feature engineering straightforward: extracting month from dates and creating is_high_value flags are common transformations for preparing data for analysis or machine learning.
Now use DuckDB for complex aggregations:
DuckDB excels at complex aggregations: combining GROUP BY, AVG, and conditional CASE WHEN in a single query is more readable and efficient than equivalent pandas code.
What makes DuckDB’s approach to complex aggregations more readable than pandas?
Memory Efficiency
Pandas loads entire datasets into RAM before filtering, which can cause out-of-memory errors. DuckDB processes only the rows that match your filter, using a fraction of the memory. To see this in action, let’s compare both approaches on the same dataset.
First, create a sample CSV file:
With pandas, filtering loads ALL records into RAM first:
With DuckDB, only matching rows enter memory:
The diagram below summarizes the memory difference:
RAM Usage
│
│ ████████████ Pandas (loads all 1M rows)
│
│ ██ DuckDB (streams, keeps 10K matches)
│
└──────────────────────────────────────────────
What’s the key difference between how pandas and DuckDB handle the filter region = 'North'?
Out-of-Core Processing
Fast Performance
While pandas processes data sequentially row-by-row, DuckDB uses a columnar-vectorized execution engine that processes data in parallel chunks. The diagram below shows how each approach handles data:
Pandas DuckDB
│ │
├─ Row 1 ──────> process ├─ Chunk 1 (2048 rows) ─┐
├─ Row 2 ──────> process ├─ Chunk 2 (2048 rows) ─┼─> process
├─ Row 3 ──────> process ├─ Chunk 3 (2048 rows) ─┘
├─ Row 4 ──────> process │
│ ... │
▼ ▼
Sequential Parallel chunks
This architectural difference enables DuckDB to significantly outperform pandas, especially for computationally intensive operations like aggregations and joins.
Let’s compare the performance of pandas and DuckDB for aggregations on a million rows of data.
DuckDB completes the same aggregation ~8x faster than pandas. The speedup comes from DuckDB’s columnar-vectorized execution engine processing data in parallel chunks.
The following benchmark was run on native Python. Results may vary in browser-based environments.
How does pandas process data differently from DuckDB?