What is DuckDB?

DuckDB is a fast, in-process SQL OLAP database optimized for analytics. Unlike traditional databases like PostgreSQL or MySQL that require server setup and maintenance, DuckDB runs directly in your Python process.

It’s perfect for data scientists because:

  1. Zero Configuration: No database server setup required
  2. Memory Efficiency: Out-of-core processing for datasets larger than RAM
  3. Familiar Interface: SQL syntax with shortcuts like GROUP BY ALL
  4. Performance: Columnar-vectorized engine faster than pandas
  5. Universal Access: Query files, cloud storage, and external databases

Installation

Install DuckDB with pip:

pip install duckdb

Let’s verify the installation:

Python
Output
Loading Python…

Zero Configuration

SQL operations on DataFrames typically require setting up database servers. With pandas and PostgreSQL, you need to:

  • Install and configure a database server
  • Ensure the service is running
  • Set up credentials and connections
  • Write the DataFrame to a table first

DuckDB eliminates this overhead. Query DataFrames directly with SQL:

Python
Output
Loading Python…
💡 What you’ll notice

Notice how the query returns results instantly. There’s no connection string, no server startup time, and no authentication steps.

In the code above, how does DuckDB access the sales DataFrame?

Integrate Seamlessly with pandas and Polars

Have you ever wanted to leverage SQL’s power while working with your favorite data manipulation libraries such as pandas and Polars?

DuckDB makes it seamless to query pandas and Polars DataFrames via the duckdb.sql function.

Python
💡 What you’ll notice

DuckDB recognized both pd_df (pandas) and pl_df (Polars) as DataFrame variables and queried them directly with SQL. No imports or registration needed.

DuckDB’s integration with pandas and Polars lets you combine the strengths of each tool. For example, you can:

  • Use pandas for data cleaning and feature engineering
  • Use DuckDB for complex aggregations and complex queries
Python
Output
Loading Python…
💡 What you’ll notice

pandas makes feature engineering straightforward: extracting month from dates and creating is_high_value flags are common transformations for preparing data for analysis or machine learning.

Now use DuckDB for complex aggregations:

Python
Output
Loading Python…
💡 What you’ll notice

DuckDB excels at complex aggregations: combining GROUP BY, AVG, and conditional CASE WHEN in a single query is more readable and efficient than equivalent pandas code.

What makes DuckDB’s approach to complex aggregations more readable than pandas?

Memory Efficiency

Pandas loads entire datasets into RAM before filtering, which can cause out-of-memory errors. DuckDB processes only the rows that match your filter, using a fraction of the memory. To see this in action, let’s compare both approaches on the same dataset.

First, create a sample CSV file:

Python
Output
Loading Python…

With pandas, filtering loads ALL records into RAM first:

Python
Output
Loading Python…

With DuckDB, only matching rows enter memory:

Python
Output
Loading Python…

The diagram below summarizes the memory difference:

RAM Usage
│
│ ████████████  Pandas (loads all 1M rows)
│
│ ██            DuckDB (streams, keeps 10K matches)
│
└──────────────────────────────────────────────

What’s the key difference between how pandas and DuckDB handle the filter region = 'North'?

Out-of-Core Processing

Fast Performance

While pandas processes data sequentially row-by-row, DuckDB uses a columnar-vectorized execution engine that processes data in parallel chunks. The diagram below shows how each approach handles data:

Pandas                         DuckDB
  │                              │
  ├─ Row 1 ──────> process       ├─ Chunk 1 (2048 rows) ─┐
  ├─ Row 2 ──────> process       ├─ Chunk 2 (2048 rows) ─┼─> process
  ├─ Row 3 ──────> process       ├─ Chunk 3 (2048 rows) ─┘
  ├─ Row 4 ──────> process       │
  │   ...                        │
  ▼                              ▼
  Sequential                   Parallel chunks

This architectural difference enables DuckDB to significantly outperform pandas, especially for computationally intensive operations like aggregations and joins.

Let’s compare the performance of pandas and DuckDB for aggregations on a million rows of data.

Python
💡 What you’ll notice

DuckDB completes the same aggregation ~8x faster than pandas. The speedup comes from DuckDB’s columnar-vectorized execution engine processing data in parallel chunks.

📝 Note

The following benchmark was run on native Python. Results may vary in browser-based environments.

How does pandas process data differently from DuckDB?

FROM-First Syntax

GROUP BY ALL

SELECT * EXCLUDE

SELECT * REPLACE

Streamlined File Reading

Query Cloud Storage

Automatic Parsing of CSV Files

Automatic Flattening of Nested Parquet Files

Automatic Flattening of Nested JSON Files

Reading Multiple Files

Hive Partitioned Datasets

Exporting Data

Creating Lists, Structs, and Maps

Manipulating Nested Data

Parameterized Queries

ACID Transactions

Attach External Databases

Key Takeaways

0
    0
    Your Cart
    Your cart is empty
    Scroll to Top

    Work with Khuyen Tran

    Work with Khuyen Tran