What is DuckDB?
DuckDB is a fast, in-process SQL OLAP database optimized for analytics. Unlike traditional databases like PostgreSQL or MySQL that require server setup and maintenance, DuckDB runs directly in your Python process.
It’s perfect for data scientists because:
- Zero Configuration: No database server setup required
- Memory Efficiency: Out-of-core processing for datasets larger than RAM
- Familiar Interface: SQL syntax with shortcuts like GROUP BY ALL
- Performance: Columnar-vectorized engine faster than pandas
- Universal Access: Query files, cloud storage, and external databases
Installation
Install DuckDB with pip:
pip install duckdb
Let’s verify the installation:
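A quick sanity check (a minimal sketch, assuming nothing beyond the duckdb package itself):

```python
import duckdb

# Confirm the package imports and can execute a trivial query
print(duckdb.__version__)
print(duckdb.sql("SELECT 42 AS answer").df())
```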
Zero Configuration
SQL operations on DataFrames typically require setting up database servers. With pandas and PostgreSQL, you need to:
- Install and configure a database server
- Ensure the service is running
- Set up credentials and connections
- Write the DataFrame to a table first
DuckDB eliminates this overhead. Query DataFrames directly with SQL:
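Here is a minimal sketch; the sales DataFrame and its columns are illustrative stand-ins for whatever data you have on hand:

```python
import duckdb
import pandas as pd

# A hypothetical sales DataFrame living in the local Python scope
sales = pd.DataFrame({
    "item": ["apple", "banana", "cherry", "dates"],
    "amount": [120, 180, 95, 210],
})

# DuckDB resolves the name `sales` to the local variable and queries it in place
result = duckdb.sql("SELECT item, amount FROM sales WHERE amount > 100").df()
print(result)
```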
Notice how the query returns results instantly. There’s no connection string, no server startup time, and no authentication steps.
Try it
Edit the query to select items with quantity greater than 30 from the inventory DataFrame:
💡 Solution
```python
result = duckdb.sql("SELECT * FROM inventory WHERE quantity > 30").df()
```
Quiz
In the code above, how does DuckDB access the sales DataFrame?
Integrate Seamlessly with pandas and Polars
Have you ever wanted to leverage SQL’s power while working with your favorite data manipulation libraries such as pandas and Polars?
DuckDB makes it seamless to query pandas and Polars DataFrames via the duckdb.sql function.
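A minimal sketch, with pd_df and pl_df as illustrative DataFrames:

```python
import duckdb
import pandas as pd
import polars as pl

# One hypothetical DataFrame per library
pd_df = pd.DataFrame({"name": ["Alice", "Bob"], "score": [85, 92]})
pl_df = pl.DataFrame({"name": ["Carol", "Dave"], "score": [78, 88]})

# The same duckdb.sql call works on both
print(duckdb.sql("SELECT * FROM pd_df WHERE score > 90").df())  # back to pandas
print(duckdb.sql("SELECT * FROM pl_df WHERE score > 80").pl())  # back to Polars
```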
DuckDB recognized both pd_df (pandas) and pl_df (Polars) as DataFrame variables and queried them directly with SQL. No registration step was needed.
DuckDB’s integration with pandas and Polars lets you combine the strengths of each tool. For example, you can:
- Use pandas for data cleaning and feature engineering
- Use DuckDB for complex aggregations and queries
pandas makes feature engineering straightforward: extracting month from dates and creating is_high_value flags are common transformations for preparing data for analysis or machine learning.
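A sketch of what that preparation step might look like (the column names and values are illustrative):

```python
import pandas as pd

# Hypothetical raw orders data
df = pd.DataFrame({
    "order_date": pd.to_datetime(["2024-01-15", "2024-02-20", "2024-02-28"]),
    "category": ["Electronics", "Clothing", "Electronics"],
    "amount": [1200, 80, 650],
})

# pandas feature engineering: extract the month, flag high-value orders
df["month"] = df["order_date"].dt.month
df["is_high_value"] = df["amount"] > 500
```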
Now use DuckDB for complex aggregations:
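Continuing the sketch above, a single query can combine GROUP BY, AVG, and a conditional CASE WHEN:

```python
import duckdb

# Aggregate the engineered DataFrame directly with SQL
result = duckdb.sql("""
    SELECT
        category,
        AVG(amount) AS avg_amount,
        SUM(CASE WHEN is_high_value THEN amount ELSE 0 END) AS high_value_total
    FROM df
    GROUP BY category
""").df()
print(result)
```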
DuckDB excels at complex aggregations: combining GROUP BY, AVG, and conditional CASE WHEN in a single query is more readable and efficient than equivalent pandas code.
Try it
Edit the query to combine results from both df_2023 and df_2024 using UNION ALL:
💡 Solution
```python
result = duckdb.sql("SELECT * FROM df_2023 UNION ALL SELECT * FROM df_2024").df()
```
Quiz
What makes DuckDB’s approach to complex aggregations more readable than pandas?
Memory Efficiency
Pandas loads entire datasets into RAM before filtering, which can cause out-of-memory errors. DuckDB processes only the rows that match your filter, using a fraction of the memory. To see this in action, let’s compare both approaches on the same dataset.
First, create a sample CSV file:
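A sketch that generates data matching the numbers discussed below (1M rows, roughly 1% in the North region); the file name and columns are illustrative:

```python
import numpy as np
import pandas as pd

# Hypothetical 1M-row dataset; ~1% of rows fall in the 'North' region
n = 1_000_000
pd.DataFrame({
    "region": np.random.choice(
        ["North", "South", "East", "West"], size=n, p=[0.01, 0.33, 0.33, 0.33]
    ),
    "amount": np.random.randint(1, 1000, size=n),
}).to_csv("sales_data.csv", index=False)
```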
With pandas, filtering loads ALL records into RAM first:
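Assuming the sales_data.csv generated above:

```python
import pandas as pd

# pandas reads every row into RAM before the filter runs
df = pd.read_csv("sales_data.csv")       # all 1M rows in memory
north = df[df["region"] == "North"]      # ~10K rows survive the filter
```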
With DuckDB, only matching rows enter memory:
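The same filter in DuckDB, querying the CSV path directly:

```python
import duckdb

# DuckDB streams the CSV and materializes only the matching rows
north = duckdb.sql(
    "SELECT * FROM 'sales_data.csv' WHERE region = 'North'"
).df()
```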
The diagram below summarizes the memory difference:
RAM Usage
│
│ ████████████ Pandas (loads all 1M rows)
│
│ ██ DuckDB (streams, keeps 10K matches)
│
└──────────────────────────────────────────────
Quiz
What’s the key difference between how pandas and DuckDB handle the filter region = 'North'?
Out-of-Core Processing
Fast Performance
While pandas processes data sequentially row-by-row, DuckDB uses a columnar-vectorized execution engine that processes data in parallel chunks. The diagram below shows how each approach handles data:
Pandas                            DuckDB
  │                                 │
  ├─ Row 1 ──────> process          ├─ Chunk 1 (2048 rows) ─┐
  ├─ Row 2 ──────> process          ├─ Chunk 2 (2048 rows) ─┼─> process
  ├─ Row 3 ──────> process          ├─ Chunk 3 (2048 rows) ─┘
  ├─ Row 4 ──────> process          │
  │   ...                           │
  ▼                                 ▼
Sequential                    Parallel chunks
This architectural difference enables DuckDB to significantly outperform pandas, especially for computationally intensive operations like aggregations and joins.
Let’s compare the performance of pandas and DuckDB for aggregations on a million rows of data.
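A minimal benchmark sketch; the dataset is illustrative, and exact timings depend on hardware and library versions:

```python
import time

import duckdb
import numpy as np
import pandas as pd

# Hypothetical 1M-row dataset for a group-by aggregation
df = pd.DataFrame({
    "category": np.random.choice(list("ABCDE"), size=1_000_000),
    "value": np.random.rand(1_000_000),
})

start = time.perf_counter()
df.groupby("category")["value"].mean()
print(f"pandas: {time.perf_counter() - start:.4f}s")

start = time.perf_counter()
duckdb.sql("SELECT category, AVG(value) AS avg_value FROM df GROUP BY category").df()
print(f"DuckDB: {time.perf_counter() - start:.4f}s")
```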
DuckDB completes the same aggregation ~8x faster than pandas. The speedup comes from DuckDB’s columnar-vectorized execution engine processing data in parallel chunks.
This benchmark was run in native Python. Results may vary in browser-based environments.
Quiz
How does pandas process data differently from DuckDB?
FROM-First Syntax
Traditional SQL requires SELECT before FROM. This adds unnecessary boilerplate when you just want a quick look at your data:
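For example, peeking at the sales DataFrame from earlier the conventional way:

```python
result = duckdb.sql("SELECT * FROM sales").df()
```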
DuckDB lets you skip SELECT * entirely, making quick data exploration faster:
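The FROM-first equivalent:

```python
# FROM-first: SELECT * is implied
result = duckdb.sql("FROM sales").df()
```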
Notice the results are the same. This confirms that FROM table automatically selects all columns.
Try it
Write a FROM-first query to get all sales with amount > 150:
💡 Solution
```python
result = duckdb.sql("FROM sales WHERE amount > 150").df()
```
Quiz
What happens when you run FROM sales in DuckDB?
GROUP BY ALL
SELECT * EXCLUDE
SELECT * REPLACE
Streamlined File Reading
Query Cloud Storage
Automatic Parsing of CSV Files
Automatic Flattening of Nested Parquet Files
Automatic Flattening of Nested JSON Files
Reading Multiple Files
Hive Partitioned Datasets
Exporting Data
Creating Lists, Structs, and Maps
Manipulating Nested Data
Parameterized Queries
When working with databases, you often need to run similar queries with different parameters. For instance, you might want to filter a table using various criteria.
First, let’s create a sample products table:
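A sketch with hypothetical product names and prices:

```python
import duckdb

# Hypothetical products table in DuckDB's default in-memory database
duckdb.sql("""
    CREATE OR REPLACE TABLE products AS
    SELECT * FROM (VALUES
        ('Laptop',     1200),
        ('Monitor',     350),
        ('Headphones',  250),
        ('Keyboard',     80)
    ) AS t(name, price)
""")
```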
You might use f-strings to pass parameters to your queries:
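Something like this (it runs, but as explained below, don't do it):

```python
min_price = 400

# f-string interpolation splices the value straight into the SQL text
result = duckdb.sql(f"SELECT * FROM products WHERE price >= {min_price}").df()
print(result)
```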
While this works, f-strings are dangerous. A malicious user could:
- Input "0; DROP TABLE products; --" to delete your table
- Input "0 UNION SELECT * FROM secrets" to steal data
DuckDB provides a safer way with parameterized queries using the ? placeholder:
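A minimal sketch using duckdb.execute, which accepts the query and a tuple of parameter values:

```python
min_price = 400

# The value is bound to ? after the query is parsed, never spliced into the SQL text
result = duckdb.execute(
    "SELECT * FROM products WHERE price >= ?", (min_price,)
).df()
print(result)
```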
DuckDB binds 400 to the ? placeholder after the query has been parsed. Even if min_price contained malicious SQL, it would be treated as a literal value rather than executable code, protecting your database against injection attacks.
Try it
Use the ? placeholder to find products under $300:
💡 Solution
“python "SELECT * FROM products WHERE price < ?", (max_price,) “
The ? placeholder gets replaced with the value from the tuple. The trailing comma is required for single-element tuples.
Quiz
If a malicious user sets min_price = "0; DROP TABLE products", what happens with parameterized queries?