What is DuckDB?
DuckDB is a fast, in-process SQL OLAP database optimized for analytics. Unlike traditional databases like PostgreSQL or MySQL that require server setup and maintenance, DuckDB runs directly in your Python process.
It’s perfect for data scientists because:
- Zero Configuration: No database server setup required
- Memory Efficiency: Out-of-core processing for datasets larger than RAM
- Familiar Interface: SQL syntax with shortcuts like GROUP BY ALL
- Performance: Columnar-vectorized engine faster than pandas
- Universal Access: Query files, cloud storage, and external databases
Installation
Install DuckDB with pip:
pip install duckdb
Let’s verify the installation:
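A quick sanity check (a minimal sketch, assuming nothing beyond the duckdb package itself):

```python
import duckdb

# Confirm the package imports and can execute a trivial query
print(duckdb.__version__)
print(duckdb.sql("SELECT 42 AS answer").df())
```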
Zero Configuration
SQL operations on DataFrames typically require setting up database servers. With pandas and PostgreSQL, you need to:
- Install and configure a database server
- Ensure the service is running
- Set up credentials and connections
- Write the DataFrame to a table first
DuckDB eliminates this overhead. Query DataFrames directly with SQL:
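Here is a minimal sketch; the sales DataFrame and its columns are illustrative stand-ins for whatever data you have on hand:

```python
import duckdb
import pandas as pd

# A hypothetical sales DataFrame living in the local Python scope
sales = pd.DataFrame({
    "item": ["apple", "banana", "cherry", "dates"],
    "amount": [120, 180, 95, 210],
})

# DuckDB resolves the name `sales` to the local variable and queries it in place
result = duckdb.sql("SELECT item, amount FROM sales WHERE amount > 100").df()
print(result)
```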
Notice how the query returns results instantly. There’s no connection string, no server startup time, and no authentication steps.
Try it
Edit the query to select items with quantity greater than 30 from the inventory DataFrame:
💡 Solution
```python
result = duckdb.sql("SELECT * FROM inventory WHERE quantity > 30").df()
```
Quiz
In the code above, how does DuckDB access the sales DataFrame?
Integrate Seamlessly with pandas and Polars
Have you ever wanted to leverage SQL’s power while working with your favorite data manipulation libraries such as pandas and Polars?
DuckDB makes it seamless to query pandas and Polars DataFrames via the duckdb.sql function.
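A minimal sketch, with pd_df and pl_df as illustrative DataFrames:

```python
import duckdb
import pandas as pd
import polars as pl

# One hypothetical DataFrame per library
pd_df = pd.DataFrame({"name": ["Alice", "Bob"], "score": [85, 92]})
pl_df = pl.DataFrame({"name": ["Carol", "Dave"], "score": [78, 88]})

# The same duckdb.sql call works on both
print(duckdb.sql("SELECT * FROM pd_df WHERE score > 90").df())  # back to pandas
print(duckdb.sql("SELECT * FROM pl_df WHERE score > 80").pl())  # back to Polars
```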
DuckDB recognized both pd_df (pandas) and pl_df (Polars) as DataFrame variables and queried them directly with SQL. No registration step was needed.
DuckDB’s integration with pandas and Polars lets you combine the strengths of each tool. For example, you can:
- Use pandas for data cleaning and feature engineering
- Use DuckDB for complex aggregations and queries
pandas makes feature engineering straightforward: extracting month from dates and creating is_high_value flags are common transformations for preparing data for analysis or machine learning.
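A sketch of what that preparation step might look like (the column names and values are illustrative):

```python
import pandas as pd

# Hypothetical raw orders data
df = pd.DataFrame({
    "order_date": pd.to_datetime(["2024-01-15", "2024-02-20", "2024-02-28"]),
    "category": ["Electronics", "Clothing", "Electronics"],
    "amount": [1200, 80, 650],
})

# pandas feature engineering: extract the month, flag high-value orders
df["month"] = df["order_date"].dt.month
df["is_high_value"] = df["amount"] > 500
```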
Now use DuckDB for complex aggregations:
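Continuing the sketch above, a single query can combine GROUP BY, AVG, and a conditional CASE WHEN:

```python
import duckdb

# Aggregate the engineered DataFrame directly with SQL
result = duckdb.sql("""
    SELECT
        category,
        AVG(amount) AS avg_amount,
        SUM(CASE WHEN is_high_value THEN amount ELSE 0 END) AS high_value_total
    FROM df
    GROUP BY category
""").df()
print(result)
```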
DuckDB excels at complex aggregations: combining GROUP BY, AVG, and conditional CASE WHEN in a single query is more readable and efficient than equivalent pandas code.
Try it
Edit the query to combine results from both df_2023 and df_2024 using UNION ALL:
💡 Solution
```python
result = duckdb.sql("SELECT * FROM df_2023 UNION ALL SELECT * FROM df_2024").df()
```
Quiz
What makes DuckDB’s approach to complex aggregations more readable than pandas?
Memory Efficiency
Pandas loads entire datasets into RAM before filtering, which can cause out-of-memory errors. DuckDB processes only the rows that match your filter, using a fraction of the memory. To see this in action, let’s compare both approaches on the same dataset.
First, create a sample CSV file:
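A sketch that generates data matching the numbers discussed below (1M rows, roughly 1% in the North region); the file name and columns are illustrative:

```python
import numpy as np
import pandas as pd

# Hypothetical 1M-row dataset; ~1% of rows fall in the 'North' region
n = 1_000_000
pd.DataFrame({
    "region": np.random.choice(
        ["North", "South", "East", "West"], size=n, p=[0.01, 0.33, 0.33, 0.33]
    ),
    "amount": np.random.randint(1, 1000, size=n),
}).to_csv("sales_data.csv", index=False)
```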
With pandas, filtering loads ALL records into RAM first:
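Assuming the sales_data.csv generated above:

```python
import pandas as pd

# pandas reads every row into RAM before the filter runs
df = pd.read_csv("sales_data.csv")       # all 1M rows in memory
north = df[df["region"] == "North"]      # ~10K rows survive the filter
```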
With DuckDB, only matching rows enter memory:
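The same filter in DuckDB, querying the CSV path directly:

```python
import duckdb

# DuckDB streams the CSV and materializes only the matching rows
north = duckdb.sql(
    "SELECT * FROM 'sales_data.csv' WHERE region = 'North'"
).df()
```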
The diagram below summarizes the memory difference:
RAM Usage
│
│ ████████████ Pandas (loads all 1M rows)
│
│ ██ DuckDB (streams, keeps 10K matches)
│
└──────────────────────────────────────────────
Quiz
What’s the key difference between how pandas and DuckDB handle the filter region = 'North'?
Out-of-Core Processing
Fast Performance
While pandas processes data sequentially row-by-row, DuckDB uses a columnar-vectorized execution engine that processes data in parallel chunks. The diagram below shows how each approach handles data:
Pandas                            DuckDB
  │                                 │
  ├─ Row 1 ──────> process          ├─ Chunk 1 (2048 rows) ─┐
  ├─ Row 2 ──────> process          ├─ Chunk 2 (2048 rows) ─┼─> process
  ├─ Row 3 ──────> process          ├─ Chunk 3 (2048 rows) ─┘
  ├─ Row 4 ──────> process          │
  │   ...                           │
  ▼                                 ▼
Sequential                    Parallel chunks
This architectural difference enables DuckDB to significantly outperform pandas, especially for computationally intensive operations like aggregations and joins.
Let’s compare the performance of pandas and DuckDB for aggregations on a million rows of data.
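A minimal benchmark sketch; the dataset is illustrative, and exact timings depend on hardware and library versions:

```python
import time

import duckdb
import numpy as np
import pandas as pd

# Hypothetical 1M-row dataset for a group-by aggregation
df = pd.DataFrame({
    "category": np.random.choice(list("ABCDE"), size=1_000_000),
    "value": np.random.rand(1_000_000),
})

start = time.perf_counter()
df.groupby("category")["value"].mean()
print(f"pandas: {time.perf_counter() - start:.4f}s")

start = time.perf_counter()
duckdb.sql("SELECT category, AVG(value) AS avg_value FROM df GROUP BY category").df()
print(f"DuckDB: {time.perf_counter() - start:.4f}s")
```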
DuckDB completes the same aggregation ~8x faster than pandas. The speedup comes from DuckDB’s columnar-vectorized execution engine processing data in parallel chunks.
This benchmark was run in native Python. Results may vary in browser-based environments.
Quiz
How does pandas process data differently from DuckDB?
FROM-First Syntax
Traditional SQL requires SELECT before FROM. This adds unnecessary boilerplate when you just want a quick look at your data:
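For example, peeking at the sales DataFrame from earlier the conventional way:

```python
result = duckdb.sql("SELECT * FROM sales").df()
```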
DuckDB lets you skip SELECT * entirely, making quick data exploration faster:
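The FROM-first equivalent:

```python
# FROM-first: SELECT * is implied
result = duckdb.sql("FROM sales").df()
```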
Notice the results are the same. This confirms that FROM table automatically selects all columns.
Try it
Write a FROM-first query to get all sales with amount > 150:
💡 Solution
```python
result = duckdb.sql("FROM sales WHERE amount > 150").df()
```
Quiz
What happens when you run FROM sales in DuckDB?
GROUP BY ALL
SELECT * EXCLUDE
SELECT * REPLACE
Streamlined File Reading
Query Cloud Storage
Automatic Parsing of CSV Files
Automatic Flattening of Nested Parquet Files
Automatic Flattening of Nested JSON Files
Reading Multiple Files
Hive Partitioned Datasets
Exporting Data
Creating Lists, Structs, and Maps
Manipulating Nested Data
Parameterized Queries
When working with databases, you often need to run similar queries with different parameters. For instance, you might want to filter a table using various criteria.
First, let’s create a sample products table:
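A sketch with hypothetical product names and prices:

```python
import duckdb

# Hypothetical products table in DuckDB's default in-memory database
duckdb.sql("""
    CREATE OR REPLACE TABLE products AS
    SELECT * FROM (VALUES
        ('Laptop',     1200),
        ('Monitor',     350),
        ('Headphones',  250),
        ('Keyboard',     80)
    ) AS t(name, price)
""")
```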
You might use f-strings to pass parameters to your queries:
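Something like this (it runs, but as explained below, don't do it):

```python
min_price = 400

# f-string interpolation splices the value straight into the SQL text
result = duckdb.sql(f"SELECT * FROM products WHERE price >= {min_price}").df()
print(result)
```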
While this works, f-strings are dangerous. A malicious user could:
- Input "0; DROP TABLE products; --" to delete your table
- Input "0 UNION SELECT * FROM secrets" to steal data
DuckDB provides a safer way with parameterized queries using the ? placeholder:
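A minimal sketch using duckdb.execute, which accepts the query and a tuple of parameter values:

```python
min_price = 400

# The value is bound to ? after the query is parsed, never spliced into the SQL text
result = duckdb.execute(
    "SELECT * FROM products WHERE price >= ?", (min_price,)
).df()
print(result)
```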
DuckDB binds 400 to the ? placeholder after the query has been parsed. Even if min_price contained malicious SQL, it would be treated as a literal value rather than executable code, protecting your database against injection attacks.
Try it
Use the ? placeholder to find products under $300:
💡 Solution
“python "SELECT * FROM products WHERE price < ?", (max_price,) “
The ? placeholder gets replaced with the value from the tuple. The trailing comma is required for single-element tuples.
Quiz
If a malicious user sets min_price = "0; DROP TABLE products", what happens with parameterized queries?