What is DuckDB?

DuckDB is a fast, in-process SQL OLAP database optimized for analytics. Unlike traditional databases like PostgreSQL or MySQL that require server setup and maintenance, DuckDB runs directly in your Python process.

It’s perfect for data scientists because:

  1. Zero Configuration: No database server setup required
  2. Memory Efficiency: Out-of-core processing for datasets larger than RAM
  3. Familiar Interface: SQL syntax with shortcuts like GROUP BY ALL
  4. Performance: Columnar-vectorized engine faster than pandas
  5. Universal Access: Query files, cloud storage, and external databases

Installation

Install DuckDB with pip:

pip install duckdb

Let’s verify the installation:


Zero Configuration

SQL operations on DataFrames typically require setting up database servers. With pandas and PostgreSQL, you need to:

  • Install and configure a database server
  • Ensure the service is running
  • Set up credentials and connections
  • Write the DataFrame to a table first

DuckDB eliminates this overhead. Query DataFrames directly with SQL:

💡 What the output shows

Notice how the query returns results instantly. There’s no connection string, no server startup time, and no authentication steps.

Try it

Edit the query to select items with quantity greater than 30 from the inventory DataFrame:

💡 Solution

```python
result = duckdb.sql("SELECT * FROM inventory WHERE quantity > 30").df()
```

Quiz

In the code above, how does DuckDB access the sales DataFrame?

Integrate Seamlessly with pandas and Polars

Have you ever wanted to leverage SQL’s power while working with your favorite data manipulation libraries such as pandas and Polars?

DuckDB makes it seamless to query pandas and Polars DataFrames via the duckdb.sql function.

💡 What the output shows

DuckDB recognized both pd_df (pandas) and pl_df (Polars) as DataFrame variables and queried them directly with SQL. No imports or registration needed.

DuckDB’s integration with pandas and Polars lets you combine the strengths of each tool. For example, you can:

  • Use pandas for data cleaning and feature engineering
  • Use DuckDB for complex aggregations and joins
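The pandas step might look like this (the orders data and the 1000 threshold are illustrative):

```python
import pandas as pd

# Raw sales data with a date and an amount per order
df = pd.DataFrame({
    "order_date": pd.to_datetime(["2024-01-15", "2024-02-03", "2024-02-20"]),
    "amount": [500, 1500, 80],
})

# Feature engineering in pandas: extract the month, flag high-value orders
df["month"] = df["order_date"].dt.month
df["is_high_value"] = df["amount"] > 1000
print(df)
```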
💡 What the output shows

pandas makes feature engineering straightforward: extracting month from dates and creating is_high_value flags are common transformations for preparing data for analysis or machine learning.

Now use DuckDB for complex aggregations:

💡 What the output shows

DuckDB excels at complex aggregations: combining GROUP BY, AVG, and conditional CASE WHEN in a single query is more readable and efficient than equivalent pandas code.

Try it

Edit the query to combine results from both df_2023 and df_2024 using UNION ALL:

💡 Solution

```python
result = duckdb.sql("SELECT * FROM df_2023 UNION ALL SELECT * FROM df_2024").df()
```

Quiz

What makes DuckDB’s approach to complex aggregations more readable than pandas?

Memory Efficiency

Pandas loads entire datasets into RAM before filtering, which can cause out-of-memory errors. DuckDB processes only the rows that match your filter, using a fraction of the memory. To see this in action, let’s compare both approaches on the same dataset.

First, create a sample CSV file:

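One way to generate such a file (the column names, regions, and seed are illustrative):

```python
import numpy as np
import pandas as pd

# Generate one million rows with a region column and write them to CSV
rng = np.random.default_rng(42)
n = 1_000_000
pd.DataFrame({
    "region": rng.choice(["North", "South", "East", "West"], size=n),
    "amount": rng.integers(1, 1000, size=n),
}).to_csv("sales.csv", index=False)
```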

With pandas, filtering loads ALL records into RAM first:

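A sketch of the pandas approach (a small stand-in file is created inline so the snippet is self-contained):

```python
import pandas as pd

# Small stand-in for the sales.csv above; the real example uses 1M rows
pd.DataFrame({
    "region": ["North", "South", "North"],
    "amount": [1, 2, 3],
}).to_csv("sales.csv", index=False)

# pandas materializes every row in RAM first, then filters
df = pd.read_csv("sales.csv")
north = df[df["region"] == "North"]
print(len(df), len(north))
```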

With DuckDB, only matching rows enter memory:


The diagram below summarizes the memory difference:

RAM Usage
│
│ ████████████  Pandas (loads all 1M rows)
│
│ ██            DuckDB (streams, keeps 10K matches)
│
└──────────────────────────────────────────────

Quiz

What’s the key difference between how pandas and DuckDB handle the filter region = 'North'?

Out-of-Core Processing

Fast Performance

While pandas processes operations sequentially on a single thread, DuckDB uses a columnar-vectorized execution engine that processes data in parallel chunks. The diagram below shows how each approach handles data:

Pandas                         DuckDB
  │                              │
  ├─ Row 1 ──────> process       ├─ Chunk 1 (2048 rows) ─┐
  ├─ Row 2 ──────> process       ├─ Chunk 2 (2048 rows) ─┼─> process
  ├─ Row 3 ──────> process       ├─ Chunk 3 (2048 rows) ─┘
  ├─ Row 4 ──────> process       │
  │   ...                        │
  ▼                              ▼
  Sequential                   Parallel chunks

This architectural difference enables DuckDB to significantly outperform pandas, especially for computationally intensive operations like aggregations and joins.

Let’s compare the performance of pandas and DuckDB for aggregations on a million rows of data.

💡 What the output shows

DuckDB completes the same aggregation ~8x faster than pandas. The speedup comes from DuckDB’s columnar-vectorized execution engine processing data in parallel chunks.

📝 Note

The benchmark above was run on native Python. Results may vary in browser-based environments.

Quiz

How does pandas process data differently from DuckDB?

FROM-First Syntax

Traditional SQL requires SELECT before FROM. This adds unnecessary boilerplate when you just want a quick look at your data:


DuckDB lets you skip SELECT * entirely, making quick data exploration faster:

💡 What the output shows

Notice the results are the same. This confirms that FROM table automatically selects all columns.

Try it

Write a FROM-first query to get all sales with amount > 150:

💡 Solution

```python
result = duckdb.sql("FROM sales WHERE amount > 150").df()
```

Quiz

What happens when you run FROM sales in DuckDB?

GROUP BY ALL

SELECT * EXCLUDE

SELECT * REPLACE

Streamlined File Reading

Query Cloud Storage

Automatic Parsing of CSV Files

Automatic Flattening of Nested Parquet Files

Automatic Flattening of Nested JSON Files

Reading Multiple Files

Hive Partitioned Datasets

Exporting Data

Creating Lists, Structs, and Maps

Manipulating Nested Data

Parameterized Queries

When working with databases, you often need to run similar queries with different parameters. For instance, you might want to filter a table using various criteria.

First, let’s create a sample products table:


You might use f-strings to pass parameters to your queries:

⚠ Caution

While this works, f-strings are dangerous. A malicious user could:

  • Input "0; DROP TABLE products; --" to delete your table
  • Input "0 UNION SELECT * FROM secrets" to steal data

DuckDB provides a safer way with parameterized queries using the ? placeholder:

💡 What the output shows

DuckDB binds 400 to the ? placeholder separately from parsing. Even if min_price contained malicious SQL, it would be treated as a literal value, which protects the query from SQL injection.

Try it

Use the ? placeholder to find products under $300:

💡 Solution

```python
result = duckdb.execute("SELECT * FROM products WHERE price < ?", (max_price,)).df()
```

The ? placeholder gets replaced with the value from the tuple. The trailing comma is required for single-element tuples.

Quiz

If a malicious user sets min_price = "0; DROP TABLE products", what happens with parameterized queries?

ACID Transactions

Attach External Databases

Key Takeaways
