What is DuckDB?
DuckDB is a fast, in-process SQL OLAP database optimized for analytics. Unlike traditional databases like PostgreSQL or MySQL that require server setup and maintenance, DuckDB runs directly in your Python process.
It’s perfect for data scientists because:
- Zero Configuration: No database server setup required
- Memory Efficiency: Out-of-core processing for datasets larger than RAM
- Familiar Interface: SQL syntax with shortcuts like GROUP BY ALL
- Performance: Columnar-vectorized engine faster than pandas
- Universal Access: Query files, cloud storage, and external databases
Installation
Install DuckDB with pip:
```bash
pip install duckdb
```

Let’s verify the installation:
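If the install succeeded, importing the package and running a throwaway query should just work:

```python
import duckdb

print(duckdb.__version__)                 # prints the installed version
duckdb.sql("SELECT 42 AS answer").show()  # a trivial sanity-check query
```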
Zero Configuration
SQL operations on DataFrames typically require setting up database servers. With pandas and PostgreSQL, you need to:
- Install and configure a database server
- Ensure the service is running
- Set up credentials and connections
- Write the DataFrame to a table first
DuckDB eliminates this overhead. Query DataFrames directly with SQL:
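Here’s a minimal sketch of what that looks like; the sales DataFrame and its columns are illustrative:

```python
import duckdb
import pandas as pd

# Hypothetical sales data
sales = pd.DataFrame({
    "product": ["A", "B", "C"],
    "amount": [100, 200, 150],
})

# DuckDB picks up the `sales` variable from the local scope
result = duckdb.sql("SELECT * FROM sales WHERE amount > 120").df()
print(result)
```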
Notice how the query returns results instantly. There’s no connection string, no server startup time, and no authentication steps.
Try it
Edit the query to select items with quantity greater than 30 from the inventory DataFrame:
💡 Solution
```python
result = duckdb.sql("SELECT * FROM inventory WHERE quantity > 30").df()
```
Quiz
In the code above, how does DuckDB access the sales DataFrame?
Integrate Seamlessly with pandas and Polars
Have you ever wanted to leverage SQL’s power while working with your favorite data manipulation libraries such as pandas and Polars?
DuckDB makes it seamless to query pandas and Polars DataFrames via the duckdb.sql function.
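A minimal sketch, assuming both pandas and Polars are installed (the pd_df and pl_df names match the discussion below):

```python
import duckdb
import pandas as pd
import polars as pl

pd_df = pd.DataFrame({"x": [1, 2, 3]})   # pandas DataFrame
pl_df = pl.DataFrame({"x": [4, 5, 6]})   # Polars DataFrame

# The same duckdb.sql call works on either one
print(duckdb.sql("SELECT SUM(x) AS total FROM pd_df").df())
print(duckdb.sql("SELECT SUM(x) AS total FROM pl_df").df())
```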
DuckDB recognized both pd_df (pandas) and pl_df (Polars) as DataFrame variables and queried them directly with SQL. No imports or registration needed.
DuckDB’s integration with pandas and Polars lets you combine the strengths of each tool. For example, you can:
- Use pandas for data cleaning and feature engineering
- Use DuckDB for complex aggregations and queries
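The pandas half of that workflow might look like this (illustrative data and column names):

```python
import pandas as pd

df = pd.DataFrame({
    "order_date": pd.to_datetime(["2024-01-15", "2024-02-20", "2024-03-05"]),
    "amount": [120.0, 45.0, 300.0],
})

# Feature engineering in pandas
df["month"] = df["order_date"].dt.month   # extract month from dates
df["is_high_value"] = df["amount"] > 100  # flag high-value orders
```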
pandas makes feature engineering straightforward: extracting month from dates and creating is_high_value flags are common transformations for preparing data for analysis or machine learning.
Now use DuckDB for complex aggregations:
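Continuing the sketch above, a DuckDB aggregation over the engineered DataFrame could be:

```python
import duckdb

result = duckdb.sql("""
    SELECT
        month,
        AVG(amount) AS avg_amount,
        COUNT(CASE WHEN is_high_value THEN 1 END) AS high_value_orders
    FROM df
    GROUP BY ALL
    ORDER BY month
""").df()
```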
DuckDB excels at complex aggregations: combining GROUP BY, AVG, and conditional CASE WHEN in a single query is more readable and efficient than equivalent pandas code.
Try it
Edit the query to combine results from both df_2023 and df_2024 using UNION ALL:
💡 Solution
```python
result = duckdb.sql("SELECT * FROM df_2023 UNION ALL SELECT * FROM df_2024").df()
```
Quiz
What makes DuckDB’s approach to complex aggregations more readable than pandas?
Memory Efficiency
Pandas loads entire datasets into RAM before filtering, which can cause out-of-memory errors. DuckDB processes only the rows that match your filter, using a fraction of the memory. To see this in action, let’s compare both approaches on the same dataset.
First, create a sample CSV file:
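One way to generate it (the file name, columns, and row counts are assumptions chosen to match the diagram below):

```python
import numpy as np
import pandas as pd

n = 1_000_000
pd.DataFrame({
    # Make 'North' rare (~1%) so the filter keeps roughly 10K rows
    "region": np.random.choice(
        ["North", "South", "East", "West"], n, p=[0.01, 0.33, 0.33, 0.33]
    ),
    "amount": np.random.rand(n) * 100,
}).to_csv("sales.csv", index=False)
```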
With pandas, filtering loads ALL records into RAM first:
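Something like:

```python
import pandas as pd

df = pd.read_csv("sales.csv")          # all 1M rows loaded into RAM
north = df[df["region"] == "North"]    # filtering happens after the full load
```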
With DuckDB, only matching rows enter memory:
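The DuckDB equivalent queries the file in place:

```python
import duckdb

north = duckdb.sql(
    "SELECT * FROM 'sales.csv' WHERE region = 'North'"
).df()  # only the matching rows are materialized
```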
The diagram below summarizes the memory difference:
```
RAM Usage
│
│ ████████████  Pandas (loads all 1M rows)
│
│ ██            DuckDB (streams, keeps 10K matches)
│
└──────────────────────────────────────────────
```

Quiz
What’s the key difference between how pandas and DuckDB handle the filter region = 'North'?
Out-of-Core Processing
When your dataset exceeds available RAM, pandas requires workarounds like chunking or Dask. DuckDB solves this with out-of-core processing:
- Breaks data into chunks that fit in memory
- Processes each chunk separately
- Stores intermediate results on disk
- Merges results into the final output
```
Dataset (50GB)          Memory (16GB)          Disk (temp files)
┌──────────┐            ┌──────────┐           ┌──────────────┐
│ Chunk 1  │ ─────────> │ Process  │ ────────> │ result_1.tmp │
│ Chunk 2  │            │ in RAM   │           │ result_2.tmp │
│ Chunk 3  │            └──────────┘           │ result_3.tmp │
│   ...    │                                   └──────┬───────┘
└──────────┘                                          │ merge
                                                      ▼
                                               ┌─────────────┐
                                               │ Final Result│
                                               └─────────────┘
```

The following example demonstrates this process. We’ll limit DuckDB to 10MB of memory and sort 5 million rows:
The query succeeds despite processing 5 million rows with only 10MB of memory. If you check your temp directory while the query runs, you’ll see temporary files being created. These files are automatically deleted once the query completes.
You can also configure where DuckDB stores temporary files:
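For example (the path is arbitrary):

```python
duckdb.sql("SET temp_directory = '/tmp/duckdb_spill'")
```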
Quiz
What does “out-of-core processing” mean?
When running the example above, what happens to the temporary files after the query completes?
Fast Performance
While pandas processes data sequentially row-by-row, DuckDB uses a columnar-vectorized execution engine that processes data in parallel chunks. The diagram below shows how each approach handles data:
```
Pandas                         DuckDB
│                              │
├─ Row 1 ──────> process       ├─ Chunk 1 (2048 rows) ─┐
├─ Row 2 ──────> process       ├─ Chunk 2 (2048 rows) ─┼─> process
├─ Row 3 ──────> process       ├─ Chunk 3 (2048 rows) ─┘
├─ Row 4 ──────> process       │
│  ...                         │
▼                              ▼
Sequential                     Parallel chunks
```

This architectural difference enables DuckDB to significantly outperform pandas, especially for computationally intensive operations like aggregations and joins.
Let’s compare the performance of pandas and DuckDB for aggregations on a million rows of data.
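A simple benchmark along these lines (exact timings depend on your machine):

```python
import time

import duckdb
import numpy as np
import pandas as pd

n = 1_000_000
df = pd.DataFrame({
    "category": np.random.choice(list("ABCDE"), n),
    "value": np.random.rand(n),
})

start = time.perf_counter()
df.groupby("category")["value"].mean()
print(f"pandas: {time.perf_counter() - start:.4f}s")

start = time.perf_counter()
duckdb.sql("SELECT category, AVG(value) FROM df GROUP BY ALL").df()
print(f"DuckDB: {time.perf_counter() - start:.4f}s")
```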
DuckDB completes the same aggregation ~8x faster than pandas. The speedup comes from DuckDB’s columnar-vectorized execution engine processing data in parallel chunks.
This benchmark was run in native Python. Results may vary in browser-based environments.
Quiz
How does pandas process data differently from DuckDB?
FROM-First Syntax
Traditional SQL requires SELECT before FROM. This adds unnecessary boilerplate when you just want a quick look at your data:
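For a quick peek at an assumed sales DataFrame, the traditional form is:

```python
result = duckdb.sql("SELECT * FROM sales").df()
```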
DuckDB lets you skip SELECT * entirely, making quick data exploration faster:
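The FROM-first version drops the boilerplate:

```python
result = duckdb.sql("FROM sales").df()  # equivalent to SELECT * FROM sales
```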
Notice the results are the same. This confirms that FROM table automatically selects all columns.
Try it
Write a FROM-first query to get all sales with amount > 150:
💡 Solution
```python
result = duckdb.sql("FROM sales WHERE amount > 150").df()
```
Quiz
What happens when you run FROM sales in DuckDB?
GROUP BY ALL
When using GROUP BY, you typically repeat every non-aggregated column:
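For example, with the sales data from earlier (product, region, and amount columns assumed):

```python
result = duckdb.sql("""
    SELECT product, region, SUM(amount) AS total
    FROM sales
    GROUP BY product, region
""").df()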
DuckDB infers grouping columns automatically with GROUP BY ALL:
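The same query, without repeating the column list:

```python
result = duckdb.sql("""
    SELECT product, region, SUM(amount) AS total
    FROM sales
    GROUP BY ALL
""").df()
```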
Both queries produce the same result. GROUP BY ALL automatically detects product and region as grouping columns, so you don’t have to list them twice.
Try it
Rewrite this query using GROUP BY ALL instead of listing columns:
💡 Solution
```python
result = duckdb.sql("""
    SELECT region, COUNT(*) as count, AVG(amount) as avg_amount
    FROM sales
    GROUP BY ALL
""").df()
```
Quiz
How does GROUP BY ALL determine which columns to group by?
SELECT * EXCLUDE
When you need all columns except a few, traditional SQL requires listing every column you want:
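Suppose a users table has id, name, email, and password columns (an assumption for this sketch). Skipping password the traditional way means spelling out everything else:

```python
result = duckdb.sql("SELECT id, name, email FROM users").df()
```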
DuckDB’s EXCLUDE lets you specify what to remove instead:
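With EXCLUDE:

```python
result = duckdb.sql("SELECT * EXCLUDE (password) FROM users").df()
```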
The results are identical. EXCLUDE is shorter to write and stays correct even if you add new columns to the table later.
You can also exclude multiple columns at once with a comma-separated list:
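For example:

```python
result = duckdb.sql("SELECT * EXCLUDE (password, email) FROM users").df()
```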
Try it
Write a query to get all columns except email from the users table:
💡 Solution
```python
result = duckdb.sql("SELECT * EXCLUDE (email) FROM users").df()
```
Quiz
What happens when you add a new column to a table queried with SELECT * EXCLUDE (password)?
SELECT * REPLACE
When you need to transform one column while keeping others, traditional SQL requires listing every column:
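Suppose a products table stores name, price_cents, and stock (assumed columns), and you want prices in dollars. The traditional form repeats every column:

```python
result = duckdb.sql("""
    SELECT name, price_cents / 100 AS price_cents, stock
    FROM products
""").df()
```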
DuckDB’s REPLACE lets you transform just the column you need:
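With REPLACE:

```python
result = duckdb.sql(
    "SELECT * REPLACE (price_cents / 100 AS price_cents) FROM products"
).df()
```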
The results are identical. REPLACE modified price_cents while automatically keeping name and stock unchanged.
Try it
Write a query to convert name to uppercase using REPLACE:
💡 Solution
```python
result = duckdb.sql("SELECT * REPLACE (UPPER(name) AS name) FROM products").df()
```
Quiz
What does SELECT * REPLACE (price / 100 AS price) do?
Streamlined File Reading
DuckDB can query files directly without loading them into memory first. It supports CSV, Parquet, and JSON formats, automatically detecting structure, delimiters, and column types.
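For example, counting rows per species in a publicly hosted copy of the iris dataset (the URL is one such copy; recent DuckDB versions autoload the httpfs extension for HTTP reads):

```python
import duckdb

result = duckdb.sql("""
    SELECT species, COUNT(*) AS count
    FROM 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv'
    GROUP BY ALL
""").df()
```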
The query runs directly on a URL without downloading the file first. DuckDB streams the data and returns aggregated counts per species.
Quiz
What’s the advantage of querying a remote file with DuckDB compared to pd.read_csv(url)?
Query Cloud Storage
DuckDB can query files directly from cloud storage providers like AWS S3, Google Cloud Storage, and Azure Blob Storage without downloading them first.
Query from S3:
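A sketch with a hypothetical bucket and key:

```python
result = duckdb.sql("""
    SELECT *
    FROM read_parquet('s3://my-bucket/sales/2024.parquet')
    LIMIT 5
""").df()
```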
Reading files from S3 works just like reading local files. Simply pass an S3 path to read_parquet.
For private buckets, add your credentials once and DuckDB handles authentication automatically.
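With DuckDB’s Secrets Manager (available in recent versions), that looks roughly like this; the credential values are placeholders:

```python
duckdb.sql("""
    CREATE SECRET my_s3 (
        TYPE S3,
        KEY_ID 'AKIA...',   -- placeholder access key
        SECRET '...',       -- placeholder secret key
        REGION 'us-east-1'
    )
""")
```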
The same approach works with other cloud providers.
Quiz
What makes querying cloud storage in DuckDB different from traditional approaches?
Automatic Parsing of CSV Files
When working with CSV files that have non-standard delimiters, pandas requires you to specify parameters like delimiter to avoid parsing errors.
With pandas, a pipe-delimited CSV is parsed incorrectly:
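First write a small pipe-delimited file (the path is arbitrary; the rows mirror the exercise below), then read it with pandas defaults:

```python
import pandas as pd

csv_content = """FlightDate|Carrier|Origin|Destination
2024-01-01|AA|NYC|LAX
2024-01-02|UA|SFO|ORD"""

with open("/tmp/flights.csv", "w") as f:
    f.write(csv_content)

df = pd.read_csv("/tmp/flights.csv")  # no delimiter specified
print(df.columns.tolist())            # one mangled column instead of four
```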
Pandas created a single column containing the entire pipe-separated row as one string. Without specifying delimiter='|', it defaulted to comma separation.
With DuckDB, the delimiter is auto-detected:
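Point read_csv at the same file, with no extra options:

```python
import duckdb

result = duckdb.sql("SELECT * FROM read_csv('/tmp/flights.csv')").df()
print(result)
```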
DuckDB correctly parsed 4 separate columns (FlightDate, Carrier, Origin, Destination) by auto-detecting the pipe delimiter. No configuration needed.
Try it
Change the delimiter from | to ; and run the query. Does DuckDB still parse it correctly?
💡 Solution
Change | to ; in the csv_content string:
```python
csv_content = """FlightDate;Carrier;Origin;Destination
2024-01-01;AA;NYC;LAX
2024-01-02;UA;SFO;ORD"""
```
DuckDB auto-detects semicolons, tabs, pipes, and other common delimiters.
Quiz
How did DuckDB correctly parse the pipe-delimited file?
Automatic Flattening of Nested Parquet Files
When working with large, nested Parquet files, you typically need to pre-process the data to flatten nested structures or write complex extraction scripts, which adds time and complexity to your workflow.
With pandas, you need to manually flatten nested structures:
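To make this concrete, first create a Parquet file with a nested details struct (the path and values are illustrative; DuckDB itself is a convenient way to write it), then read it with pandas:

```python
import duckdb
import pandas as pd

duckdb.sql("""
    COPY (
        SELECT * FROM (VALUES
            (1, {'name': 'Alice', 'age': 30}),
            (2, {'name': 'Bob', 'age': 25})
        ) AS t(id, details)
    ) TO '/tmp/users.parquet' (FORMAT PARQUET)
""")

df = pd.read_parquet("/tmp/users.parquet")
# details holds dicts; pulling out a field takes apply() or a comprehension
df["name"] = df["details"].apply(lambda d: d["name"])
```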
The details column contains dictionaries. To access name or age, you’d need to manually flatten with list comprehensions or apply().
With DuckDB, query nested fields directly using dot notation:
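The same file, queried directly:

```python
result = duckdb.sql("""
    SELECT id, details.name AS name, details.age AS age
    FROM read_parquet('/tmp/users.parquet')
""").df()
```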
DuckDB’s dot notation (details.name, details.age) extracts nested fields directly in SQL. No manual flattening, list comprehensions, or apply() needed.
Try it
Fill in the dot notation to extract color and weight from the specs struct:
💡 Solution
```python
result = duckdb.sql("""
    SELECT name, specs.color AS color, specs.weight AS weight
    FROM read_parquet('/tmp/products.parquet')
""").df()
```
Quiz
How does DuckDB access nested fields in Parquet files?
Automatic Flattening of Nested JSON Files
When working with JSON files that have nested structures, you need to normalize the data with pandas to access nested fields.
With pandas, you need json_normalize() to flatten nested structures:
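A sketch with a small nested JSON file (the path and fields are illustrative, matching the discussion below):

```python
import json

import pandas as pd

records = [
    {"id": 1, "profile": {"name": "Alice", "active": True}},
    {"id": 2, "profile": {"name": "Bob", "active": False}},
]
with open("/tmp/users.json", "w") as f:
    json.dump(records, f)

df = pd.json_normalize(records)
print(df.columns.tolist())  # ['id', 'profile.name', 'profile.active']
```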
Flattening nested JSON required importing json_normalize(), calling it on your data, then likely renaming the dot-notation columns (profile.name, profile.active) to something cleaner.
With DuckDB, you can query each nested field directly with the syntax field_name.nested_field_name:
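The DuckDB version of the same extraction:

```python
import duckdb

result = duckdb.sql("""
    SELECT id, profile.name AS name, profile.active AS active
    FROM read_json('/tmp/users.json')
""").df()
```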
DuckDB queries nested JSON fields with the same dot notation as Parquet. No json_normalize() step needed before analysis.
Quiz
In the query above, what does profile.name AS name do?
Reading Multiple Files
Reading Multiple Files from a Directory
Reading multiple files from a directory is common in data pipelines.
First, let’s create some sample CSV files to work with:
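For instance (file names and contents are illustrative):

```python
import os

import pandas as pd

os.makedirs("data", exist_ok=True)
pd.DataFrame({"month": ["Jan"] * 3, "amount": [100, 200, 150]}) \
    .to_csv("data/sales_january.csv", index=False)
pd.DataFrame({"month": ["Feb"] * 3, "amount": [120, 180, 210]}) \
    .to_csv("data/sales_february.csv", index=False)
```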
With pandas, you need to read each file separately then concatenate:
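Something like:

```python
df_jan = pd.read_csv("data/sales_january.csv")
df_feb = pd.read_csv("data/sales_february.csv")
combined = pd.concat([df_jan, df_feb], ignore_index=True)
```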
Pandas required reading each file into separate DataFrames, then concatenating them. With more files, this approach becomes tedious and error-prone.
With DuckDB, use glob patterns to read all files at once:
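One query covers every file:

```python
import duckdb

result = duckdb.sql("SELECT * FROM 'data/*.csv'").df()
```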
Both January and February data appear in a single result from one query. The *.csv glob pattern matched all CSV files in the directory automatically.
Quiz
Why is the glob pattern useful as your data grows?
Read From Multiple Sources
DuckDB allows you to read from multiple sources in a single query, which makes combining data from different places straightforward.
You can mix DataFrames, CSV files, Parquet files, and JSON files seamlessly.
First, let’s create two related DataFrames:
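For example:

```python
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
})
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "amount": [100, 250, 80, 40],
})
```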
Now use DuckDB to join and aggregate these DataFrames with a single SQL query:
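A sketch:

```python
import duckdb

result = duckdb.sql("""
    SELECT c.name, COUNT(*) AS order_count, SUM(o.amount) AS total_spent
    FROM customers c
    JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY ALL
""").df()
```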
The query joined two separate DataFrames (customers and orders) with standard SQL syntax. No need to merge DataFrames in pandas first.
Quiz
What pandas operation does DuckDB’s JOIN replace?
Hive Partitioned Datasets
When working with large datasets, it’s common to partition data into separate files by date, region, or other columns. A typical approach is organizing files into folders:
```
data/
├── 2023/
│   ├── 01/data.parquet
│   └── 02/data.parquet
└── 2024/
    └── 01/data.parquet
```

With this structure, you lose the partition information when reading. You’d need to extract year and month from file paths manually.
Hive-style partitioning solves this by encoding column values in directory names:
```
data/
├── year=2023/
│   ├── month=01/data.parquet
│   └── month=02/data.parquet
└── year=2024/
    └── month=01/data.parquet
```

DuckDB can both write and read Hive-partitioned datasets automatically.
Writing partitioned data:
Use PARTITION_BY to create a Hive-partitioned dataset:
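A sketch with a few hand-written rows (the target directory name is arbitrary):

```python
import duckdb

duckdb.sql("""
    COPY (
        SELECT * FROM (VALUES
            (2023, 'Q1', 100),
            (2023, 'Q2', 150),
            (2024, 'Q1', 200)
        ) AS t(year, quarter, amount)
    ) TO 'data_partitioned' (FORMAT PARQUET, PARTITION_BY (year, quarter))
""")
```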
Verify the directory structure:
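A quick way to list what was written:

```python
import os

for root, _dirs, files in os.walk("data_partitioned"):
    for name in files:
        print(os.path.join(root, name))
# e.g. data_partitioned/year=2023/quarter=Q1/data_0.parquet
```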
DuckDB created directories like year=2023/quarter=Q1/. Each partition folder contains only rows matching those values.
Quiz
What does PARTITION_BY (year, quarter) do when writing data?
Reading partitioned data:
Now read the data back. DuckDB auto-detects the column=value folder pattern:
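For the dataset written above:

```python
result = duckdb.sql("""
    SELECT *
    FROM read_parquet('data_partitioned/*/*/*.parquet', hive_partitioning = true)
""").df()
```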
Notice year and quarter columns are included in the results. DuckDB extracted these from the directory names without any extra code.
Quiz
Why partition large datasets into folders like year=2023/quarter=Q1/?
Exporting Data
DuckDB can export query results directly to CSV, Parquet, and JSON files without converting to pandas first. This avoids memory overhead for large datasets.
Export to CSV:
Use .write_csv() to save query results directly to a CSV file:
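For example, assuming the sales data from earlier:

```python
duckdb.sql("SELECT * FROM sales WHERE amount > 100").write_csv("high_value.csv")
```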
Export to Parquet:
Use .write_parquet() for columnar storage with optional compression:
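For example:

```python
duckdb.sql("SELECT * FROM sales").write_parquet("sales.parquet", compression="zstd")
```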
Both .write_csv() and .write_parquet() export query results directly. No pandas conversion needed.
Using COPY statement:
The COPY statement provides more control over export options:
- DELIMITER '|' – Custom delimiters (comma, pipe, tab)
- HEADER true – Include or exclude header row
- COMPRESSION 'gzip' – Compress output (gzip, zstd, snappy)
- DATEFORMAT '%Y-%m-%d' – Custom date formatting
- NULL 'NA' – Custom null value representation
Here’s an example using a custom delimiter and header option together.
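A sketch, again assuming the sales data:

```python
duckdb.sql("""
    COPY (SELECT * FROM sales)
    TO 'sales_pipe.csv' (FORMAT CSV, DELIMITER '|', HEADER true)
""")
```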
The output uses pipe (|) delimiters instead of commas, as specified by DELIMITER '|'.
Quiz
What’s the advantage of .write_parquet() over converting to pandas first?
Creating Lists, Structs, and Maps
DuckDB supports rich nested data types that go beyond traditional SQL databases. You can create and manipulate LISTs, STRUCTs, and MAPs directly in SQL.
LIST – ordered collection of values:
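For example:

```python
import duckdb

result = duckdb.sql("""
    SELECT [1, 2, 3] AS numbers, ['a', 'b', 'c'] AS letters
""").df()
```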
Unlike traditional SQL arrays, DuckDB lists use familiar Python-style square brackets.
Quiz
What type of values can a DuckDB list contain?
STRUCT – named fields (like a dictionary):
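A sketch combining two struct rows:

```python
result = duckdb.sql("""
    SELECT {'name': 'Alice', 'age': 30} AS person
    UNION ALL
    SELECT {'name': 'Bob', 'age': 25}
""").df()
```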
Structs being combined (for example with UNION ALL) must share the same field names; DuckDB throws an error if the fields don’t match.
Both rows have identical fields: name and age. Trying to UNION structs with different field names would throw an error.
MAP – key-value pairs:
Use maps when each row needs different keys. No schema enforcement.
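For example:

```python
result = duckdb.sql("""
    SELECT 'Alice' AS name, MAP {'steps': 8000} AS metrics
    UNION ALL
    SELECT 'Bob', MAP {'calories': 2000}
""").df()
```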
Alice has only steps, Bob has only calories. Maps allow different keys per row without errors.
Quiz
You’re storing product attributes where laptops have RAM and shirts have size. Which type should you use?
Manipulating Nested Data
In traditional SQL, working with nested data requires complex joins or custom functions. DuckDB provides native operations: list indexing, Python-style comprehensions, dot notation for structs, and UNNEST to flatten lists.
Access list elements:
You can access elements using 1-based indexing, negative indices for the end, and slicing.
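All three in one query:

```python
result = duckdb.sql("""
    SELECT
        [10, 20, 30, 40, 50][1]   AS first_element,  -- 1-based: 10
        [10, 20, 30, 40, 50][-1]  AS last_element,   -- from the end: 50
        [10, 20, 30, 40, 50][2:4] AS middle          -- inclusive slice: [20, 30, 40]
""").df()
```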
Unlike Python’s 0-based indexing, DuckDB lists start at 1. The slice [2:4] includes both endpoints.
Try it
Write the slice to extract [20, 30, 40] from the list:
💡 Solution
```sql
SELECT [10, 20, 30, 40, 50][2:4] AS middle_three
```
The slice [2:4] gets elements from index 2 to 4 (inclusive), which are 20, 30, and 40.
Transform lists with list comprehensions:
You can transform lists directly in SQL using Python-style comprehensions.
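For example:

```python
result = duckdb.sql("""
    SELECT
        [x * 2 FOR x IN [1, 2, 3, 4]]      AS doubled,
        [x FOR x IN [1, 2, 3, 4] IF x > 2] AS filtered
""").df()
```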
The doubled column multiplies each element by 2, while filtered keeps only values greater than 2. No Python code needed.
Quiz
How do you filter a list to keep only values greater than 5?
Access struct fields:
DuckDB lets you access struct fields using dot notation, just like object properties.
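For example:

```python
result = duckdb.sql("""
    SELECT student.name, student.score
    FROM (SELECT {'name': 'Bob', 'score': 95} AS student)
""").df()
```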
Dot notation extracts Bob from student.name and 95 from student.score. No need to parse JSON or use special functions.
Quiz
What does {'name': 'Alice', 'score': 90}.score return?
Unnest lists to rows:
The UNNEST function expands list elements into individual rows for row-by-row analysis.
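A sketch with two rows of tags:

```python
result = duckdb.sql("""
    SELECT id, UNNEST(tags) AS tag
    FROM (VALUES
        (1, ['sql', 'duckdb']),
        (2, ['python', 'pandas', 'polars'])
    ) AS t(id, tags)
""").df()
```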
Each tag from the original lists becomes its own row. ID 1 had 2 tags, so it appears twice. ID 2 had 3 tags, so it appears three times.
Quiz
How many rows does SELECT UNNEST([1, 2, 3]) return?
Parameterized Queries
When working with databases, you often need to run similar queries with different parameters. For instance, you might want to filter a table using various criteria.
First, let’s create a sample products table:
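For example (names and prices are illustrative):

```python
import duckdb

duckdb.sql("CREATE TABLE products (name VARCHAR, price DOUBLE)")
duckdb.sql("""
    INSERT INTO products VALUES
        ('Laptop', 1200), ('Phone', 800), ('Tablet', 450), ('Monitor', 350)
""")
```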
You might use f-strings to pass parameters to your queries:
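For instance:

```python
min_price = 400
query = f"SELECT * FROM products WHERE price > {min_price}"
result = duckdb.sql(query).df()
```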
While this works, f-strings are dangerous. A malicious user could:
- Input "0; DROP TABLE products; --" to delete your table
- Input "0 UNION SELECT * FROM secrets" to steal data
DuckDB provides a safer way with parameterized queries using the ? placeholder:
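The same filter, parameterized:

```python
min_price = 400
result = duckdb.execute(
    "SELECT * FROM products WHERE price > ?", (min_price,)
).df()
```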
DuckDB binds 400 to the ? placeholder separately from parsing the SQL. Even if min_price contained malicious SQL, it would be treated as a literal value, which protects your database against injection attacks.
Try it
Use the ? placeholder to find products under $300:
💡 Solution
```python
result = duckdb.execute("SELECT * FROM products WHERE price < ?", (max_price,)).df()
```
The ? placeholder gets replaced with the value from the tuple. The trailing comma is required for single-element tuples.
Quiz
If a malicious user sets min_price = "0; DROP TABLE products", what happens with parameterized queries?
ACID Transactions
DuckDB supports ACID transactions for data integrity:
- Atomicity: The transaction either completes entirely or has no effect at all. If any operation fails, all changes are rolled back.
- Consistency: The database maintains valid data by enforcing all rules and constraints throughout the transaction.
- Isolation: Transactions run independently without interfering with each other.
- Durability: Committed changes are permanent and survive system failures.
Let’s demonstrate ACID properties with a bank transfer. First, set up accounts and create a transfer function with balance checking:
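A sketch of the setup (the file name and starting balances are illustrative; $500 each keeps the totals below easy to follow):

```python
import duckdb

con = duckdb.connect("bank.db")
con.execute("CREATE TABLE accounts (name VARCHAR PRIMARY KEY, balance DECIMAL(10, 2))")
con.execute("INSERT INTO accounts VALUES ('Alice', 500), ('Bob', 500), ('Charlie', 500)")
```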
Now define a transfer function that checks the balance before transferring. If funds are insufficient, it rolls back the transaction:
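One way to write it:

```python
def transfer(con, from_acct, to_acct, amount):
    con.execute("BEGIN TRANSACTION")
    try:
        balance = con.execute(
            "SELECT balance FROM accounts WHERE name = ?", (from_acct,)
        ).fetchone()[0]
        if balance < amount:
            raise ValueError(f"insufficient funds: {from_acct} has ${balance}")
        con.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, from_acct),
        )
        con.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, to_acct),
        )
        con.execute("COMMIT")
        print(f"Transferred ${amount} from {from_acct} to {to_acct}")
    except Exception as exc:
        con.execute("ROLLBACK")
        print(f"Transfer failed, rolled back: {exc}")
```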
In the code above:
- BEGIN TRANSACTION starts a transaction block where all changes remain hidden until committed
- COMMIT permanently saves all changes made within the transaction
- ROLLBACK cancels all changes and restores the database to its state before the transaction began
Now let’s perform a valid transfer of $200 from Alice to Bob:
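Using the helper above:

```python
transfer(con, "Alice", "Bob", 200)
print(con.sql("SELECT * FROM accounts ORDER BY name").df())
```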
Alice’s balance dropped from $500 to $300, and Bob’s increased from $500 to $700. This demonstrates two ACID properties:
- Atomicity: Both updates executed as a single unit
- Consistency: Total balance stayed at $1500
Now let’s attempt an invalid transfer. Bob tries to send $1000 to Charlie, but he only has $700:
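Using the same helper:

```python
transfer(con, "Bob", "Charlie", 1000)  # Bob only has $700
print(con.sql("SELECT * FROM accounts ORDER BY name").df())
```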
The transfer failed due to insufficient funds. ROLLBACK canceled all changes, so Bob still has $700 and Charlie still has $500. This demonstrates two more ACID properties:
- Atomicity: Either all operations succeed, or none do
- Durability: The successful transfer from earlier is permanently saved
Quiz
In the bank transfer example, what does Atomicity guarantee?
Attach External Databases
DuckDB can connect to external databases and query them as if they were local tables. This enables federated queries across PostgreSQL, MySQL, SQLite, and DuckDB files.
Attach a SQLite database:
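A sketch, with a hypothetical my_data.db file containing a customers table:

```python
import duckdb

duckdb.sql("INSTALL sqlite")
duckdb.sql("LOAD sqlite")
duckdb.sql("ATTACH 'my_data.db' AS lite (TYPE sqlite)")

result = duckdb.sql("SELECT * FROM lite.customers LIMIT 5").df()
```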
Querying a SQLite database is straightforward. Just install the extension, attach the file, and run SQL.
Quiz
What operations does DuckDB support on attached SQLite databases?
Attach a PostgreSQL database:
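A sketch with placeholder connection details and a hypothetical orders table:

```python
duckdb.sql("INSTALL postgres")
duckdb.sql("LOAD postgres")
duckdb.sql("""
    ATTACH 'host=localhost port=5432 dbname=shop user=postgres password=secret'
    AS pg (TYPE postgres)
""")

result = duckdb.sql("SELECT * FROM pg.orders LIMIT 5").df()
```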
Connecting to PostgreSQL is just as easy. Simply install the extension and provide your connection details.
When querying with filters, DuckDB pushes the WHERE condition to PostgreSQL. Only matching rows travel back over the network, then DuckDB runs aggregations on the received data.
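For example (assuming the pg.orders table from above has year, region, and amount columns):

```python
result = duckdb.sql("""
    SELECT region, SUM(amount) AS total
    FROM pg.orders
    WHERE year = 2024
    GROUP BY ALL
""").df()
```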
This query shows how DuckDB pushes the WHERE clause to PostgreSQL.
The diagram below illustrates the network optimization:
```
DuckDB                            PostgreSQL
│                                 │
│──── WHERE year = 2024 ────────> │
│                                 │ (filters 1M → 10K rows)
│ <─── 10K matching rows ──────── │
│                                 │
│ (runs SUM, GROUP BY locally)    │
```

Quiz
When you query a PostgreSQL table with a WHERE clause, what does DuckDB do?
Federated queries – join across databases:
DuckDB’s real power shows when joining across databases. Simply attach each source and reference tables with their database prefix.
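A sketch combining the attachments above with a local Parquet file (table and column names are hypothetical):

```python
result = duckdb.sql("""
    SELECT c.name, SUM(o.amount) AS total_spent, AVG(r.rating) AS avg_rating
    FROM lite.customers c
    JOIN pg.orders o ON c.customer_id = o.customer_id
    JOIN 'reviews.parquet' r ON c.customer_id = r.customer_id
    GROUP BY ALL
""").df()
```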
A single query joins data from SQLite, PostgreSQL, and a Parquet file. DuckDB handles the complexity of fetching, joining, and aggregating across different sources.
Quiz
To join a SQLite table with a PostgreSQL table in DuckDB, you need to:
Save attached database locally:
To avoid repeated network calls, you can create a local copy of remote tables.
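For example, materialize pg.orders once and then query the copy:

```python
# One network round trip: copy the remote table into local DuckDB
duckdb.sql("CREATE TABLE orders_local AS SELECT * FROM pg.orders")

# These three queries never touch PostgreSQL
duckdb.sql("SELECT COUNT(*) FROM orders_local").show()
duckdb.sql("SELECT region, SUM(amount) FROM orders_local GROUP BY ALL").show()
duckdb.sql("SELECT MAX(amount) FROM orders_local").show()
```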
Only the first query hits the network. The three subsequent queries run entirely on local data, avoiding repeated round trips to PostgreSQL.
Key Takeaways
DuckDB empowers data scientists with several key advantages:
- Zero-configuration SQL querying without database server setup
- Seamless integration with pandas and Polars DataFrames
- Out-of-core processing for datasets larger than available RAM
- Friendly SQL syntax with shortcuts like GROUP BY ALL and EXCLUDE
- Direct querying of cloud storage (S3, Azure, GCS)
- Reading and writing Hive-partitioned datasets
- Rich nested data types (LIST, STRUCT, MAP) with list comprehensions
- Federated queries across PostgreSQL, MySQL, and SQLite
- ACID transaction support ensuring data integrity
- High-performance columnar-vectorized execution engine
With DuckDB, you can unlock a new level of productivity and efficiency in your data analysis workflows.
Try DuckDB on your own datasets! Start by replacing pandas queries with DuckDB SQL and compare the performance.