What is DuckDB?
DuckDB is a fast, in-process SQL OLAP database optimized for analytics. Unlike traditional databases like PostgreSQL or MySQL that require server setup and maintenance, DuckDB runs directly in your Python process.
It’s perfect for data scientists because:
- Zero Configuration: No database server setup required
- Memory Efficiency: Out-of-core processing for datasets larger than RAM
- Familiar Interface: SQL syntax with shortcuts like GROUP BY ALL
- Performance: Columnar-vectorized engine faster than pandas
- Universal Access: Query files, cloud storage, and external databases
Installation
Install DuckDB with pip:
```bash
pip install duckdb
```

Let’s verify the installation:
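If the install succeeded, importing the package and running a throwaway query should just work:

```python
import duckdb

print(duckdb.__version__)                 # prints the installed version
duckdb.sql("SELECT 42 AS answer").show()  # a trivial sanity-check query
```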
Zero Configuration
SQL operations on DataFrames typically require setting up database servers. With pandas and PostgreSQL, you need to:
- Install and configure a database server
- Ensure the service is running
- Set up credentials and connections
- Write the DataFrame to a table first
DuckDB eliminates this overhead. Query DataFrames directly with SQL:
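Here’s a minimal sketch of what that looks like; the sales DataFrame and its columns are illustrative:

```python
import duckdb
import pandas as pd

# Hypothetical sales data
sales = pd.DataFrame({
    "product": ["A", "B", "C"],
    "amount": [100, 200, 150],
})

# DuckDB picks up the `sales` variable from the local scope
result = duckdb.sql("SELECT * FROM sales WHERE amount > 120").df()
print(result)
```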
Notice how the query returns results instantly. There’s no connection string, no server startup time, and no authentication steps.
Try it
Edit the query to select items with quantity greater than 30 from the inventory DataFrame:
💡 Solution
```python
result = duckdb.sql("SELECT * FROM inventory WHERE quantity > 30").df()
```
Quiz
In the code above, how does DuckDB access the sales DataFrame?
Integrate Seamlessly with pandas and Polars
Have you ever wanted to leverage SQL’s power while working with your favorite data manipulation libraries such as pandas and Polars?
DuckDB makes it seamless to query pandas and Polars DataFrames via the duckdb.sql function.
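A minimal sketch, assuming both pandas and Polars are installed (the pd_df and pl_df names match the discussion below):

```python
import duckdb
import pandas as pd
import polars as pl

pd_df = pd.DataFrame({"x": [1, 2, 3]})   # pandas DataFrame
pl_df = pl.DataFrame({"x": [4, 5, 6]})   # Polars DataFrame

# The same duckdb.sql call works on either one
print(duckdb.sql("SELECT SUM(x) AS total FROM pd_df").df())
print(duckdb.sql("SELECT SUM(x) AS total FROM pl_df").df())
```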
DuckDB recognized both pd_df (pandas) and pl_df (Polars) as DataFrame variables and queried them directly with SQL. No imports or registration needed.
DuckDB’s integration with pandas and Polars lets you combine the strengths of each tool. For example, you can:
- Use pandas for data cleaning and feature engineering
- Use DuckDB for complex aggregations and queries
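The pandas half of that workflow might look like this (illustrative data and column names):

```python
import pandas as pd

df = pd.DataFrame({
    "order_date": pd.to_datetime(["2024-01-15", "2024-02-20", "2024-03-05"]),
    "amount": [120.0, 45.0, 300.0],
})

# Feature engineering in pandas
df["month"] = df["order_date"].dt.month   # extract month from dates
df["is_high_value"] = df["amount"] > 100  # flag high-value orders
```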
pandas makes feature engineering straightforward: extracting month from dates and creating is_high_value flags are common transformations for preparing data for analysis or machine learning.
Now use DuckDB for complex aggregations:
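Continuing the sketch above, a DuckDB aggregation over the engineered DataFrame could be:

```python
import duckdb

result = duckdb.sql("""
    SELECT
        month,
        AVG(amount) AS avg_amount,
        COUNT(CASE WHEN is_high_value THEN 1 END) AS high_value_orders
    FROM df
    GROUP BY ALL
    ORDER BY month
""").df()
```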
DuckDB excels at complex aggregations: combining GROUP BY, AVG, and conditional CASE WHEN in a single query is more readable and efficient than equivalent pandas code.
Try it
Edit the query to combine results from both df_2023 and df_2024 using UNION ALL:
💡 Solution
```python
result = duckdb.sql("SELECT * FROM df_2023 UNION ALL SELECT * FROM df_2024").df()
```
Quiz
What makes DuckDB’s approach to complex aggregations more readable than pandas?
Memory Efficiency
Pandas loads entire datasets into RAM before filtering, which can cause out-of-memory errors. DuckDB processes only the rows that match your filter, using a fraction of the memory. To see this in action, let’s compare both approaches on the same dataset.
First, create a sample CSV file:
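One way to generate it (the file name, columns, and row counts are assumptions chosen to match the diagram below):

```python
import numpy as np
import pandas as pd

n = 1_000_000
pd.DataFrame({
    # Make 'North' rare (~1%) so the filter keeps roughly 10K rows
    "region": np.random.choice(
        ["North", "South", "East", "West"], n, p=[0.01, 0.33, 0.33, 0.33]
    ),
    "amount": np.random.rand(n) * 100,
}).to_csv("sales.csv", index=False)
```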
With pandas, filtering loads ALL records into RAM first:
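Something like:

```python
import pandas as pd

df = pd.read_csv("sales.csv")          # all 1M rows loaded into RAM
north = df[df["region"] == "North"]    # filtering happens after the full load
```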
With DuckDB, only matching rows enter memory:
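The DuckDB equivalent queries the file in place:

```python
import duckdb

north = duckdb.sql(
    "SELECT * FROM 'sales.csv' WHERE region = 'North'"
).df()  # only the matching rows are materialized
```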
The diagram below summarizes the memory difference:
```
RAM Usage
│
│ ████████████  Pandas (loads all 1M rows)
│
│ ██            DuckDB (streams, keeps 10K matches)
│
└──────────────────────────────────────────────
```

Quiz
What’s the key difference between how pandas and DuckDB handle the filter region = 'North'?
Out-of-Core Processing
When your dataset exceeds available RAM, pandas requires workarounds like chunking or Dask. DuckDB solves this with out-of-core processing:
- Breaks data into chunks that fit in memory
- Processes each chunk separately
- Stores intermediate results on disk
- Merges results into the final output
```
Dataset (50GB)          Memory (16GB)          Disk (temp files)
┌──────────┐            ┌──────────┐           ┌──────────────┐
│ Chunk 1  │ ─────────> │ Process  │ ────────> │ result_1.tmp │
│ Chunk 2  │            │ in RAM   │           │ result_2.tmp │
│ Chunk 3  │            └──────────┘           │ result_3.tmp │
│   ...    │                                   └──────┬───────┘
└──────────┘                                          │ merge
                                                      ▼
                                               ┌─────────────┐
                                               │ Final Result│
                                               └─────────────┘
```

The following example demonstrates this process. We’ll limit DuckDB to 10MB of memory and sort 5 million rows:
The query succeeds despite processing 5 million rows with only 10MB of memory. If you check your temp directory while the query runs, you’ll see temporary files being created. These files are automatically deleted once the query completes.
You can also configure where DuckDB stores temporary files:
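For example (the path is arbitrary):

```python
duckdb.sql("SET temp_directory = '/tmp/duckdb_spill'")
```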
Quiz
What does “out-of-core processing” mean?
When running the example above, what happens to the temporary files after the query completes?
Fast Performance
While pandas processes data sequentially row-by-row, DuckDB uses a columnar-vectorized execution engine that processes data in parallel chunks. The diagram below shows how each approach handles data:
```
Pandas                         DuckDB
│                              │
├─ Row 1 ──────> process       ├─ Chunk 1 (2048 rows) ─┐
├─ Row 2 ──────> process       ├─ Chunk 2 (2048 rows) ─┼─> process
├─ Row 3 ──────> process       ├─ Chunk 3 (2048 rows) ─┘
├─ Row 4 ──────> process       │
│  ...                         │
▼                              ▼
Sequential                     Parallel chunks
```

This architectural difference enables DuckDB to significantly outperform pandas, especially for computationally intensive operations like aggregations and joins.
Let’s compare the performance of pandas and DuckDB for aggregations on a million rows of data.
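A simple benchmark along these lines (exact timings depend on your machine):

```python
import time

import duckdb
import numpy as np
import pandas as pd

n = 1_000_000
df = pd.DataFrame({
    "category": np.random.choice(list("ABCDE"), n),
    "value": np.random.rand(n),
})

start = time.perf_counter()
df.groupby("category")["value"].mean()
print(f"pandas: {time.perf_counter() - start:.4f}s")

start = time.perf_counter()
duckdb.sql("SELECT category, AVG(value) FROM df GROUP BY ALL").df()
print(f"DuckDB: {time.perf_counter() - start:.4f}s")
```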
DuckDB completes the same aggregation ~8x faster than pandas. The speedup comes from DuckDB’s columnar-vectorized execution engine processing data in parallel chunks.
This benchmark was run in native Python. Results may vary in browser-based environments.
Quiz
How does pandas process data differently from DuckDB?
FROM-First Syntax
Traditional SQL requires SELECT before FROM. This adds unnecessary boilerplate when you just want a quick look at your data:
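For a quick peek at an assumed sales DataFrame, the traditional form is:

```python
result = duckdb.sql("SELECT * FROM sales").df()
```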
DuckDB lets you skip SELECT * entirely, making quick data exploration faster:
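The FROM-first version drops the boilerplate:

```python
result = duckdb.sql("FROM sales").df()  # equivalent to SELECT * FROM sales
```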
Notice the results are the same. This confirms that FROM table automatically selects all columns.
Try it
Write a FROM-first query to get all sales with amount > 150:
💡 Solution
```python
result = duckdb.sql("FROM sales WHERE amount > 150").df()
```
Quiz
What happens when you run FROM sales in DuckDB?
GROUP BY ALL
When using GROUP BY, you typically repeat every non-aggregated column:
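For example, with the sales data from earlier (product, region, and amount columns assumed):

```python
result = duckdb.sql("""
    SELECT product, region, SUM(amount) AS total
    FROM sales
    GROUP BY product, region
""").df()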
DuckDB infers grouping columns automatically with GROUP BY ALL:
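The same query, without repeating the column list:

```python
result = duckdb.sql("""
    SELECT product, region, SUM(amount) AS total
    FROM sales
    GROUP BY ALL
""").df()
```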
Both queries produce the same result. GROUP BY ALL automatically detects product and region as grouping columns, so you don’t have to list them twice.
Try it
Rewrite this query using GROUP BY ALL instead of listing columns:
💡 Solution
```python
result = duckdb.sql("""
    SELECT region, COUNT(*) as count, AVG(amount) as avg_amount
    FROM sales
    GROUP BY ALL
""").df()
```
Quiz
How does GROUP BY ALL determine which columns to group by?
SELECT * EXCLUDE
When you need all columns except a few, traditional SQL requires listing every column you want:
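Suppose a users table has id, name, email, and password columns (an assumption for this sketch). Skipping password the traditional way means spelling out everything else:

```python
result = duckdb.sql("SELECT id, name, email FROM users").df()
```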
DuckDB’s EXCLUDE lets you specify what to remove instead:
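With EXCLUDE:

```python
result = duckdb.sql("SELECT * EXCLUDE (password) FROM users").df()
```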
The results are identical. EXCLUDE is shorter to write and stays correct even if you add new columns to the table later.
You can also exclude multiple columns at once with a comma-separated list:
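For example:

```python
result = duckdb.sql("SELECT * EXCLUDE (password, email) FROM users").df()
```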
Try it
Write a query to get all columns except email from the users table:
💡 Solution
```python
result = duckdb.sql("SELECT * EXCLUDE (email) FROM users").df()
```
Quiz
What happens when you add a new column to a table queried with SELECT * EXCLUDE (password)?
SELECT * REPLACE
When you need to transform one column while keeping others, traditional SQL requires listing every column:
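Suppose a products table stores name, price_cents, and stock (assumed columns), and you want prices in dollars. The traditional form repeats every column:

```python
result = duckdb.sql("""
    SELECT name, price_cents / 100 AS price_cents, stock
    FROM products
""").df()
```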
DuckDB’s REPLACE lets you transform just the column you need:
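With REPLACE:

```python
result = duckdb.sql(
    "SELECT * REPLACE (price_cents / 100 AS price_cents) FROM products"
).df()
```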
The results are identical. REPLACE modified price_cents while automatically keeping name and stock unchanged.
Try it
Write a query to convert name to uppercase using REPLACE:
💡 Solution
```python
result = duckdb.sql("SELECT * REPLACE (UPPER(name) AS name) FROM products").df()
```
Quiz
What does SELECT * REPLACE (price / 100 AS price) do?
Streamlined File Reading
DuckDB can query files directly without loading them into memory first. It supports CSV, Parquet, and JSON formats, automatically detecting structure, delimiters, and column types.
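For example, counting rows per species in a publicly hosted copy of the iris dataset (the URL is one such copy; recent DuckDB versions autoload the httpfs extension for HTTP reads):

```python
import duckdb

result = duckdb.sql("""
    SELECT species, COUNT(*) AS count
    FROM 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv'
    GROUP BY ALL
""").df()
```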
The query runs directly on a URL without downloading the file first. DuckDB streams the data and returns aggregated counts per species.
Quiz
What’s the advantage of querying a remote file with DuckDB compared to pd.read_csv(url)?
Query Cloud Storage
DuckDB can query files directly from cloud storage providers like AWS S3, Google Cloud Storage, and Azure Blob Storage without downloading them first.
Query from S3:
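A sketch with a hypothetical bucket and key:

```python
result = duckdb.sql("""
    SELECT *
    FROM read_parquet('s3://my-bucket/sales/2024.parquet')
    LIMIT 5
""").df()
```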
Reading files from S3 works just like reading local files. Simply pass an S3 path to read_parquet.
For private buckets, add your credentials once and DuckDB handles authentication automatically.
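With DuckDB’s Secrets Manager (available in recent versions), that looks roughly like this; the credential values are placeholders:

```python
duckdb.sql("""
    CREATE SECRET my_s3 (
        TYPE S3,
        KEY_ID 'AKIA...',   -- placeholder access key
        SECRET '...',       -- placeholder secret key
        REGION 'us-east-1'
    )
""")
```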
The same approach works with other cloud providers.
Quiz
What makes querying cloud storage in DuckDB different from traditional approaches?
Automatic Parsing of CSV Files
When working with CSV files that have non-standard delimiters, pandas requires you to specify parameters like delimiter to avoid parsing errors.
With pandas, a pipe-delimited CSV is parsed incorrectly:
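First write a small pipe-delimited file (the path is arbitrary; the rows mirror the exercise below), then read it with pandas defaults:

```python
import pandas as pd

csv_content = """FlightDate|Carrier|Origin|Destination
2024-01-01|AA|NYC|LAX
2024-01-02|UA|SFO|ORD"""

with open("/tmp/flights.csv", "w") as f:
    f.write(csv_content)

df = pd.read_csv("/tmp/flights.csv")  # no delimiter specified
print(df.columns.tolist())            # one mangled column instead of four
```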
Pandas created a single column containing the entire pipe-separated row as one string. Without specifying delimiter='|', it defaulted to comma separation.
With DuckDB, the delimiter is auto-detected:
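Point read_csv at the same file, with no extra options:

```python
import duckdb

result = duckdb.sql("SELECT * FROM read_csv('/tmp/flights.csv')").df()
print(result)
```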
DuckDB correctly parsed 4 separate columns (FlightDate, Carrier, Origin, Destination) by auto-detecting the pipe delimiter. No configuration needed.
Try it
Change the delimiter from | to ; and run the query. Does DuckDB still parse it correctly?
💡 Solution
Change | to ; in the csv_content string:
```python
csv_content = """FlightDate;Carrier;Origin;Destination
2024-01-01;AA;NYC;LAX
2024-01-02;UA;SFO;ORD"""
```
DuckDB auto-detects semicolons, tabs, pipes, and other common delimiters.
Quiz
How did DuckDB correctly parse the pipe-delimited file?
Automatic Flattening of Nested Parquet Files
When working with large, nested Parquet files, you typically need to pre-process the data to flatten nested structures or write complex extraction scripts, which adds time and complexity to your workflow.
With pandas, you need to manually flatten nested structures:
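To make this concrete, first create a Parquet file with a nested details struct (the path and values are illustrative; DuckDB itself is a convenient way to write it), then read it with pandas:

```python
import duckdb
import pandas as pd

duckdb.sql("""
    COPY (
        SELECT * FROM (VALUES
            (1, {'name': 'Alice', 'age': 30}),
            (2, {'name': 'Bob', 'age': 25})
        ) AS t(id, details)
    ) TO '/tmp/users.parquet' (FORMAT PARQUET)
""")

df = pd.read_parquet("/tmp/users.parquet")
# details holds dicts; pulling out a field takes apply() or a comprehension
df["name"] = df["details"].apply(lambda d: d["name"])
```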
The details column contains dictionaries. To access name or age, you’d need to manually flatten with list comprehensions or apply().
With DuckDB, query nested fields directly using dot notation:
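The same file, queried directly:

```python
result = duckdb.sql("""
    SELECT id, details.name AS name, details.age AS age
    FROM read_parquet('/tmp/users.parquet')
""").df()
```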
DuckDB’s dot notation (details.name, details.age) extracts nested fields directly in SQL. No manual flattening, list comprehensions, or apply() needed.
Try it
Fill in the dot notation to extract color and weight from the specs struct:
💡 Solution
```python
result = duckdb.sql("""
    SELECT name, specs.color AS color, specs.weight AS weight
    FROM read_parquet('/tmp/products.parquet')
""").df()
```
Quiz
How does DuckDB access nested fields in Parquet files?
Automatic Flattening of Nested JSON Files
When working with JSON files that have nested structures, you need to normalize the data with pandas to access nested fields.
With pandas, you need json_normalize() to flatten nested structures:
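A sketch with a small nested JSON file (the path and fields are illustrative, matching the discussion below):

```python
import json

import pandas as pd

records = [
    {"id": 1, "profile": {"name": "Alice", "active": True}},
    {"id": 2, "profile": {"name": "Bob", "active": False}},
]
with open("/tmp/users.json", "w") as f:
    json.dump(records, f)

df = pd.json_normalize(records)
print(df.columns.tolist())  # ['id', 'profile.name', 'profile.active']
```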
Flattening nested JSON required importing json_normalize(), calling it on your data, then likely renaming the dot-notation columns (profile.name, profile.active) to something cleaner.
With DuckDB, you can query each nested field directly with the syntax field_name.nested_field_name:
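The DuckDB version of the same extraction:

```python
import duckdb

result = duckdb.sql("""
    SELECT id, profile.name AS name, profile.active AS active
    FROM read_json('/tmp/users.json')
""").df()
```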
DuckDB queries nested JSON fields with the same dot notation as Parquet. No json_normalize() step needed before analysis.
Quiz
In the query above, what does profile.name AS name do?
Reading Multiple Files
Reading Multiple Files from a Directory
Reading multiple files from a directory is common in data pipelines.
First, let’s create some sample CSV files to work with:
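For instance (file names and contents are illustrative):

```python
import os

import pandas as pd

os.makedirs("data", exist_ok=True)
pd.DataFrame({"month": ["Jan"] * 3, "amount": [100, 200, 150]}) \
    .to_csv("data/sales_january.csv", index=False)
pd.DataFrame({"month": ["Feb"] * 3, "amount": [120, 180, 210]}) \
    .to_csv("data/sales_february.csv", index=False)
```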
With pandas, you need to read each file separately then concatenate:
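Something like:

```python
df_jan = pd.read_csv("data/sales_january.csv")
df_feb = pd.read_csv("data/sales_february.csv")
combined = pd.concat([df_jan, df_feb], ignore_index=True)
```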
Pandas required reading each file into separate DataFrames, then concatenating them. With more files, this approach becomes tedious and error-prone.
With DuckDB, use glob patterns to read all files at once:
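One query covers every file:

```python
import duckdb

result = duckdb.sql("SELECT * FROM 'data/*.csv'").df()
```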
Both January and February data appear in a single result from one query. The *.csv glob pattern matched all CSV files in the directory automatically.
Quiz
Why is the glob pattern useful as your data grows?
Read From Multiple Sources
DuckDB allows you to read from multiple sources in a single query, which makes combining data from different places straightforward.
You can mix DataFrames, CSV files, Parquet files, and JSON files seamlessly.
First, let’s create two related DataFrames:
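For example:

```python
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
})
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "amount": [100, 250, 80, 40],
})
```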
Now use DuckDB to join and aggregate these DataFrames with a single SQL query:
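A sketch:

```python
import duckdb

result = duckdb.sql("""
    SELECT c.name, COUNT(*) AS order_count, SUM(o.amount) AS total_spent
    FROM customers c
    JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY ALL
""").df()
```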
The query joined two separate DataFrames (customers and orders) with standard SQL syntax. No need to merge DataFrames in pandas first.
Quiz
What pandas operation does DuckDB’s JOIN replace?
Hive Partitioned Datasets
When working with large datasets, it’s common to partition data into separate files by date, region, or other columns. A typical approach is organizing files into folders:
```
data/
├── 2023/
│   ├── 01/data.parquet
│   └── 02/data.parquet
└── 2024/
    └── 01/data.parquet
```

With this structure, you lose the partition information when reading. You’d need to extract year and month from file paths manually.
Hive-style partitioning solves this by encoding column values in directory names:
```
data/
├── year=2023/
│   ├── month=01/data.parquet
│   └── month=02/data.parquet
└── year=2024/
    └── month=01/data.parquet
```

DuckDB can both write and read Hive-partitioned datasets automatically.
Writing partitioned data:
Use PARTITION_BY to create a Hive-partitioned dataset:
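A sketch with a few hand-written rows (the target directory name is arbitrary):

```python
import duckdb

duckdb.sql("""
    COPY (
        SELECT * FROM (VALUES
            (2023, 'Q1', 100),
            (2023, 'Q2', 150),
            (2024, 'Q1', 200)
        ) AS t(year, quarter, amount)
    ) TO 'data_partitioned' (FORMAT PARQUET, PARTITION_BY (year, quarter))
""")
```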
Verify the directory structure:
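A quick way to list what was written:

```python
import os

for root, _dirs, files in os.walk("data_partitioned"):
    for name in files:
        print(os.path.join(root, name))
# e.g. data_partitioned/year=2023/quarter=Q1/data_0.parquet
```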
DuckDB created directories like year=2023/quarter=Q1/. Each partition folder contains only rows matching those values.
Quiz
What does PARTITION_BY (year, quarter) do when writing data?
Reading partitioned data:
Now read the data back. DuckDB auto-detects the column=value folder pattern:
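For the dataset written above:

```python
result = duckdb.sql("""
    SELECT *
    FROM read_parquet('data_partitioned/*/*/*.parquet', hive_partitioning = true)
""").df()
```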
Notice year and quarter columns are included in the results. DuckDB extracted these from the directory names without any extra code.
Quiz
Why partition large datasets into folders like year=2023/quarter=Q1/?
Exporting Data
DuckDB can export query results directly to CSV, Parquet, and JSON files without converting to pandas first. This avoids memory overhead for large datasets.
Export to CSV:
Use .write_csv() to save query results directly to a CSV file:
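For example, assuming the sales data from earlier:

```python
duckdb.sql("SELECT * FROM sales WHERE amount > 100").write_csv("high_value.csv")
```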
Export to Parquet:
Use .write_parquet() for columnar storage with optional compression:
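For example:

```python
duckdb.sql("SELECT * FROM sales").write_parquet("sales.parquet", compression="zstd")
```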
Both .write_csv() and .write_parquet() export query results directly. No pandas conversion needed.
Using COPY statement:
The COPY statement provides more control over export options:
- DELIMITER '|' – Custom delimiters (comma, pipe, tab)
- HEADER true – Include or exclude header row
- COMPRESSION 'gzip' – Compress output (gzip, zstd, snappy)
- DATEFORMAT '%Y-%m-%d' – Custom date formatting
- NULL 'NA' – Custom null value representation
Here’s an example using a custom delimiter and header option together.
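A sketch, again assuming the sales data:

```python
duckdb.sql("""
    COPY (SELECT * FROM sales)
    TO 'sales_pipe.csv' (FORMAT CSV, DELIMITER '|', HEADER true)
""")
```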
The output uses pipe (|) delimiters instead of commas, as specified by DELIMITER '|'.
Quiz
What’s the advantage of .write_parquet() over converting to pandas first?
Creating Lists, Structs, and Maps
DuckDB supports rich nested data types that go beyond traditional SQL databases. You can create and manipulate LISTs, STRUCTs, and MAPs directly in SQL.
LIST – ordered collection of values:
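For example:

```python
import duckdb

result = duckdb.sql("""
    SELECT [1, 2, 3] AS numbers, ['a', 'b', 'c'] AS letters
""").df()
```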
Unlike traditional SQL arrays, DuckDB lists use familiar Python-style square brackets.
Quiz
What type of values can a DuckDB list contain?
STRUCT – named fields (like a dictionary):
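A sketch combining two struct rows:

```python
result = duckdb.sql("""
    SELECT {'name': 'Alice', 'age': 30} AS person
    UNION ALL
    SELECT {'name': 'Bob', 'age': 25}
""").df()
```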
Structs being combined (for example with UNION ALL) must share the same field names; DuckDB throws an error if the fields don’t match.
Both rows have identical fields: name and age. Trying to UNION structs with different field names would throw an error.
MAP – key-value pairs:
Use maps when each row needs different keys. No schema enforcement.
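For example:

```python
result = duckdb.sql("""
    SELECT 'Alice' AS name, MAP {'steps': 8000} AS metrics
    UNION ALL
    SELECT 'Bob', MAP {'calories': 2000}
""").df()
```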
Alice has only steps, Bob has only calories. Maps allow different keys per row without errors.
Quiz
You’re storing product attributes where laptops have RAM and shirts have size. Which type should you use?
Manipulating Nested Data
In traditional SQL, working with nested data requires complex joins or custom functions. DuckDB provides native operations: list indexing, Python-style comprehensions, dot notation for structs, and UNNEST to flatten lists.
Access list elements:
You can access elements using 1-based indexing, negative indices for the end, and slicing.
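All three in one query:

```python
result = duckdb.sql("""
    SELECT
        [10, 20, 30, 40, 50][1]   AS first_element,  -- 1-based: 10
        [10, 20, 30, 40, 50][-1]  AS last_element,   -- from the end: 50
        [10, 20, 30, 40, 50][2:4] AS middle          -- inclusive slice: [20, 30, 40]
""").df()
```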
Unlike Python’s 0-based indexing, DuckDB lists start at 1. The slice [2:4] includes both endpoints.
Try it
Write the slice to extract [20, 30, 40] from the list:
💡 Solution
```sql
SELECT [10, 20, 30, 40, 50][2:4] AS middle_three
```
The slice [2:4] gets elements from index 2 to 4 (inclusive), which are 20, 30, and 40.
Transform lists with list comprehensions:
You can transform lists directly in SQL using Python-style comprehensions.
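For example:

```python
result = duckdb.sql("""
    SELECT
        [x * 2 FOR x IN [1, 2, 3, 4]]      AS doubled,
        [x FOR x IN [1, 2, 3, 4] IF x > 2] AS filtered
""").df()
```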
The doubled column multiplies each element by 2, while filtered keeps only values greater than 2. No Python code needed.
Quiz
How do you filter a list to keep only values greater than 5?
Access struct fields:
DuckDB lets you access struct fields using dot notation, just like object properties.
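For example:

```python
result = duckdb.sql("""
    SELECT student.name, student.score
    FROM (SELECT {'name': 'Bob', 'score': 95} AS student)
""").df()
```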
Dot notation extracts Bob from student.name and 95 from student.score. No need to parse JSON or use special functions.
Quiz
What does {'name': 'Alice', 'score': 90}.score return?
Unnest lists to rows:
The UNNEST function expands list elements into individual rows for row-by-row analysis.
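A sketch with two rows of tags:

```python
result = duckdb.sql("""
    SELECT id, UNNEST(tags) AS tag
    FROM (VALUES
        (1, ['sql', 'duckdb']),
        (2, ['python', 'pandas', 'polars'])
    ) AS t(id, tags)
""").df()
```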
Each tag from the original lists becomes its own row. ID 1 had 2 tags, so it appears twice. ID 2 had 3 tags, so it appears three times.
Quiz
How many rows does SELECT UNNEST([1, 2, 3]) return?
Parameterized Queries
When working with databases, you often need to run similar queries with different parameters. For instance, you might want to filter a table using various criteria.
First, let’s create a sample products table:
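For example (names and prices are illustrative):

```python
import duckdb

duckdb.sql("CREATE TABLE products (name VARCHAR, price DOUBLE)")
duckdb.sql("""
    INSERT INTO products VALUES
        ('Laptop', 1200), ('Phone', 800), ('Tablet', 450), ('Monitor', 350)
""")
```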
You might use f-strings to pass parameters to your queries:
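For instance:

```python
min_price = 400
query = f"SELECT * FROM products WHERE price > {min_price}"
result = duckdb.sql(query).df()
```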
While this works, f-strings are dangerous. A malicious user could:
- Input "0; DROP TABLE products; --" to delete your table
- Input "0 UNION SELECT * FROM secrets" to steal data
DuckDB provides a safer way with parameterized queries using the ? placeholder:
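The same filter, parameterized:

```python
min_price = 400
result = duckdb.execute(
    "SELECT * FROM products WHERE price > ?", (min_price,)
).df()
```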
DuckDB binds 400 to the ? placeholder separately from parsing the SQL. Even if min_price contained malicious SQL, it would be treated as a literal value, which protects your database against injection attacks.
Try it
Use the ? placeholder to find products under $300:
💡 Solution
```python
result = duckdb.execute("SELECT * FROM products WHERE price < ?", (max_price,)).df()
```
The ? placeholder gets replaced with the value from the tuple. The trailing comma is required for single-element tuples.
Quiz
If a malicious user sets min_price = "0; DROP TABLE products", what happens with parameterized queries?
ACID Transactions
DuckDB supports ACID transactions for data integrity:
- Atomicity: The transaction either completes entirely or has no effect at all. If any operation fails, all changes are rolled back.
- Consistency: The database maintains valid data by enforcing all rules and constraints throughout the transaction.
- Isolation: Transactions run independently without interfering with each other.
- Durability: Committed changes are permanent and survive system failures.
Let’s demonstrate ACID properties with a bank transfer. First, set up accounts and create a transfer function with balance checking:
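A sketch of the setup (the file name and starting balances are illustrative; $500 each keeps the totals below easy to follow):

```python
import duckdb

con = duckdb.connect("bank.db")
con.execute("CREATE TABLE accounts (name VARCHAR PRIMARY KEY, balance DECIMAL(10, 2))")
con.execute("INSERT INTO accounts VALUES ('Alice', 500), ('Bob', 500), ('Charlie', 500)")
```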
Now define a transfer function that checks the balance before transferring. If funds are insufficient, it rolls back the transaction:
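One way to write it:

```python
def transfer(con, from_acct, to_acct, amount):
    con.execute("BEGIN TRANSACTION")
    try:
        balance = con.execute(
            "SELECT balance FROM accounts WHERE name = ?", (from_acct,)
        ).fetchone()[0]
        if balance < amount:
            raise ValueError(f"insufficient funds: {from_acct} has ${balance}")
        con.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, from_acct),
        )
        con.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, to_acct),
        )
        con.execute("COMMIT")
        print(f"Transferred ${amount} from {from_acct} to {to_acct}")
    except Exception as exc:
        con.execute("ROLLBACK")
        print(f"Transfer failed, rolled back: {exc}")
```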
In the code above:
- BEGIN TRANSACTION starts a transaction block where all changes remain hidden until committed
- COMMIT permanently saves all changes made within the transaction
- ROLLBACK cancels all changes and restores the database to its state before the transaction began
Now let’s perform a valid transfer of $200 from Alice to Bob:
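Using the helper above:

```python
transfer(con, "Alice", "Bob", 200)
print(con.sql("SELECT * FROM accounts ORDER BY name").df())
```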
Alice’s balance dropped from $500 to $300, and Bob’s increased from $500 to $700. This demonstrates two ACID properties:
- Atomicity: Both updates executed as a single unit
- Consistency: Total balance stayed at $1500
Now let’s attempt an invalid transfer. Bob tries to send $1000 to Charlie, but he only has $700:
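Using the same helper:

```python
transfer(con, "Bob", "Charlie", 1000)  # Bob only has $700
print(con.sql("SELECT * FROM accounts ORDER BY name").df())
```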
The transfer failed due to insufficient funds. ROLLBACK canceled all changes, so Bob still has $700 and Charlie still has $500. This demonstrates two more ACID properties:
- Atomicity: Either all operations succeed, or none do
- Durability: The successful transfer from earlier is permanently saved
Quiz
In the bank transfer example, what does Atomicity guarantee?
Attach External Databases
DuckDB can connect to external databases and query them as if they were local tables. This enables federated queries across PostgreSQL, MySQL, SQLite, and DuckDB files.
Attach a SQLite database:
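A sketch, with a hypothetical my_data.db file containing a customers table:

```python
import duckdb

duckdb.sql("INSTALL sqlite")
duckdb.sql("LOAD sqlite")
duckdb.sql("ATTACH 'my_data.db' AS lite (TYPE sqlite)")

result = duckdb.sql("SELECT * FROM lite.customers LIMIT 5").df()
```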
Querying a SQLite database is straightforward. Just install the extension, attach the file, and run SQL.
Quiz
What operations does DuckDB support on attached SQLite databases?
Attach a PostgreSQL database:
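A sketch with placeholder connection details and a hypothetical orders table:

```python
duckdb.sql("INSTALL postgres")
duckdb.sql("LOAD postgres")
duckdb.sql("""
    ATTACH 'host=localhost port=5432 dbname=shop user=postgres password=secret'
    AS pg (TYPE postgres)
""")

result = duckdb.sql("SELECT * FROM pg.orders LIMIT 5").df()
```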
Connecting to PostgreSQL is just as easy. Simply install the extension and provide your connection details.
When querying with filters, DuckDB pushes the WHERE condition to PostgreSQL. Only matching rows travel back over the network, then DuckDB runs aggregations on the received data.
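For example (assuming the pg.orders table from above has year, region, and amount columns):

```python
result = duckdb.sql("""
    SELECT region, SUM(amount) AS total
    FROM pg.orders
    WHERE year = 2024
    GROUP BY ALL
""").df()
```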
This query shows how DuckDB pushes the WHERE clause to PostgreSQL.
The diagram below illustrates the network optimization:
```
DuckDB                            PostgreSQL
│                                 │
│──── WHERE year = 2024 ────────> │
│                                 │ (filters 1M → 10K rows)
│ <─── 10K matching rows ──────── │
│                                 │
│ (runs SUM, GROUP BY locally)    │
```

Quiz
When you query a PostgreSQL table with a WHERE clause, what does DuckDB do?
Federated queries – join across databases:
DuckDB’s real power shows when joining across databases. Simply attach each source and reference tables with their database prefix.
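A sketch combining the attachments above with a local Parquet file (table and column names are hypothetical):

```python
result = duckdb.sql("""
    SELECT c.name, SUM(o.amount) AS total_spent, AVG(r.rating) AS avg_rating
    FROM lite.customers c
    JOIN pg.orders o ON c.customer_id = o.customer_id
    JOIN 'reviews.parquet' r ON c.customer_id = r.customer_id
    GROUP BY ALL
""").df()
```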
A single query joins data from SQLite, PostgreSQL, and a Parquet file. DuckDB handles the complexity of fetching, joining, and aggregating across different sources.
Quiz
To join a SQLite table with a PostgreSQL table in DuckDB, you need to:
Save attached database locally:
To avoid repeated network calls, you can create a local copy of remote tables.
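For example, materialize pg.orders once and then query the copy:

```python
# One network round trip: copy the remote table into local DuckDB
duckdb.sql("CREATE TABLE orders_local AS SELECT * FROM pg.orders")

# These three queries never touch PostgreSQL
duckdb.sql("SELECT COUNT(*) FROM orders_local").show()
duckdb.sql("SELECT region, SUM(amount) FROM orders_local GROUP BY ALL").show()
duckdb.sql("SELECT MAX(amount) FROM orders_local").show()
```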
Only the first query hits the network. The three subsequent queries run entirely on local data, avoiding repeated round trips to PostgreSQL.
Key Takeaways
DuckDB empowers data scientists with several key advantages:
- Zero-configuration SQL querying without database server setup
- Seamless integration with pandas and Polars DataFrames
- Out-of-core processing for datasets larger than available RAM
- Friendly SQL syntax with shortcuts like GROUP BY ALL and EXCLUDE
- Direct querying of cloud storage (S3, Azure, GCS)
- Reading and writing Hive-partitioned datasets
- Rich nested data types (LIST, STRUCT, MAP) with list comprehensions
- Federated queries across PostgreSQL, MySQL, and SQLite
- ACID transaction support ensuring data integrity
- High-performance columnar-vectorized execution engine
With DuckDB, you can unlock a new level of productivity and efficiency in your data analysis workflows.
Try DuckDB on your own datasets! Start by replacing pandas queries with DuckDB SQL and compare the performance.