What is DuckDB?
DuckDB is a fast, in-process SQL OLAP database optimized for analytics. Unlike traditional client-server databases such as PostgreSQL or MySQL, which require server setup and maintenance, DuckDB runs directly in your Python process.
It’s perfect for data scientists because:
- Zero Configuration: No database server setup required
- Memory Efficiency: Process large datasets without loading everything into memory
- Familiar Interface: Use SQL syntax directly in Python
- Performance: Faster than pandas for joins and aggregations
- File Format Support: Query CSV, Parquet, and JSON directly
Installation
Install DuckDB with pip:
pip install duckdb

Let’s verify the installation:
Zero Configuration
SQL operations on DataFrames typically require setting up database servers. With pandas and PostgreSQL, you need to:
- Install and configure a database server
- Ensure the service is running
- Set up credentials and connections
- Write the DataFrame to a table first
DuckDB eliminates this overhead. Query DataFrames directly with SQL:
Notice how the query returns results instantly. There’s no connection string, no server startup time, and no authentication steps.
In the code above, how does DuckDB access the sales DataFrame?
Integrate Seamlessly with pandas and Polars
Have you ever wanted to leverage SQL’s power while working with your favorite data manipulation libraries such as pandas and Polars?
DuckDB makes it seamless to query pandas and Polars DataFrames via the duckdb.sql function.
Query pandas DataFrame:
a b
0 1 4
1 2 5
2 3 6
Query Polars DataFrame:
a b
0 1 4
1 2 5
2 3 6

DuckDB recognized both pd_df (pandas) and pl_df (Polars) as DataFrame variables and queried them directly with SQL. No imports or registration needed.
DuckDB’s integration with pandas and Polars lets you combine the strengths of each tool. For example, you can:
- Use pandas for data cleaning and feature engineering
- Use DuckDB for joins and complex aggregations
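For instance, with an illustrative orders table (the column names here are assumptions):

```python
import pandas as pd

orders = pd.DataFrame({
    "order_date": pd.to_datetime(["2024-01-15", "2024-02-20", "2024-02-28"]),
    "category": ["tools", "tools", "toys"],
    "amount": [120.0, 80.0, 300.0],
})

# Feature engineering in pandas: extract the month, flag high-value orders
orders["month"] = orders["order_date"].dt.month
orders["is_high_value"] = orders["amount"] > 100
print(orders)
```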
pandas makes feature engineering straightforward: extracting month from dates and creating is_high_value flags are common transformations for preparing data for analysis or machine learning.
Now use DuckDB for complex aggregations:
DuckDB excels at complex aggregations: combining GROUP BY, AVG, and conditional CASE WHEN in a single query is more readable and efficient than equivalent pandas code.
What makes DuckDB’s approach to complex aggregations more readable than pandas?
Memory Efficiency
A major drawback of pandas is its in-memory processing requirement. It must load complete datasets into RAM before any operations can begin, which can trigger out-of-memory errors when analyzing large datasets.
With pandas, filtering for customers in the ‘North’ region loads ALL records into RAM first:
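A sketch with synthetic data:

```python
import pandas as pd

# 1,000 synthetic customers spread evenly across four regions
customers = pd.DataFrame({
    "customer_id": range(1000),
    "region": ["North", "South", "East", "West"] * 250,
})

# pandas materializes every row first, then discards the non-matching ones
north = customers[customers["region"] == "North"]
print(f"Loaded {len(customers)} rows into memory, kept {len(north)}")
```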
The message shows pandas loaded all 1000 rows into memory just to filter down to 250 North customers. For larger datasets, this memory overhead becomes problematic.
With DuckDB, only the matching rows are processed:
DuckDB’s query optimizer pushes filters down, processing only rows matching region = 'North'. This avoids the memory overhead of loading all 1000 rows before filtering.
What’s the key difference between how pandas and DuckDB handle the filter region = 'North'?
Fast Performance
While pandas executes operations one at a time on a single thread, DuckDB uses a vectorized execution engine that processes data in parallel chunks. This architectural difference lets DuckDB significantly outperform pandas, especially for computationally intensive operations like aggregations and joins.
Let’s compare the performance of pandas and DuckDB for aggregations on a million rows of data.
Pandas aggregation time: 0.13 seconds
DuckDB aggregation time: 0.02 seconds
Speedup: 8.7x

DuckDB completes the same aggregation ~8x faster than pandas. The speedup comes from DuckDB’s vectorized execution engine processing data in parallel chunks.
This benchmark was run on native Python; results may vary in browser-based environments.
How does pandas process data differently from DuckDB?
Streamlined File Reading
DuckDB can query files without loading them into memory first. It automatically detects file formats and structures, including delimiters, headers, and column types.
CSV files:
region count
0 North 250
1 South 250
2 East 250
3 West 250

DuckDB can run the same query directly on a URL without downloading the file first: it streams the CSV data and returns aggregated counts per region.
How does DuckDB handle the remote CSV file in this query?
Automatic Parsing of CSV Files
When working with CSV files that have non-standard delimiters, pandas requires you to specify parameters like delimiter to avoid parsing errors.
With pandas, a pipe-delimited CSV is parsed incorrectly:
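A sketch with a small pipe-delimited file (the flight columns are illustrative):

```python
import pandas as pd

# Write a pipe-delimited file
with open("flights.csv", "w") as f:
    f.write("FlightDate|Carrier|Origin|Destination\n")
    f.write("2024-01-01|AA|JFK|LAX\n")
    f.write("2024-01-02|UA|ORD|SFO\n")

# pandas defaults to comma separation, so everything lands in one column
df = pd.read_csv("flights.csv")
print(df.shape)  # (2, 1)
print(df.columns.tolist())
```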
Pandas created a single column containing the entire pipe-separated row as one string. Without specifying delimiter='|', it defaulted to comma separation.
With DuckDB, the delimiter is auto-detected:
DuckDB correctly parsed 4 separate columns (FlightDate, Carrier, Origin, Destination) by auto-detecting the pipe delimiter. No configuration needed.
How did DuckDB correctly parse the pipe-delimited file?
Automatic Flattening of Nested Parquet Files
When working with large, nested Parquet files, you typically need to pre-process the data to flatten nested structures or write complex extraction scripts, which adds time and complexity to your workflow.
With pandas, you need to manually flatten nested structures:
The details column contains dictionaries. To access name or age, you’d need to manually flatten with list comprehensions or apply().
With DuckDB, query nested fields directly using dot notation:
DuckDB (flattened with dot notation):
id name age
0 1 Alice 25
1 2 Bob 30

DuckDB’s dot notation (details.name, details.age) extracts nested fields directly in SQL. No manual flattening, list comprehensions, or apply() needed.
How does DuckDB access nested fields in Parquet files?
Automatic Flattening of Nested JSON Files
When working with JSON files that have nested structures, you need to normalize the data with pandas to access nested fields.
With pandas, you need json_normalize() to flatten nested structures:
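A sketch with a small nested JSON file (field names are illustrative):

```python
import json

import pandas as pd

# A JSON file with a nested 'profile' object per record
records = [
    {"id": 1, "profile": {"name": "Alice", "active": True}},
    {"id": 2, "profile": {"name": "Bob", "active": False}},
]
with open("users.json", "w") as f:
    json.dump(records, f)

# json_normalize flattens nesting into dotted column names
with open("users.json") as f:
    df = pd.json_normalize(json.load(f))
print(df.columns.tolist())
```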
Flattening nested JSON required importing json_normalize(), calling it on your data, then likely renaming the dot-notation columns (profile.name, profile.active) to something cleaner.
With DuckDB, you can query each nested field directly with the syntax field_name.nested_field_name:
DuckDB queries nested JSON fields with the same dot notation as Parquet. No json_normalize() step needed before analysis.
In the query above, what does profile.name AS name do?
Reading Multiple Files
Reading Multiple Files from a Directory
Reading multiple files from a directory is common in data pipelines.
First, let’s create some sample CSV files to work with:
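For example, two small monthly sales files (the directory and columns are illustrative):

```python
import os

import pandas as pd

os.makedirs("sales_data", exist_ok=True)
pd.DataFrame({"month": ["Jan", "Jan"], "amount": [100, 200]}).to_csv(
    "sales_data/jan.csv", index=False
)
pd.DataFrame({"month": ["Feb", "Feb"], "amount": [150, 250]}).to_csv(
    "sales_data/feb.csv", index=False
)
print(sorted(os.listdir("sales_data")))
```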
With pandas, you need to read each file separately then concatenate:
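A sketch (the sample files are recreated so this block runs on its own):

```python
import glob
import os

import pandas as pd

os.makedirs("sales_data", exist_ok=True)
pd.DataFrame({"month": ["Jan", "Jan"], "amount": [100, 200]}).to_csv("sales_data/jan.csv", index=False)
pd.DataFrame({"month": ["Feb", "Feb"], "amount": [150, 250]}).to_csv("sales_data/feb.csv", index=False)

# One read_csv call per file, then a concat -- tedious as files multiply
frames = [pd.read_csv(path) for path in sorted(glob.glob("sales_data/*.csv"))]
combined = pd.concat(frames, ignore_index=True)
print(len(combined))  # 4
```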
Pandas required reading each file into separate DataFrames, then concatenating them. With more files, this approach becomes tedious and error-prone.
With DuckDB, use glob patterns to read all files at once:
Both January and February data appear in a single result from one query. The *.csv glob pattern matched all CSV files in the directory automatically.
Why is the glob pattern useful as your data grows?
Read From Multiple Sources
DuckDB allows you to read data from multiple sources in a single query, making it easier to combine data from different sources.
You can mix DataFrames, CSV files, Parquet files, and JSON files seamlessly.
First, let’s create two related DataFrames:
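The data below is illustrative:

```python
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
})
orders = pd.DataFrame({
    "order_id": [101, 102, 103, 104],
    "customer_id": [1, 1, 2, 3],
    "amount": [50.0, 75.0, 30.0, 120.0],
})
print(customers)
print(orders)
```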
Now use DuckDB to join and aggregate these DataFrames with a single SQL query:
The query joined two separate DataFrames (customers and orders) with standard SQL syntax. No need to merge DataFrames in pandas first.
What pandas operation does DuckDB’s JOIN replace?
Parameterized Queries
When working with databases, you often need to run similar queries with different parameters. For instance, you might want to filter a table using various criteria.
First, let’s create a sample products table:
You might use f-strings to pass parameters to your queries:
While this works, f-strings are dangerous. A malicious user could:
- Input "0; DROP TABLE products; --" to delete your table
- Input "0 UNION SELECT * FROM secrets" to steal data
DuckDB provides a safer way with parameterized queries using the ? placeholder:
DuckDB binds 400 to the ? placeholder after the query has been parsed. Even if min_price contained malicious SQL, it would be treated as a literal value rather than executable code, protecting your database from injection attacks.
If a malicious user sets min_price = "0; DROP TABLE products", what happens with parameterized queries?
ACID Transactions
DuckDB supports ACID transactions for data integrity:
- Atomicity: The transaction either completes entirely or has no effect at all. If any operation fails, all changes are rolled back.
- Consistency: The database maintains valid data by enforcing all rules and constraints throughout the transaction.
- Isolation: Transactions run independently without interfering with each other.
- Durability: Committed changes are permanent and survive system failures.
Let’s demonstrate ACID properties with a bank transfer. First, set up the accounts:
Now define a transfer function that checks the balance before transferring. If funds are insufficient, it rolls back the transaction:
In the code above:
- BEGIN TRANSACTION starts a transaction block where all changes remain hidden until committed
- COMMIT permanently saves all changes made within the transaction
- ROLLBACK cancels all changes and restores the database to its state before the transaction began
Now let’s perform a valid transfer of $200 from Alice to Bob:
Alice’s balance dropped from $500 to $300, and Bob’s increased from $500 to $700. This demonstrates two ACID properties:
- Atomicity: Both updates executed as a single unit
- Consistency: Total balance stayed at $1500
Now let’s attempt an invalid transfer. Bob tries to send $1000 to Charlie, but he only has $700:
The transfer failed due to insufficient funds. ROLLBACK canceled all changes, so Bob still has $700 and Charlie still has $500. This demonstrates two more ACID properties:
- Atomicity: Either all operations succeed, or none do
- Durability: The successful transfer from earlier is permanently saved
In the bank transfer example, what does Atomicity guarantee?
Key Takeaways
DuckDB empowers data scientists with several key advantages:
- Zero-configuration SQL querying without database server setup
- Seamless integration with pandas and Polars DataFrames
- Efficient memory management for large-scale data processing
- High-performance execution of complex operations including joins and aggregations
- ACID transaction support ensuring data integrity
- Extensible architecture with a comprehensive extension ecosystem
With DuckDB, you can unlock a new level of productivity and efficiency in your data analysis workflows.
Try DuckDB on your own datasets! Start by replacing pandas queries with DuckDB SQL and compare the performance.