DuckDB: Simplify DataFrame Analysis with Serverless SQL

Khuyen Tran

Motivation

SQL operations on data frames are crucial for data analysis workflows. Data scientists and analysts need to perform complex analytical queries without the overhead of managing traditional database systems.

When working with traditional database systems, analysts need to set up and maintain separate database servers, which adds complexity and overhead to their analytical workflows.

import pandas as pd

# Create sample data
df = pd.DataFrame({
    'customer_id': range(1000),
    'revenue': range(1000),
    'segment': ['A', 'B'] * 500
})

# Traditional approach with PostgreSQL
import psycopg2
from sqlalchemy import create_engine

# Create a connection to PostgreSQL
engine = create_engine('postgresql://postgres:postgres@localhost:5432/postgres')

# Write the DataFrame to a PostgreSQL table
df.to_sql('customers', engine, if_exists='replace', index=False)

# Execute SQL query against the PostgreSQL database
with engine.connect() as conn:
    result = pd.read_sql("""
        SELECT 
            segment,
            AVG(revenue) as avg_revenue,
            COUNT(*) as customer_count
        FROM customers
        GROUP BY segment
        ORDER BY avg_revenue DESC
    """, conn)

print("Segment Analysis:")
print(result)

Output:

Segment Analysis:
  segment  avg_revenue  customer_count
0       B        500.0             500
1       A        499.0             500

This approach requires:

PostgreSQL server installation and setup
Database credentials management
Connection string configuration

Introduction to DuckDB

DuckDB is a high-performance analytical database system that provides SQL functionality directly on pandas DataFrames without requiring server setup.

As covered in the article about CSV data management, it simplifies data import processes, and based on the article about optimized aggregation, it provides significant performance improvements over pandas for complex calculations.

Install it using:

pip install duckdb

SQL Operations on DataFrames

Here’s how to perform the same analysis with DuckDB, requiring no server setup:

import duckdb

# Direct SQL operations on DataFrame - no server needed
result = duckdb.sql("""
    SELECT 
        segment,
        AVG(revenue) as avg_revenue,
        COUNT(*) as customer_count
    FROM df
    GROUP BY segment
    ORDER BY avg_revenue DESC
""").df()

print("Segment Analysis:")
print(result)

Output:

Segment Analysis:
  segment  avg_revenue  customer_count
0       B        500.0             500
1       A        499.0             500

Conclusion

DuckDB transforms complex DataFrame operations into efficient SQL queries without the overhead of traditional database systems, making it an ideal tool for data analysts and scientists working with large datasets.

Link to DuckDB