Streamline Pattern-Based CSV Processing with DuckDB SQL

Motivation

Data analysts frequently need to aggregate data across multiple CSV files for reporting and analysis. Traditional approaches require explicit file handling and several intermediate data processing steps, which adds boilerplate and hurts code maintainability.

Here’s a typical approach that illustrates the complexity:

import glob
import pandas as pd

# Read all CSV files and concatenate them
csv_files = glob.glob("data/*.csv")
dfs = []

for file in csv_files:
    df = pd.read_csv(file)
    dfs.append(df)

# Concatenate all dataframes
combined_df = pd.concat(dfs, ignore_index=True)

# Perform grouping and aggregation
result = (
    combined_df.groupby("date")["sales"]
    .sum()
    .reset_index()
    .rename(columns={"sales": "total_sales"})
    .sort_values("date")
)
result

Output:

         date  total_sales
0  2023-01-01          100
1  2023-01-02          200
2  2023-01-03          300
3  2023-01-04          400

This approach requires multiple steps: discovering files, reading each one, concatenating the results, and only then aggregating.

Introduction to DuckDB

DuckDB is an analytics-focused database system that runs directly in your application process. It excels at handling analytical queries and provides seamless integration with Python data tools.

Installation is straightforward:

pip install duckdb

As covered in our article about SQL operations on DataFrames, DuckDB provides efficient SQL functionality for data analysis. We’ve also previously discussed CSV data management with DuckDB.

Pattern-Based CSV Processing

DuckDB simplifies multiple CSV file processing through SQL pattern matching. Here’s how to use it:

First, let’s create sample data:

from pathlib import Path
import pandas as pd

# Create sample data
df1 = pd.DataFrame({
    "date": ["2023-01-01", "2023-01-02"],
    "sales": [100, 200]
})

df2 = pd.DataFrame({
    "date": ["2023-01-03", "2023-01-04"],
    "sales": [300, 400]
})

# Save files
Path("data").mkdir(exist_ok=True)
df1.to_csv("data/sales_1.csv", index=False)
df2.to_csv("data/sales_2.csv", index=False)

Process all CSV files with a single SQL query:

import duckdb

# Read and analyze all CSV files at once
result = duckdb.sql("""
    SELECT 
        date,
        SUM(sales) as total_sales
    FROM 'data/*.csv'
    GROUP BY date
    ORDER BY date
""").df()

Output:

         date  total_sales
0  2023-01-01          100
1  2023-01-02          200
2  2023-01-03          300
3  2023-01-04          400

In this code, DuckDB:

  • Uses glob pattern matching in the file path ('data/*.csv') to find relevant files
  • Automatically combines matching files
  • Processes data efficiently in a single pass
  • Maintains type consistency across files

Conclusion

DuckDB’s pattern-based CSV processing significantly simplifies working with multiple files, making it an excellent choice for data professionals who need to analyze data across multiple CSV files efficiently.

Link to DuckDB.
