Streamline Pattern-Based CSV Processing with DuckDB SQL

Motivation

Data analysts frequently need to aggregate data across multiple CSV files for reporting and analysis. Traditional approaches require complex file handling and multiple data processing steps, which can impact productivity and code maintainability.

Here’s a typical approach that illustrates the complexity:

import glob
import pandas as pd

# Read all CSV files and concatenate them
csv_files = glob.glob("data/*.csv")
dfs = []

for file in csv_files:
    df = pd.read_csv(file)
    dfs.append(df)

# Concatenate all dataframes
combined_df = pd.concat(dfs, ignore_index=True)

# Perform grouping and aggregation
result = (
    combined_df.groupby("Date")["Sales"]
    .sum()
    .reset_index()
    .rename(columns={"Sales": "total_sales"})
    .sort_values("Date")
)
result

Output:

         Date  total_sales
0  2023-01-01        1200
1  2023-01-02         800
2  2023-01-03         600
3  2023-02-01        1500
4  2023-02-02         400
5  2023-02-03          50

This approach requires multiple steps and explicit handling of data structures.

Introduction to DuckDB

DuckDB is an analytics-focused database system that runs directly in your application process. It excels at handling analytical queries and provides seamless integration with Python data tools.

Installation is straightforward:

pip install duckdb

As covered in our article about SQL operations on DataFrames, DuckDB provides efficient SQL functionality for data analysis. We’ve also previously discussed CSV data management with DuckDB.

Pattern-Based CSV Processing

DuckDB simplifies multiple CSV file processing through SQL pattern matching. Here’s how to use it:

First, let’s create sample data:

from pathlib import Path
import pandas as pd

# Create sample data
df1 = pd.DataFrame({
    "date": ["2023-01-01", "2023-01-02"],
    "sales": [100, 200]
})

df2 = pd.DataFrame({
    "date": ["2023-01-03", "2023-01-04"],
    "sales": [300, 400]
})

# Save files
Path("data").mkdir(exist_ok=True)
df1.to_csv("data/sales_1.csv", index=False)
df2.to_csv("data/sales_2.csv", index=False)

Process all CSV files with a single SQL query:

import duckdb

# Read and analyze all CSV files at once
result = duckdb.sql("""
    SELECT 
        date,
        SUM(sales) as total_sales
    FROM 'data/*.csv'
    GROUP BY date
    ORDER BY date
""").df()

Output:

         Date  total_sales
0  2023-01-01        1200
1  2023-01-02         800
2  2023-01-03         600
3  2023-02-01        1500
4  2023-02-02         400
5  2023-02-03          50

In this code, DuckDB:

Uses SQL pattern matching to find relevant files
Automatically combines matching files
Processes data efficiently in a single pass
Maintains type consistency across files

Conclusion

DuckDB’s pattern-based CSV processing significantly simplifies working with multiple files, making it an excellent choice for data professionals who need to analyze data across multiple CSV files efficiently.

Link to DuckDB.

Search

DuckDB

Streamline Pattern-Based CSV Processing with DuckDB SQL

Streamline Pattern-Based CSV Processing with DuckDB SQL

Motivation

Introduction to DuckDB

Pattern-Based CSV Processing

Conclusion

Search

Related Posts

Query Nested Parquet Files Easily Using DuckDB

DuckDB: Simplify DataFrame Analysis with Serverless SQL

DuckDB: Query Pandas DataFrames Faster with Columnar Storage

Leave a Comment Cancel Reply

Related Posts

Simplify Tabular Dataset Preparation with TabularPandas

Query Nested Parquet Files Easily Using DuckDB

RapidFuzz: Find Similar Strings Despite Typos and Variations

Stay up-to-date with
data skills using
CodeCut

Drop a line

Get in touch

Follow Us on Social Media

Streamline Pattern-Based CSV Processing with DuckDB SQL

Streamline Pattern-Based CSV Processing with DuckDB SQL

Motivation

Introduction to DuckDB

Pattern-Based CSV Processing

Conclusion

Search

Related Posts

Leave a Comment Cancel Reply

Related Posts

Stay up-to-date with data skills using CodeCut

Follow Us on Social Media

Work with Khuyen Tran

Work with Khuyen Tran

Stay up-to-date with
data skills using
CodeCut