Streamline Pattern-Based CSV Processing with DuckDB SQL

Motivation

Data analysts frequently need to aggregate data across multiple CSV files for reporting and analysis. Traditional approaches require explicit file handling and several intermediate data processing steps, which adds boilerplate and hurts code maintainability.

Here’s a typical approach that illustrates the complexity:

import glob
import pandas as pd

# Read all CSV files and concatenate them
csv_files = glob.glob("data/*.csv")
dfs = []

for file in csv_files:
    df = pd.read_csv(file)
    dfs.append(df)

# Concatenate all dataframes
combined_df = pd.concat(dfs, ignore_index=True)

# Perform grouping and aggregation
result = (
    combined_df.groupby("date")["sales"]
    .sum()
    .reset_index()
    .rename(columns={"sales": "total_sales"})
    .sort_values("date")
)
result

Output:

         date  total_sales
0  2023-01-01          100
1  2023-01-02          200
2  2023-01-03          300
3  2023-01-04          400

This approach requires multiple steps: discovering files, reading each one, concatenating the results, and only then aggregating.

Introduction to DuckDB

DuckDB is an analytics-focused database system that runs directly in your application process. It excels at handling analytical queries and provides seamless integration with Python data tools.

Installation is straightforward:

pip install duckdb

As covered in our article about SQL operations on DataFrames, DuckDB provides efficient SQL functionality for data analysis. We’ve also previously discussed CSV data management with DuckDB.

Pattern-Based CSV Processing

DuckDB simplifies multiple CSV file processing through SQL pattern matching. Here’s how to use it:

First, let’s create sample data:

from pathlib import Path
import pandas as pd

# Create sample data
df1 = pd.DataFrame({
    "date": ["2023-01-01", "2023-01-02"],
    "sales": [100, 200]
})

df2 = pd.DataFrame({
    "date": ["2023-01-03", "2023-01-04"],
    "sales": [300, 400]
})

# Save files
Path("data").mkdir(exist_ok=True)
df1.to_csv("data/sales_1.csv", index=False)
df2.to_csv("data/sales_2.csv", index=False)

Process all CSV files with a single SQL query:

import duckdb

# Read and analyze all CSV files at once
result = duckdb.sql("""
    SELECT 
        date,
        SUM(sales) as total_sales
    FROM 'data/*.csv'
    GROUP BY date
    ORDER BY date
""").df()

Output:

         date  total_sales
0  2023-01-01          100
1  2023-01-02          200
2  2023-01-03          300
3  2023-01-04          400

In this code, DuckDB:

  • Uses glob pattern matching in the file path ('data/*.csv') to find relevant files
  • Automatically combines matching files
  • Processes data efficiently in a single pass
  • Maintains type consistency across files

Conclusion

DuckDB’s pattern-based CSV processing significantly simplifies working with multiple files, making it an excellent choice for data professionals who need to analyze data across multiple CSV files efficiently.

Link to DuckDB.
