Streamline Pattern-Based CSV Processing with DuckDB SQL

Motivation

Data analysts frequently need to aggregate data across multiple CSV files for reporting and analysis. Traditional approaches require discovering files, reading each one, and concatenating the results before any analysis can begin, which adds boilerplate and hurts maintainability.

Here’s a typical approach that illustrates the complexity:

import glob
import pandas as pd

# Read all CSV files and concatenate them
csv_files = glob.glob("data/*.csv")
dfs = []

for file in csv_files:
    df = pd.read_csv(file)
    dfs.append(df)

# Concatenate all dataframes
combined_df = pd.concat(dfs, ignore_index=True)

# Perform grouping and aggregation
result = (
    combined_df.groupby("Date")["Sales"]
    .sum()
    .reset_index()
    .rename(columns={"Sales": "total_sales"})
    .sort_values("Date")
)
result

Output:

         Date  total_sales
0  2023-01-01        1200
1  2023-01-02         800
2  2023-01-03         600
3  2023-02-01        1500
4  2023-02-02         400
5  2023-02-03          50

This approach requires multiple steps and explicit handling of data structures.

Introduction to DuckDB

DuckDB is an analytics-focused database system that runs directly in your application process. It excels at handling analytical queries and provides seamless integration with Python data tools.

Installation is straightforward:

pip install duckdb

As covered in our article about SQL operations on DataFrames, DuckDB provides efficient SQL functionality for data analysis. We’ve also previously discussed CSV data management with DuckDB.

Pattern-Based CSV Processing

DuckDB simplifies multiple CSV file processing through SQL pattern matching. Here’s how to use it:

First, let’s create sample data:

from pathlib import Path
import pandas as pd

# Create sample data
df1 = pd.DataFrame({
    "date": ["2023-01-01", "2023-01-02"],
    "sales": [100, 200]
})

df2 = pd.DataFrame({
    "date": ["2023-01-03", "2023-01-04"],
    "sales": [300, 400]
})

# Save files
Path("data").mkdir(exist_ok=True)
df1.to_csv("data/sales_1.csv", index=False)
df2.to_csv("data/sales_2.csv", index=False)

Process all CSV files with a single SQL query:

import duckdb

# Read and analyze all CSV files at once
result = duckdb.sql("""
    SELECT 
        date,
        SUM(sales) as total_sales
    FROM 'data/*.csv'
    GROUP BY date
    ORDER BY date
""").df()

Output:

         date  total_sales
0  2023-01-01        100.0
1  2023-01-02        200.0
2  2023-01-03        300.0
3  2023-01-04        400.0

In this code, DuckDB:

  • Uses SQL pattern matching to find relevant files
  • Automatically combines matching files
  • Processes data efficiently in a single pass
  • Maintains type consistency across files

Conclusion

DuckDB’s pattern-based CSV processing significantly simplifies working with multiple files, making it an excellent choice for data professionals who need to analyze data across multiple CSV files efficiently.

Link to DuckDB.
