Generic selectors
Exact matches only
Search in title
Search in content
Post Type Selectors
Filter by Categories
About Article
Analyze Data
Archive
Best Practices
Better Outputs
Blog
Code Optimization
Code Quality
Command Line
Daily tips
Dashboard
Data Analysis & Manipulation
Data Engineer
Data Visualization
DataFrame
Delta Lake
DevOps
DuckDB
Environment Management
Feature Engineer
Git
Jupyter Notebook
LLM
LLM
Machine Learning
Machine Learning
Machine Learning & AI
Manage Data
MLOps
Natural Language Processing
NumPy
Pandas
Polars
PySpark
Python Tips
Python Utilities
Python Utilities
Scrape Data
SQL
Testing
Time Series
Tools
Visualization
Visualization & Reporting
Workflow & Automation
Workflow Automation

Query Nested Parquet Files Easily Using DuckDB

Table of Contents

Query Nested Parquet Files Easily Using DuckDB

Motivation

Handling large, nested Parquet files can be challenging as it often requires flattening the data or writing custom scripts to extract nested fields. These steps can be time-consuming, error-prone, and may not scale well for complex datasets.

For example, with pandas, you might need to preprocess the data like this:

import pandas as pd

# Create a nested dataset and save it as a Parquet file
data = {
    "id": [1, 2],
    "details": [
        {"name": "Alice", "age": 25},
        {"name": "Bob", "age": 30}
    ]
}

# Convert to a DataFrame
df = pd.DataFrame(data)

# Save as a nested Parquet file
df.to_parquet("nested_data.parquet")
print(df)
   id                       details
0   1  {'name': 'Alice', 'age': 25}
1   2    {'name': 'Bob', 'age': 30}
# Read the DataFrame from the Parquet file
df = pd.read_parquet("nested_data.parquet")

# Create a new DataFrame with the flattened structure
flat_df = pd.DataFrame({
    "id": df["id"],
    "name": [detail["name"] for detail in df["details"]],
    "age": [detail["age"] for detail in df["details"]]
})

print(flat_df)

Output:

   id   name  age
0   1  Alice   25
1   2    Bob   30

This approach requires additional logic to flatten the nested data, which can lead to unnecessary complexity.

Introduction to DuckDB

DuckDB is an analytical in-process SQL database management system designed for high performance, portability, and simplicity. It supports a rich SQL dialect with advanced features, including the ability to query nested and complex data types directly.

DuckDB can be easily installed via pip for Python users:

pip install duckdb

DuckDB eliminates the need for pre-processing nested data by allowing direct SQL queries on nested Parquet files. This makes it an efficient tool for processing structured and semi-structured data.

In this post, we will demonstrate how to query nested Parquet files directly using DuckDB.

Query Nested Parquet Files

DuckDB provides a simple way to query nested Parquet files directly without the need for flattening or complex preprocessing.

Here’s how you can query a nested Parquet file using DuckDB:

import duckdb

# Query the nested Parquet file directly
query_result = duckdb.query("""
    SELECT 
        id, 
        details.name AS name, 
        details.age AS age 
    FROM read_parquet('nested_data.parquet')
""").to_df()

print(query_result)

In this example:

  • The read_parquet function reads the nested Parquet file.
  • Nested fields are accessed using dot notation, such as details.name and details.age.

Output:

   id   name  age
0   1  Alice   25
1   2    Bob   30

The result is a flattened representation of the nested data, queried directly from the Parquet file without additional preprocessing.

Conclusion

DuckDB simplifies working with nested Parquet files by allowing direct SQL queries on nested fields. This feature reduces complexity and saves time, making it a valuable tool for data analysis and engineering tasks.

Link to DuckDB.

Leave a Comment

Your email address will not be published. Required fields are marked *

0
    0
    Your Cart
    Your cart is empty
    Scroll to Top

    Work with Khuyen Tran

    Work with Khuyen Tran