Accelerating Complex Calculations: From Pandas to DuckDB

For complex aggregations, Pandas loads the entire dataset into memory and scans it repeatedly to compute each metric, such as sums and averages. This approach becomes increasingly inefficient as the number of aggregations or the volume of data grows.

DuckDB, in contrast, reads only the columns a query references and processes the data in chunks. This makes it much faster for complex calculations, especially on large datasets.

In the code below, aggregating the data with DuckDB is roughly six times faster than aggregating it with Pandas (37 ms vs. 226 ms).

import pandas as pd
import duckdb

# Load the entire Parquet file into memory as a DataFrame
df = pd.read_parquet("lineitemsf1.snappy.parquet")

%%timeit
# Compute four aggregations per return flag with Pandas
df.groupby('l_returnflag').agg(
    Sum=('l_extendedprice', 'sum'),
    Min=('l_extendedprice', 'min'),
    Max=('l_extendedprice', 'max'),
    Avg=('l_extendedprice', 'mean')
)

Output:

226 ms ± 4.63 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
duckdb.query("""
SELECT
    l_returnflag,
    SUM(l_extendedprice),
    MIN(l_extendedprice),
    MAX(l_extendedprice),
    AVG(l_extendedprice)
FROM df
GROUP BY l_returnflag
""").to_df()

Output:

37 ms ± 2.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
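Note that the query above still reads from the in-memory DataFrame df. Because DuckDB only scans the columns a query references, you can also skip the pd.read_parquet step entirely and aggregate the Parquet file in place. A minimal sketch, assuming the same lineitemsf1.snappy.parquet file sits in the working directory:

import duckdb

# Query the Parquet file directly; DuckDB reads only the two columns
# referenced below instead of loading the whole file into memory.
duckdb.query("""
SELECT
    l_returnflag,
    SUM(l_extendedprice),
    MIN(l_extendedprice),
    MAX(l_extendedprice),
    AVG(l_extendedprice)
FROM 'lineitemsf1.snappy.parquet'
GROUP BY l_returnflag
""").to_df()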

Link to DuckDB.
