Pandas vs Polars: Harness Parallelism for Faster Data Processing

Khuyen Tran

Pandas is largely single-threaded: most operations run on one CPU core, no matter how many cores the machine has. To parallelize a Pandas workload, you need an additional library such as Dask.

import pandas as pd
import multiprocessing as mp
import dask.dataframe as dd


df = pd.DataFrame({"A": range(1_000_000), "B": range(1_000_000)})

# Split the DataFrame into one partition per CPU core,
# then run the groupby and sum across partitions in parallel
ddf = dd.from_pandas(df, npartitions=mp.cpu_count())
result = ddf.groupby("A").sum().compute()

Polars, on the other hand, is multithreaded by default and uses the available CPU cores without any additional configuration.

import polars as pl

df = pl.DataFrame({"A": range(1_000_000), "B": range(1_000_000)})

# group_by and sum run in parallel across the available cores by default
result = df.group_by("A").sum()

Link to Polars.

Interact with this code in Google Colab.