Pandas runs most of its operations on a single thread, so it uses only one CPU core no matter how many are available. To parallelize a Pandas workload, you need an additional library such as Dask.
import pandas as pd
import multiprocessing as mp
import dask.dataframe as dd
df = pd.DataFrame({"A": range(1_000_000), "B": range(1_000_000)})
# Split the DataFrame into one partition per CPU core
ddf = dd.from_pandas(df, npartitions=mp.cpu_count())
# Build the groupby-sum lazily; compute() runs it across the partitions in parallel
result = ddf.groupby("A").sum().compute()
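To make the idea behind a partitioned groupby more concrete, here is a minimal sketch using only the Python standard library. It is not how Dask is implemented; it only illustrates the general pattern of splitting the data, aggregating each partition independently, and merging the partial results. The names `partial_sums` and `partitioned_group_sum` are hypothetical helpers invented for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sums(chunk):
    """Group-and-sum one partition, given as a list of (key, value) pairs."""
    totals = {}
    for key, value in chunk:
        totals[key] = totals.get(key, 0) + value
    return totals


def partitioned_group_sum(pairs, n_partitions, executor):
    """Split pairs into partitions, aggregate each on the executor, then merge."""
    pairs = list(pairs)
    size = max(1, -(-len(pairs) // n_partitions))  # ceiling division
    chunks = [pairs[i:i + size] for i in range(0, len(pairs), size)]

    merged = {}
    # executor.map hands each partition to a worker and yields results in order
    for totals in executor.map(partial_sums, chunks):
        for key, value in totals.items():
            merged[key] = merged.get(key, 0) + value
    return merged


if __name__ == "__main__":
    data = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("c", 5)]
    # Threads keep the sketch simple; see the note below about processes
    with ThreadPoolExecutor(max_workers=2) as pool:
        print(partitioned_group_sum(data, n_partitions=2, executor=pool))
```

The sketch uses a thread pool so that it runs anywhere without extra setup. For CPU-bound pure-Python work like this, threads are limited by the global interpreter lock, so a real speedup would likely require passing a `ProcessPoolExecutor` instead, which accepts the same `map` call.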
Polars, on the other hand, is multithreaded by default: it spreads work across the available CPU cores without any additional configuration.
import polars as pl
df = pl.DataFrame({"A": range(1_000_000), "B": range(1_000_000)})
# The same groupby and sum; Polars runs it on multiple threads by default
result = df.group_by("A").sum()