
Pandas Alternatives

Accelerate DataFrame Operations with Polars Parallel Processing

Motivation

Data engineers frequently need to process multiple related datasets together. When using pandas, each DataFrame is typically processed sequentially, which can be inefficient and time-consuming.

Here’s a common inefficient approach with pandas:

import numpy as np
import pandas as pd

def add_metric_scaled(df, metric_column):
    return df.assign(
        metric_scaled=lambda x: (x[metric_column] - x[metric_column].mean())
        / x[metric_column].std()
    )

# Create the first DataFrame with purchases data
df1 = pd.DataFrame(
    {"user_id": range(1000), "purchases": np.random.randint(1, 100, 1000)}
)
df1 = add_metric_scaled(df1, "purchases")

# Create the second DataFrame with clicks data
df2 = pd.DataFrame({"user_id": range(1000), "clicks": np.random.randint(1, 500, 1000)})
df2 = add_metric_scaled(df2, "clicks")

# Create the third DataFrame with page_views data
df3 = pd.DataFrame(
    {"user_id": range(1000), "page_views": np.random.randint(1, 1000, 1000)}
)
df3 = add_metric_scaled(df3, "page_views")

This sequential approach has several drawbacks:

Each DataFrame is processed one after another

CPU cores remain underutilized

Total processing time increases linearly with the number of DataFrames

Memory usage isn’t optimized

Understanding Parallel DataFrame Collection

Modern CPUs have multiple cores that can process data simultaneously. While pandas operations are primarily single-threaded, Polars is designed for parallel execution, allowing multiple DataFrame operations to run concurrently.

Introduction to Polars

This article covers how to speed up your data processing workflows by taking advantage of Polars’ ability to execute operations in parallel across multiple DataFrames. If you’re new to Polars or considering transitioning from Pandas, the comparison article Polars vs. Pandas: A Fast, Multi-Core Alternative for DataFrames covers the advantages and differences between the two libraries.

Polars is a high-performance DataFrame library that excels at parallel processing. Install it using:

pip install polars

Parallel Collection with collect_all

Let’s solve the sequential processing problem using Polars’ collect_all:

import numpy as np
import polars as pl

def add_metric_scaled(df, metric_column):
    # Note: as written, .alias() binds only to the std() term, so the result
    # keeps the metric column's name and overwrites it (see the output below).
    # Wrap the whole expression in parentheses before .alias() to add a
    # separate metric_scaled column instead.
    return df.with_columns(
        [
            (pl.col(metric_column) - pl.col(metric_column).mean())
            / pl.col(metric_column).std().alias("metric_scaled")
        ]
    )

# Create the first LazyFrame with purchases data
lazy_frame1 = add_metric_scaled(
    pl.DataFrame(
        {"user_id": range(1000), "purchases": np.random.randint(1, 100, 1000)}
    ).lazy(),
    "purchases",
)

# Create the second LazyFrame with clicks data
lazy_frame2 = add_metric_scaled(
    pl.DataFrame(
        {"user_id": range(1000), "clicks": np.random.randint(1, 500, 1000)}
    ).lazy(),
    "clicks",
)

# Create the third LazyFrame with page_views data
lazy_frame3 = add_metric_scaled(
    pl.DataFrame(
        {"user_id": range(1000), "page_views": np.random.randint(1, 1000, 1000)}
    ).lazy(),
    "page_views",
)

# Process all frames in parallel
results = pl.collect_all([lazy_frame1, lazy_frame2, lazy_frame3])
print(results)

Output:

[shape: (1_000, 2)
┌─────────┬───────────┐
│ user_id ┆ purchases │
│ ---     ┆ ---       │
│ i64     ┆ f64       │
╞═════════╪═══════════╡
│ 0       ┆ -1.553524 │
│ 1       ┆ -0.528352 │
│ 2       ┆ -1.200017 │
│ 3       ┆ -1.093965 │
│ 4       ┆ -1.412121 │
│ …       ┆ …         │
│ 995     ┆ 1.027081  │
│ 996     ┆ -1.553524 │
│ 997     ┆ -0.669755 │
│ 998     ┆ -0.705106 │
│ 999     ┆ 0.03726   │
└─────────┴───────────┘, shape: (1_000, 2)
┌─────────┬───────────┐
│ user_id ┆ clicks    │
│ ---     ┆ ---       │
│ i64     ┆ f64       │
╞═════════╪═══════════╡
│ 0       ┆ -1.32932  │
│ 1       ┆ 1.250184  │
│ 2       ┆ -0.560815 │
│ 3       ┆ 0.047306  │
│ 4       ┆ 1.31701   │
│ …       ┆ …         │
│ 995     ┆ 1.611047  │
│ 996     ┆ 1.169992  │
│ 997     ┆ 0.354708  │
│ 998     ┆ -0.914995 │
│ 999     ┆ 1.136579  │
└─────────┴───────────┘, shape: (1_000, 2)
┌─────────┬────────────┐
│ user_id ┆ page_views │
│ ---     ┆ ---        │
│ i64     ┆ f64        │
╞═════════╪════════════╡
│ 0       ┆ 0.042274   │
│ 1       ┆ 1.50377    │
│ 2       ┆ -0.368771  │
│ 3       ┆ -1.72487   │
│ 4       ┆ -1.742436  │
│ …       ┆ …          │
│ 995     ┆ -0.814949  │
│ 996     ┆ 1.531876   │
│ 997     ┆ -1.728383  │
│ 998     ┆ -0.249322  │
│ 999     ┆ 0.741403   │
└─────────┴────────────┘]

The benefits include:

All DataFrames are processed simultaneously

Better CPU utilization through parallel processing

Reduced total processing time

Optimized memory usage through Polars’ efficient memory management

Conclusion

Polars’ collect_all function can reduce total processing time compared with sequential pandas processing by executing multiple DataFrame computations in parallel. This approach is particularly valuable when multiple related datasets need similar transformations.


Polars: Blazing Fast DataFrame Library

If you want a data manipulation library that’s both fast and memory-efficient, try Polars. Polars provides a high-level API similar to Pandas but with better performance for large datasets.

Performance Comparison: Polars vs Pandas

To compare the performance of these two libraries, create two Pandas DataFrames, each with 1 million rows.

import pandas as pd
import polars as pl
import numpy as np
import time

# Create two Pandas DataFrames with 1 million rows each
pandas_df1 = pd.DataFrame({
    'key': np.random.randint(0, 1000, size=1_000_000),
    'value1': np.random.rand(1_000_000)
})

pandas_df2 = pd.DataFrame({
    'key': np.random.randint(0, 1000, size=1_000_000),
    'value2': np.random.rand(1_000_000)
})

# Create two Polars DataFrames from the Pandas DataFrames
polars_df1 = pl.from_pandas(pandas_df1)
polars_df2 = pl.from_pandas(pandas_df2)

Next, we merge the two Pandas DataFrames on the ‘key’ column using the merge method. We also measure the execution time using the time library.

# Merge the two DataFrames on the 'key' column
start_time = time.time()
pandas_merged = pd.merge(pandas_df1, pandas_df2, on='key')
pandas_time = time.time() - start_time

Similarly, we merge the two Polars DataFrames on the ‘key’ column using the join method. We also measure the execution time using the time library.

start_time = time.time()
polars_merged = polars_df1.join(polars_df2, on='key')
polars_time = time.time() - start_time

Print the execution times for both Pandas and Polars:

print(f"Pandas time: {pandas_time:.6f} seconds")
print(f"Polars time: {polars_time:.6f} seconds")

On my test machine, the results were:

Pandas time: 127.604390 seconds
Polars time: 41.079080 seconds

This means that Polars is approximately 3.11 times faster than Pandas for this specific task.
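
The speedup figure follows directly from the two reported timings. A minimal check, using only the numbers printed above:

```python
# Timings reported above (seconds)
pandas_time = 127.604390
polars_time = 41.079080

# Speedup is the ratio of the slower time to the faster time
speedup = pandas_time / polars_time
print(f"Polars speedup: {speedup:.2f}x")  # prints "Polars speedup: 3.11x"
```

These timings come from a single run on one machine, so the ratio is best read as indicative for this join rather than a general figure.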

Conclusion

Polars is a fast and memory-efficient data manipulation library that provides a high-level API similar to Pandas. With its ability to handle large datasets and perform complex operations quickly, Polars is an excellent choice for data scientists and analysts who need to work with big data.

Getting Started with Polars

To get started with Polars, simply install it using pip:

pip install polars

You can then import Polars in your Python code and start using its powerful features.

Learn More

To learn more about Polars and its features, check out the official documentation on GitHub.


Building a High-Performance Data Stack with Polars and Delta Lake

Polars is a DataFrame library written in Rust that has blazing-fast performance. Delta Lake has helpful features including ACID transactions, time travel, schema enforcement, and more.

Combining these two tools makes the code exceptionally powerful and efficient for data processing and analysis.

To read a Delta table into a Polars DataFrame, use polars.read_delta.


Accelerating Complex Calculations: From Pandas to DuckDB

For complex aggregations, Pandas repeatedly scans the full dataset to compute metrics like averages and sums. This approach becomes increasingly inefficient as aggregation complexity or data volume grows.

DuckDB reads only the necessary columns and processes data in chunks. This makes it much faster for complex calculations, especially with large datasets.

In the code below, aggregating data with DuckDB is about 6 times faster than aggregating with pandas.

import pandas as pd
import duckdb

df = pd.read_parquet("lineitemsf1.snappy.parquet")

%%timeit
df.groupby('l_returnflag').agg(
    Sum=('l_extendedprice', 'sum'),
    Min=('l_extendedprice', 'min'),
    Max=('l_extendedprice', 'max'),
    Avg=('l_extendedprice', 'mean')
)

Output:

226 ms ± 4.63 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
duckdb.query("""
    SELECT
        l_returnflag,
        SUM(l_extendedprice),
        MIN(l_extendedprice),
        MAX(l_extendedprice),
        AVG(l_extendedprice)
    FROM df
    GROUP BY
        l_returnflag
""").to_df()

Output:

37 ms ± 2.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


Pandas vs Polars: Syntax Comparison for Data Scientists

Both Pandas and Polars are robust data manipulation tools, but their syntaxes differ subtly. Let’s delve into how these libraries handle common data tasks.

To begin, we’ll create equivalent dataframes in both Pandas and Polars:

import pandas as pd
import polars as pl

# Sample data
sample_data = {
    "Category": ["Electronics", "Clothing", "Electronics", "Clothing", "Electronics"],
    "Quantity": [5, 2, 3, 10, 4],
    "Price": [200, 30, 150, 20, 300],
}

# Dataframe creation
pandas_df = pd.DataFrame(sample_data)
polars_df = pl.DataFrame(sample_data)

Key Operations Comparison

Column Selection

Pandas:

pandas_df[["Category", "Price"]]

Polars:

polars_df.select(["Category", "Price"])

Row Filtering

Pandas:

pandas_df[pandas_df["Quantity"] > 3]

Polars:

polars_df.filter(pl.col("Quantity") > 3)

Grouping and Aggregation

Pandas:

pandas_df.groupby("Category").agg(
    {
        "Quantity": "sum",
        "Price": "mean",
    }
)

Polars:

polars_df.group_by("Category").agg(
    [
        pl.col("Quantity").sum(),
        pl.col("Price").mean(),
    ]
)

Polars tends to use more explicit, verb-based methods, while Pandas leverages more concise bracket notation.

The choice between Pandas and Polars often comes down to performance needs, library familiarity, and personal preference. Polars is known for its speed and efficiency. Pandas, on the other hand, has a larger ecosystem and is more widely adopted.


Polars: Write Queries Intuitively, Process Data Efficiently

Polars allows you to write queries intuitively while delivering top-notch performance with these features:

Write your queries in a straightforward manner, and Polars will automatically determine the most efficient execution plan.

Utilize the power of your machine by dividing the workload among the available CPU cores without any additional configuration.


Pandas vs Polars: Harness Parallelism for Faster Data Processing

Pandas is a single-threaded library, utilizing only a single CPU core. To achieve parallelism with Pandas, you would need to use additional libraries like Dask.

import pandas as pd
import multiprocessing as mp
import dask.dataframe as dd

df = pd.DataFrame({"A": range(1_000_000), "B": range(1_000_000)})

# Perform the groupby and sum operation in parallel
ddf = dd.from_pandas(df, npartitions=mp.cpu_count())
result = ddf.groupby("A").sum().compute()

Polars, on the other hand, automatically leverages the available CPU cores without any additional configuration.

import polars as pl

df = pl.DataFrame({"A": range(1_000_000), "B": range(1_000_000)})

# Perform the groupby and sum operation in parallel
result = df.group_by("A").sum()


Polars’ Streaming Mode: A Solution for Large Data Sets

The default collect method in Polars processes your data as a single batch, which means that all the data must fit into your available memory.

If your data requires more memory than you have available, use the streaming mode to process it in batches. To use streaming mode, simply pass the streaming=True argument to the collect method.
