| 📅 Today’s Picks |
PySpark: Avoid Double Conversions with applyInArrow
Problem:
applyInPandas lets you apply Pandas functions in PySpark, but under the hood each batch of data is converted Arrow→Pandas→Arrow.
This double conversion adds serialization overhead that slows down your transformations.
Solution:
applyInArrow (introduced in PySpark 4.0.0) works directly with PyArrow tables, eliminating the Pandas conversion step entirely.
This keeps data in Arrow’s columnar format throughout the pipeline, making operations significantly faster.
Trade-off: PyArrow’s syntax is less intuitive than Pandas’s, but it’s worth it if you’re processing large datasets where performance matters.
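As a rough sketch of what this looks like in practice (assuming PySpark 4.0+ with a local SparkSession; the DataFrame, column names, and mean-centering logic below are illustrative, not from the article):

```python
import pyarrow as pa
import pyarrow.compute as pc
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 10.0), (1, 20.0), (2, 30.0)], ["id", "value"]
)

def center(table: pa.Table) -> pa.Table:
    # Operates on the group's Arrow table directly: no Pandas round-trip.
    value = table.column("value")
    centered = pc.subtract(value, pc.mean(value))
    idx = table.schema.get_field_index("value")
    return table.set_column(idx, "value", centered)

df.groupBy("id").applyInArrow(center, schema="id long, value double").show()
```

The grouped function receives and returns a pyarrow.Table, so the data stays in Arrow's columnar format end to end.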
| ⭐ Related Post |
Batch Process DataFrames with PySpark Pandas UDF Vectorization
Problem:
Traditional UDFs (User-Defined Functions) run your custom Python function on each row individually, serializing data between the JVM and Python one row at a time, which can significantly slow down DataFrame operations.
Solution:
Pandas UDFs solve this by transferring data in Arrow batches and applying vectorized pandas transformations across entire columns, rather than looping through rows.
As a result, they can be 10 to 100 times faster on large DataFrames.
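A minimal sketch of a vectorized Pandas UDF (assuming PySpark 3.0+, where the type-hint style below was introduced; the column name and doubling logic are hypothetical):

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0,), (2.0,), (3.0,)], ["value"])

@pandas_udf("double")
def times_two(s: pd.Series) -> pd.Series:
    # Each call receives a whole batch as a pandas Series,
    # so the multiply is vectorized rather than applied per row.
    return s * 2

df.select(times_two("value").alias("doubled")).show()
```

Because each invocation handles an entire batch, the per-row serialization and call overhead of a traditional UDF disappears.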
| ☕️ Weekly Finds |
causal-learn
ML
Python package for causal discovery that implements both classical and state-of-the-art algorithms
POT
ML
Python Optimal Transport library providing solvers for optimal transport problems in signal and image processing and machine learning
qdrant
MLOps
High-performance, massive-scale Vector Database and Vector Search Engine for the next generation of AI