| 📅 Today’s Picks |
PySpark: Avoid Double Conversions with applyInArrow
Problem:
applyInPandas lets you apply Pandas functions in PySpark, but under the hood each batch of data is converted Arrow→Pandas→Arrow.
This double conversion adds serialization overhead that slows down your transformations.
Solution:
applyInArrow (introduced in PySpark 4.0.0) works directly with PyArrow tables, eliminating the Pandas conversion step entirely.
This keeps data in Arrow’s columnar format throughout the pipeline, making operations significantly faster.
Trade-off: PyArrow’s syntax is less intuitive than Pandas’s, but it’s worth it if you’re processing large datasets where performance matters.
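As a rough sketch of what this looks like in practice (assuming PySpark 4.0+ with a local SparkSession; the DataFrame, column names, and mean-centering logic below are illustrative, not from the article):

```python
import pyarrow as pa
import pyarrow.compute as pc
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 10.0), (1, 20.0), (2, 30.0)], ["id", "value"]
)

def center(table: pa.Table) -> pa.Table:
    # Operates on the group's Arrow table directly: no Pandas round-trip.
    value = table.column("value")
    centered = pc.subtract(value, pc.mean(value))
    idx = table.schema.get_field_index("value")
    return table.set_column(idx, "value", centered)

df.groupBy("id").applyInArrow(center, schema="id long, value double").show()
```

The grouped function receives and returns a pyarrow.Table, so the data stays in Arrow's columnar format end to end.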
| ⭐ Related Post |
Batch Process DataFrames with PySpark Pandas UDF Vectorization
Problem:
Traditional UDFs (User-Defined Functions) run your custom Python function on each row individually, serializing data between the JVM and Python one row at a time, which can significantly slow down DataFrame operations.
Solution:
Pandas UDFs solve this by transferring data in Arrow batches and applying vectorized pandas transformations across entire columns, rather than looping through rows.
As a result, they can be 10 to 100 times faster on large DataFrames.
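A minimal sketch of a vectorized Pandas UDF (assuming PySpark 3.0+, where the type-hint style below was introduced; the column name and doubling logic are hypothetical):

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0,), (2.0,), (3.0,)], ["value"])

@pandas_udf("double")
def times_two(s: pd.Series) -> pd.Series:
    # Each call receives a whole batch as a pandas Series,
    # so the multiply is vectorized rather than applied per row.
    return s * 2

df.select(times_two("value").alias("doubled")).show()
```

Because each invocation handles an entire batch, the per-row serialization and call overhead of a traditional UDF disappears.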
| ☕️ Weekly Finds |
causal-learn
ML
Python package for causal discovery that implements both classical and state-of-the-art algorithms
POT
ML
Python Optimal Transport library providing solvers for optimal transport problems in signal and image processing and machine learning
qdrant
MLOps
High-performance, massive-scale Vector Database and Vector Search Engine for the next generation of AI