Generic selectors
Exact matches only
Search in title
Search in content
Post Type Selectors
Filter by Categories
About Article
Analyze Data
Archive
Best Practices
Better Outputs
Blog
Code Optimization
Code Quality
Command Line
Course
Daily tips
Dashboard
Data Analysis & Manipulation
Data Engineer
Data Visualization
DataFrame
Delta Lake
DevOps
DuckDB
Environment Management
Feature Engineer
Git
Jupyter Notebook
LLM
LLM Tools
Machine Learning
Machine Learning & AI
Machine Learning Tools
Manage Data
MLOps
Natural Language Processing
Newsletter Archive
NumPy
Pandas
Polars
PySpark
Python Helpers
Python Tips
Python Utilities
Scrape Data
SQL
Testing
Time Series
Tools
Visualization
Visualization & Reporting
Workflow & Automation
Workflow Automation

Newsletter #245: PySpark: Avoid Double Conversions with applyInArrow

Newsletter #245: PySpark: Avoid Double Conversions with applyInArrow


📅 Today’s Picks

PySpark: Avoid Double Conversions with applyInArrow

Code example: PySpark: Avoid Double Conversions with applyInArrow

Problem

applyInPandas lets you apply Pandas functions in PySpark by converting data from Arrow→Pandas→Arrow for each operation.

This double conversion adds serialization overhead that slows down your transformations.

Solution

applyInArrow (introduced in PySpark 4.0.0) works directly with PyArrow tables, eliminating the Pandas conversion step entirely.

This keeps data in Arrow’s columnar format throughout the pipeline, making operations significantly faster.

Trade-off: PyArrow’s syntax is less intuitive than Pandas, but it’s worth it if you’re processing large datasets where performance matters.


☕️ Weekly Finds

causal-learn [ML] – Python package for causal discovery that implements both classical and state-of-the-art causal discovery algorithms

POT [ML] – Python Optimal Transport library providing solvers for optimization problems related to signal, image processing and machine learning

qdrant [MLOps] – High-performance, massive-scale Vector Database and Vector Search Engine for the next generation of AI

Looking for a specific tool? Explore 70+ Python tools →

Stay Current with CodeCut

Actionable Python tips, curated for busy data pros. Skim in under 2 minutes, three times a week.

Leave a Comment

Your email address will not be published. Required fields are marked *

0
    0
    Your Cart
    Your cart is empty
    Scroll to Top

    Work with Khuyen Tran

    Work with Khuyen Tran