Update Multiple Columns in Spark 3.3 and Later


Motivation

Before Spark 3.3, updating multiple columns in a Spark DataFrame was cumbersome and verbose, requiring a separate chained withColumn call for each column. This made the process less intuitive, especially for users transitioning from pandas.

For example, consider the following DataFrame:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, trim

spark = SparkSession.builder.getOrCreate()

data = [("   John   ", 35), ("Jane", 28)]
columns = ["first_name", "age"]
df = spark.createDataFrame(data, columns)
df.show()

Output:

+----------+---+
|first_name|age|
+----------+---+
|   John   | 35|
|      Jane| 28|
+----------+---+

To update multiple columns, such as trimming whitespace from first_name and adding a new column age_after_10_years, the approach before Spark 3.3 was verbose and involved chaining withColumn:

new_df = df.withColumn("first_name", trim(col("first_name"))).withColumn(
    "age_after_10_years", col("age") + 10
)

new_df.show()

Output:

+----------+---+------------------+
|first_name|age|age_after_10_years|
+----------+---+------------------+
|      John| 35|                45|
|      Jane| 28|                38|
+----------+---+------------------+

This chaining is verbose and can be hard to read, especially for users accustomed to a more concise syntax.

Introduction to PySpark

PySpark is the Python API for Apache Spark, a distributed computing framework designed for big data processing. PySpark enables Python developers to work with Spark DataFrames, SQL, and streaming data, leveraging Spark’s scalability and performance. To install PySpark, use:

pip install -U "pyspark[sql]"

The withColumns method, introduced in PySpark 3.3, addresses the verbosity of updating multiple columns. It allows a dictionary-style syntax for defining multiple column transformations, making the process more concise and user-friendly. In this post, we will explore how to use withColumns.

Updating Multiple Columns with withColumns

To demonstrate withColumns, consider the same sample DataFrame. Using PySpark 3.3 or later, we can apply multiple column transformations in a single method call:

new_df = df.withColumns(
    {
        "first_name": trim(col("first_name")),
        "age_after_10_years": col("age") + 10,
    }
)

new_df.show()

Output:

+----------+---+------------------+
|first_name|age|age_after_10_years|
+----------+---+------------------+
|      John| 35|                45|
|      Jane| 28|                38|
+----------+---+------------------+

This approach is not only more concise but also easier to read and maintain. The dictionary-style syntax closely resembles pandas operations, making it intuitive for users familiar with pandas DataFrames.

Conclusion

The introduction of withColumns in PySpark 3.3 simplifies updating multiple columns in a DataFrame. By allowing a dictionary-style syntax, it reduces verbosity, improves readability, and provides a more intuitive experience for pandas users. This enhancement demonstrates PySpark’s commitment to usability and developer productivity.



Work with Khuyen Tran
