Update Multiple Columns in Spark 3.3 and Later


Motivation

Before Spark 3.3, updating multiple columns in a Spark DataFrame was cumbersome and verbose, requiring a separate chained withColumn call for each column. This made the process less intuitive, especially for users transitioning from pandas.

For example, consider the following DataFrame:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, trim

spark = SparkSession.builder.getOrCreate()

data = [("   John   ", 35), ("Jane", 28)]
columns = ["first_name", "age"]
df = spark.createDataFrame(data, columns)
df.show()

Output:

+----------+---+
|first_name|age|
+----------+---+
|   John   | 35|
|      Jane| 28|
+----------+---+

To update multiple columns, such as trimming whitespace from first_name and adding a new column age_after_10_years, the approach before Spark 3.3 was verbose and involved chaining withColumn:

new_df = df.withColumn("first_name", trim(col("first_name"))).withColumn(
    "age_after_10_years", col("age") + 10
)

new_df.show()

Output:

+----------+---+------------------+
|first_name|age|age_after_10_years|
+----------+---+------------------+
|      John| 35|                45|
|      Jane| 28|                38|
+----------+---+------------------+

This chaining is verbose and can be hard to read, especially for users accustomed to a more concise syntax.

Introduction to PySpark

PySpark is the Python API for Apache Spark, a distributed computing framework designed for big data processing. PySpark enables Python developers to work with Spark DataFrames, SQL, and streaming data, leveraging Spark’s scalability and performance. To install PySpark, use:

pip install -U "pyspark[sql]"

The withColumns method, introduced in PySpark 3.3, addresses the verbosity of updating multiple columns. It allows a dictionary-style syntax for defining multiple column transformations, making the process more concise and user-friendly. In this post, we will explore how to use withColumns.

Updating Multiple Columns with withColumns

To demonstrate withColumns, consider the same sample DataFrame. Using PySpark 3.3 or later, we can apply multiple column transformations in a single method call:

new_df = df.withColumns(
    {
        "first_name": trim(col("first_name")),
        "age_after_10_years": col("age") + 10,
    }
)

new_df.show()

Output:

+----------+---+------------------+
|first_name|age|age_after_10_years|
+----------+---+------------------+
|      John| 35|                45|
|      Jane| 28|                38|
+----------+---+------------------+

This approach is not only more concise but also easier to read and maintain. The dictionary-style syntax closely resembles pandas operations, making it intuitive for users familiar with pandas DataFrames.

Conclusion

The introduction of withColumns in PySpark 3.3 simplifies updating multiple columns in a DataFrame. By allowing a dictionary-style syntax, it reduces verbosity, improves readability, and provides a more intuitive experience for pandas users. This enhancement demonstrates PySpark’s commitment to usability and developer productivity.



Work with Khuyen Tran
