PySpark’s select and withColumn can both be used to add or modify columns, but their behavior is different. Let’s explore these differences with a practical example. First, let’s create a sample DataFrame:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, upper

# Create (or reuse) a SparkSession.
spark = SparkSession.builder.getOrCreate()

data = [
    ("Alice", 28, "New York"),
    ("Bob", 35, "San Francisco"),
]
df = spark.createDataFrame(data, ["name", "age", "city"])
df.show()
Output:
+-----+---+-------------+
| name|age|         city|
+-----+---+-------------+
|Alice| 28|     New York|
|  Bob| 35|San Francisco|
+-----+---+-------------+
Using select
select keeps only the specified columns:
df_select = df.select("name", upper(col("city")).alias("upper_city"))
df_select.show()
Output:
+-----+-------------+
| name|   upper_city|
+-----+-------------+
|Alice|     NEW YORK|
|  Bob|SAN FRANCISCO|
+-----+-------------+
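Note that select can also reproduce the keep-everything behavior of withColumn by passing "*" alongside the new expression. A minimal sketch (df_select_all is just an illustrative name):
# Keep all original columns and add the derived column via select("*", ...).
df_select_all = df.select("*", upper(col("city")).alias("upper_city"))
df_select_all.show()  # same result as the withColumn example below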
Using withColumn
withColumn retains all original columns plus the new/modified one:
df_withColumn = df.withColumn("upper_city", upper(col("city")))
df_withColumn.show()
Output:
+-----+---+-------------+-------------+
| name|age|         city|   upper_city|
+-----+---+-------------+-------------+
|Alice| 28|     New York|     NEW YORK|
|  Bob| 35|San Francisco|SAN FRANCISCO|
+-----+---+-------------+-------------+
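If you need several derived columns at once, Spark 3.3+ also offers withColumns, which accepts a dict mapping column names to expressions. A minimal sketch, where age_plus_one is just an illustrative extra column:
# Spark 3.3+: add multiple derived columns in a single call.
df_multi = df.withColumns({
    "upper_city": upper(col("city")),
    "age_plus_one": col("age") + 1,  # illustrative extra column
})
df_multi.show()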
Key Takeaway
- Use select for column subset selection or major DataFrame reshaping.
- Use withColumn for incremental column additions or modifications.
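One practical caveat: each withColumn call adds a projection to the query plan, so adding many columns in a loop can produce large plans; the PySpark docs recommend a single select (or withColumns) in that case. A sketch contrasting the two approaches, assuming a hypothetical list of string columns to uppercase:
# Anti-pattern: one projection per loop iteration; the plan grows with each call.
df_loop = df
for c in ["name", "city"]:  # hypothetical column list
    df_loop = df_loop.withColumn(c + "_upper", upper(col(c)))

# Preferred: build all expressions up front and project once.
df_once = df.select("*", *[upper(col(c)).alias(c + "_upper") for c in ["name", "city"]])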