PySpark DataFrame Transformations: select vs withColumn

Khuyen Tran

PySpark’s select and withColumn can both add or modify columns, but they behave differently.

Let’s explore these differences with a practical example. First, let’s create a sample DataFrame:

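from pyspark.sql import SparkSession
from pyspark.sql.functions import col, upper

# Create (or reuse) the Spark session that the rest of the example calls `spark`
spark = SparkSession.builder.getOrCreate()

# Two sample rows with name, age, and city columns
data = [
    ("Alice", 28, "New York"),
    ("Bob", 35, "San Francisco"),
]
df = spark.createDataFrame(data, ["name", "age", "city"])
df.show()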

Output:

+-----+---+-------------+
| name|age|         city|
+-----+---+-------------+
|Alice| 28|     New York|
|  Bob| 35|San Francisco|
+-----+---+-------------+

Using select

select keeps only the specified columns:

df_select = df.select("name", upper(col("city")).alias("upper_city"))
df_select.show()

Output:

+-----+-------------+
| name|   upper_city|
+-----+-------------+
|Alice|     NEW YORK|
|  Bob|SAN FRANCISCO|
+-----+-------------+

Using withColumn

withColumn keeps all original columns and adds the new column (or replaces an existing column that has the same name):

df_withColumn = df.withColumn('upper_city', upper(col('city')))
df_withColumn.show()

Output:

+-----+---+-------------+-------------+
| name|age|         city|   upper_city|
+-----+---+-------------+-------------+
|Alice| 28|     New York|     NEW YORK|
|  Bob| 35|San Francisco|SAN FRANCISCO|
+-----+---+-------------+-------------+

Key Takeaway

Use select for column subset selection or major DataFrame reshaping.
Use withColumn for incremental column additions or modifications.
