PySpark DataFrame Transformations: select vs withColumn

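PySpark’s select and withColumn can both be used to add or modify columns. However, they behave differently: select returns only the columns you list, while withColumn returns every original column plus the one you add or change.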

Let’s explore these differences with a practical example. First, let’s create a sample DataFrame:

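from pyspark.sql import SparkSession
from pyspark.sql.functions import col, upper

# Reuse the active session if one exists, otherwise start a new one
spark = SparkSession.builder.getOrCreate()

data = [
    ("Alice", 28, "New York"),
    ("Bob", 35, "San Francisco"),
    ("Charlie", 42, "Los Angeles"),
    ("Diana", 31, "Chicago"),
]
df = spark.createDataFrame(data, ["name", "age", "city"])
df.show()

Output:

+-------+---+-------------+
|   name|age|         city|
+-------+---+-------------+
|  Alice| 28|     New York|
|    Bob| 35|San Francisco|
|Charlie| 42|  Los Angeles|
|  Diana| 31|      Chicago|
+-------+---+-------------+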

Using select

select keeps only the specified columns:

df_select = df.select("name", upper(col("city")).alias("upper_city"))
df_select.show()

Output:

+-------+-------------+
|   name|   upper_city|
+-------+-------------+
|  Alice|     NEW YORK|
|    Bob|SAN FRANCISCO|
|Charlie|  LOS ANGELES|
|  Diana|      CHICAGO|
+-------+-------------+

Using withColumn

withColumn retains all original columns plus the new/modified one:

df_withColumn = df.withColumn("upper_city", upper(col("city")))
df_withColumn.show()

Output:

+-------+---+-------------+-------------+
|   name|age|         city|   upper_city|
+-------+---+-------------+-------------+
|  Alice| 28|     New York|     NEW YORK|
|    Bob| 35|San Francisco|SAN FRANCISCO|
|Charlie| 42|  Los Angeles|  LOS ANGELES|
|  Diana| 31|      Chicago|      CHICAGO|
+-------+---+-------------+-------------+

Key Takeaway

• Use select for column subset selection or major DataFrame reshaping.
• Use withColumn for incremental column additions or modifications.

      Work with Khuyen Tran
