PySpark DataFrame Transformations: select vs withColumn

PySpark’s select and withColumn can both add or modify columns. However, they differ in which columns the resulting DataFrame keeps.

Let’s explore these differences with a practical example. First, let’s create a sample DataFrame:

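from pyspark.sql.functions import col, upper

# Assumes an active SparkSession is available as `spark`
data = [
    ("Alice", 28, "New York"),
    ("Bob", 35, "San Francisco"),
    ("Charlie", 42, "Los Angeles"),
    ("Diana", 31, "Chicago"),
]
df = spark.createDataFrame(data, ["name", "age", "city"])
df.show()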


|   name|age|         city|
|  Alice| 28|     New York|
|    Bob| 35|San Francisco|
|Charlie| 42|  Los Angeles|
|  Diana| 31|      Chicago|

Using Select

select keeps only the specified columns:

df_select = df.select("name", upper(col("city")).alias("upper_city"))
df_select.show()

|   name|   upper_city|
|  Alice|     NEW YORK|
|    Bob|SAN FRANCISCO|
|Charlie|  LOS ANGELES|
|  Diana|      CHICAGO|

Using withColumn

withColumn retains all original columns plus the new/modified one:

df_withColumn = df.withColumn("upper_city", upper(col("city")))
df_withColumn.show()

|   name|age|         city|   upper_city|
|  Alice| 28|     New York|     NEW YORK|
|    Bob| 35|San Francisco|SAN FRANCISCO|
|Charlie| 42|  Los Angeles|  LOS ANGELES|
|  Diana| 31|      Chicago|      CHICAGO|

Key Takeaway

• Use select for column subset selection or major DataFrame reshaping.
• Use withColumn for incremental column additions or modifications.

Work with Khuyen Tran