Optimizing PySpark Queries: DataFrame API or SQL?

PySpark queries written with different syntax (the DataFrame API or parameterized SQL) can perform identically, because Spark compiles both down to the same physical plan. Here is an example:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Create (or reuse) a SparkSession; in many notebooks and shells one named `spark` already exists
spark = SparkSession.builder.getOrCreate()

fruits = spark.createDataFrame(
    [("apple", 4), ("orange", 3), ("banana", 2)], ["item", "price"]
)
fruits.show()
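The show() call prints the small example table, which should look roughly like this:

+------+-----+
|  item|price|
+------+-----+
| apple|    4|
|orange|    3|
|banana|    2|
+------+-----+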

Use the DataFrame API to filter rows where the price is greater than 3, and call explain() to inspect the physical plan.

fruits.where(col("price") > 3).explain()
== Physical Plan ==
*(1) Filter (isnotnull(price#1L) AND (price#1L > 3))
+- *(1) Scan ExistingRDD[item#0,price#1L]

Now use the spark.sql() method to execute an equivalent parameterized SQL query, passing the DataFrame in through the {df} placeholder, and inspect its plan.

spark.sql("select * from {df} where price > 3", df=fruits).explain()
== Physical Plan ==
*(1) Filter (isnotnull(price#1L) AND (price#1L > 3))
+- *(1) Scan ExistingRDD[item#0,price#1L]
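
The {df} placeholder substitution in spark.sql() requires a reasonably recent PySpark release. If it is not available in your environment, a minimal workaround (the view name "fruits" below is just a name chosen for illustration) is to register the DataFrame as a temporary view and query it by name:

# Fallback if {df} substitution is unavailable in your PySpark version:
# register the DataFrame as a temporary view and reference it by name.
fruits.createOrReplaceTempView("fruits")
spark.sql("select * from fruits where price > 3").explain()

This should produce the same filter-over-scan physical plan as the two queries above.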

The physical plan for both queries is identical, so they perform the same: the syntax you write the query in does not change how Spark executes it.

The choice between the DataFrame API and spark.sql() therefore comes down to the following:

  • Familiarity: Use spark.sql() if your team prefers SQL syntax. Use the DataFrame API if chained method calls are more intuitive for your team.
  • Complexity of Transformations: The DataFrame API is more flexible for complex, multi-step manipulations, while SQL is often more concise for straightforward queries, as the sketch after this list illustrates.
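
As an illustrative sketch only (the derived tier column and the price threshold are made up for this comparison, not part of the original example), here is the same multi-step transformation written both ways:

from pyspark.sql.functions import col, when

# DataFrame API: build the query step by step with chained method calls
(fruits
    .withColumn("tier", when(col("price") >= 3, "premium").otherwise("budget"))
    .where(col("tier") == "premium")
    .orderBy(col("price").desc())
    .show())

# Equivalent SQL: the same logic expressed as a single declarative statement
spark.sql(
    """
    select item, price,
           case when price >= 3 then 'premium' else 'budget' end as tier
    from {df}
    where price >= 3
    order by price desc
    """,
    df=fruits,
).show()

Both versions should return apple and orange with a tier of premium; which one reads better is largely a matter of team preference.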

Run this code in Google Colab.
