Optimizing PySpark Queries with Nested Data Structures

Hierarchical relationships in data often lead to complex queries, reduced readability, and performance overhead from repeated join operations. In this blog post, we’ll explore how PySpark’s nested StructType can keep related data together in a single DataFrame and reduce the need for multiple joins.

The Traditional Way: Separate DataFrames

Traditionally, we would create separate DataFrames for each table and then join them later. Here’s an example:

from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType, FloatType
)

spark = SparkSession.builder.getOrCreate()

# Define schemas
customer_schema = StructType([
    StructField("customer_id", IntegerType()),
    StructField("first_name", StringType()),
    StructField("last_name", StringType())
])

order_schema = StructType([
    StructField("order_id", IntegerType()),
    StructField("customer_id", IntegerType()),
    StructField("order_date", StringType()),
    StructField("total", FloatType())
])

# Create DataFrames using the schemas defined above
customer_df = spark.createDataFrame(
    [(1, "John", "Doe"),
     (2, "Emma", "Wilson"),
     (3, "Michael", "Brown")],
    customer_schema
)

order_df = spark.createDataFrame(
    [(1, 1, "2023-01-01", 100.0),
     (2, 2, "2023-01-02", 150.0),
     (3, 3, "2023-01-03", 200.0)],
    order_schema
)

# Join customer and order data
result_df = customer_df.join(order_df, "customer_id")
result_df.show()

Output:

+-----------+----------+---------+--------+----------+-----+
|customer_id|first_name|last_name|order_id|order_date|total|
+-----------+----------+---------+--------+----------+-----+
|          1|      John|      Doe|       1|2023-01-01|100.0|
|          2|      Emma|   Wilson|       2|2023-01-02|150.0|
|          3|   Michael|    Brown|       3|2023-01-03|200.0|
+-----------+----------+---------+--------+----------+-----+
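
This works for a single relationship, but every additional related table adds another join. As a purely hypothetical illustration, suppose line items lived in a third table (item_df and its columns are invented for this sketch):

# Hypothetical third table (invented for illustration): one line item per order
item_df = spark.createDataFrame(
    [(1, "Widget", 2), (2, "Gadget", 1), (3, "Gizmo", 4)],
    ["order_id", "product", "quantity"]
)

# Every extra related table costs another join
full_df = customer_df.join(order_df, "customer_id").join(item_df, "order_id")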

The Better Way: Nested StructType

PySpark’s nested StructType allows us to maintain data hierarchy in a single DataFrame, reducing the need for multiple joins and simplifying queries. Here’s an example:

# Define nested schema
nested_order_schema = StructType([
    StructField("order_id", IntegerType()),
    StructField("customer",
        StructType([
            StructField("customer_id", IntegerType()),
            StructField("first_name", StringType()),
            StructField("last_name", StringType()),
        ])
    ),
    StructField("order_info",
        StructType([
            StructField("order_date", StringType()),
            StructField("total", FloatType()),
        ])
    ),
])

# Sample data with nested structure
data = [
    (1, (1, "John", "Doe"), ("2023-01-01", 100.0)),
    (2, (2, "Emma", "Wilson"), ("2023-01-02", 150.0)),
    (3, (3, "Michael", "Brown"), ("2023-01-03", 200.0)),
]

df = spark.createDataFrame(data, nested_order_schema)
df.show(truncate=False)

Output:

+--------+-------------------+-------------------+
|order_id|customer           |order_info         |
+--------+-------------------+-------------------+
|1       |{1, John, Doe}     |{2023-01-01, 100.0}|
|2       |{2, Emma, Wilson}  |{2023-01-02, 150.0}|
|3       |{3, Michael, Brown}|{2023-01-03, 200.0}|
+--------+-------------------+-------------------+
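
The hierarchy is also visible in the schema itself:

df.printSchema()

Output:

root
 |-- order_id: integer (nullable = true)
 |-- customer: struct (nullable = true)
 |    |-- customer_id: integer (nullable = true)
 |    |-- first_name: string (nullable = true)
 |    |-- last_name: string (nullable = true)
 |-- order_info: struct (nullable = true)
 |    |-- order_date: string (nullable = true)
 |    |-- total: float (nullable = true)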

With this approach, we can access all related data in one place using intuitive dot notation.

# Select specific nested fields
df.select("customer.first_name", "customer.last_name", "order_info.total").show()

Output:

+----------+---------+-----+
|first_name|last_name|total|
+----------+---------+-----+
|      John|      Doe|100.0|
|      Emma|   Wilson|150.0|
|   Michael|    Brown|200.0|
+----------+---------+-----+
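
Dot notation also supports a wildcard, so a nested struct can be expanded back into top-level columns whenever a flat layout is needed:

# Expand every field of the customer struct into top-level columns
df.select("order_id", "customer.*").show()

Output:

+--------+-----------+----------+---------+
|order_id|customer_id|first_name|last_name|
+--------+-----------+----------+---------+
|       1|          1|      John|      Doe|
|       2|          2|      Emma|   Wilson|
|       3|          3|   Michael|    Brown|
+--------+-----------+----------+---------+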

# Filter orders of $150 or more
df.where("order_info.total >= 150").show()

Output:

+--------+-------------------+-------------------+
|order_id|           customer|         order_info|
+--------+-------------------+-------------------+
|       2|  {2, Emma, Wilson}|{2023-01-02, 150.0}|
|       3|{3, Michael, Brown}|{2023-01-03, 200.0}|
+--------+-------------------+-------------------+
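
Nested input isn’t required to get here, either. As a minimal sketch (assuming the flat customer_df and order_df from the first example are still in scope), struct() can pack a joined result into the same nested shape, paying for the join once at write time instead of on every query:

from pyspark.sql import functions as F

# Join once, then pack related columns into structs
nested_df = customer_df.join(order_df, "customer_id").select(
    "order_id",
    F.struct("customer_id", "first_name", "last_name").alias("customer"),
    F.struct("order_date", "total").alias("order_info"),
)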

Installation

To use PySpark, you can install it using pip:

pip install -U 'pyspark[sql]'

By using PySpark’s nested StructType, we can simplify hierarchical data and reduce the need for multiple joins, making our queries more efficient and easier to maintain.
