Optimizing PySpark Queries with Nested Data Structures

Hierarchical relationships in data often lead to complex queries, reduced readability, and performance overhead from repeated join operations. In this blog post, we’ll explore how PySpark’s nested StructType can keep related data together in a single DataFrame and reduce the need for multiple joins.

The Traditional Way: Separate DataFrames

Traditionally, we would create separate DataFrames for each table and then join them later. Here’s an example:

from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType, FloatType
)

spark = SparkSession.builder.getOrCreate()

# Define schemas
customer_schema = StructType([
    StructField("customer_id", IntegerType()),
    StructField("first_name", StringType()),
    StructField("last_name", StringType())
])

order_schema = StructType([
    StructField("order_id", IntegerType()),
    StructField("customer_id", IntegerType()),
    StructField("order_date", StringType()),
    StructField("total", FloatType())
])

# Create DataFrames using the schemas defined above
customer_df = spark.createDataFrame(
    [(1, "John", "Doe"),
     (2, "Emma", "Wilson"),
     (3, "Michael", "Brown")],
    customer_schema
)

order_df = spark.createDataFrame(
    [(1, 1, "2023-01-01", 100.0),
     (2, 2, "2023-01-02", 150.0),
     (3, 3, "2023-01-03", 200.0)],
    order_schema
)

# Join customer and order data
result_df = customer_df.join(order_df, "customer_id")
result_df.show()

Output:

+-----------+----------+---------+--------+----------+-----+
|customer_id|first_name|last_name|order_id|order_date|total|
+-----------+----------+---------+--------+----------+-----+
|          1|      John|      Doe|       1|2023-01-01|100.0|
|          2|      Emma|   Wilson|       2|2023-01-02|150.0|
|          3|   Michael|    Brown|       3|2023-01-03|200.0|
+-----------+----------+---------+--------+----------+-----+
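
This works for a single relationship, but every additional related table adds another join. As a purely hypothetical illustration, suppose line items lived in a third table (item_df and its columns are invented for this sketch):

# Hypothetical third table (invented for illustration): one line item per order
item_df = spark.createDataFrame(
    [(1, "Widget", 2), (2, "Gadget", 1), (3, "Gizmo", 4)],
    ["order_id", "product", "quantity"]
)

# Every extra related table costs another join
full_df = customer_df.join(order_df, "customer_id").join(item_df, "order_id")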

The Better Way: Nested StructType

PySpark’s nested StructType allows us to maintain data hierarchy in a single DataFrame, reducing the need for multiple joins and simplifying queries. Here’s an example:

# Define nested schema
nested_order_schema = StructType([
    StructField("order_id", IntegerType()),
    StructField("customer",
        StructType([
            StructField("customer_id", IntegerType()),
            StructField("first_name", StringType()),
            StructField("last_name", StringType()),
        ])
    ),
    StructField("order_info",
        StructType([
            StructField("order_date", StringType()),
            StructField("total", FloatType()),
        ])
    ),
])

# Sample data with nested structure
data = [
    (1, (1, "John", "Doe"), ("2023-01-01", 100.0)),
    (2, (2, "Emma", "Wilson"), ("2023-01-02", 150.0)),
    (3, (3, "Michael", "Brown"), ("2023-01-03", 200.0)),
]

df = spark.createDataFrame(data, nested_order_schema)
df.show(truncate=False)

Output:

+--------+-------------------+-------------------+
|order_id|customer           |order_info         |
+--------+-------------------+-------------------+
|1       |{1, John, Doe}     |{2023-01-01, 100.0}|
|2       |{2, Emma, Wilson}  |{2023-01-02, 150.0}|
|3       |{3, Michael, Brown}|{2023-01-03, 200.0}|
+--------+-------------------+-------------------+
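
The hierarchy is also visible in the schema itself:

df.printSchema()

Output:

root
 |-- order_id: integer (nullable = true)
 |-- customer: struct (nullable = true)
 |    |-- customer_id: integer (nullable = true)
 |    |-- first_name: string (nullable = true)
 |    |-- last_name: string (nullable = true)
 |-- order_info: struct (nullable = true)
 |    |-- order_date: string (nullable = true)
 |    |-- total: float (nullable = true)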

With this approach, we can access all related data in one place using intuitive dot notation.

# Select specific nested fields
df.select("customer.first_name", "customer.last_name", "order_info.total").show()

Output:

+----------+---------+-----+
|first_name|last_name|total|
+----------+---------+-----+
|      John|      Doe|100.0|
|      Emma|   Wilson|150.0|
|   Michael|    Brown|200.0|
+----------+---------+-----+
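
Dot notation also supports a wildcard, so a nested struct can be expanded back into top-level columns whenever a flat layout is needed:

# Expand every field of the customer struct into top-level columns
df.select("order_id", "customer.*").show()

Output:

+--------+-----------+----------+---------+
|order_id|customer_id|first_name|last_name|
+--------+-----------+----------+---------+
|       1|          1|      John|      Doe|
|       2|          2|      Emma|   Wilson|
|       3|          3|   Michael|    Brown|
+--------+-----------+----------+---------+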

# Filter orders of $150 or more
df.where("order_info.total >= 150").show()

Output:

+--------+-------------------+-------------------+
|order_id|           customer|         order_info|
+--------+-------------------+-------------------+
|       2|  {2, Emma, Wilson}|{2023-01-02, 150.0}|
|       3|{3, Michael, Brown}|{2023-01-03, 200.0}|
+--------+-------------------+-------------------+
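
Nested input isn’t required to get here, either. As a minimal sketch (assuming the flat customer_df and order_df from the first example are still in scope), struct() can pack a joined result into the same nested shape, paying for the join once at write time instead of on every query:

from pyspark.sql import functions as F

# Join once, then pack related columns into structs
nested_df = customer_df.join(order_df, "customer_id").select(
    "order_id",
    F.struct("customer_id", "first_name", "last_name").alias("customer"),
    F.struct("order_date", "total").alias("order_info"),
)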

Installation

To use PySpark, you can install it using pip:

pip install -U 'pyspark[sql]'

By using PySpark’s nested StructType, we can simplify hierarchical data and reduce the need for multiple joins, making our queries more efficient and easier to maintain.
