PySpark 4.0: 4 Features That Change How You Process Data

Table of Contents

Introduction
From Pandas UDFs to Arrow UDFs: Next-Gen Performance
Native Data Visualization (PySpark 4.0+)
Schema-Free JSON Handling with Variant Type (PySpark 4.0+)
Dynamic Schema Generation with UDTF analyze() (PySpark 4.0+)
Conclusion

Introduction
PySpark 4.0 introduces transformative improvements that enhance performance, streamline workflows, and enable flexible data transformations in distributed processing.
This release delivers four key enhancements:

Arrow-optimized UDFs accelerate custom transformations by operating directly on Arrow data structures, eliminating the serialization overhead of Pandas UDFs.
Native Plotly visualization enables direct DataFrame plotting without conversion, streamlining exploratory data analysis and reducing memory overhead.
Variant type for JSON enables schema-free JSON querying with JSONPath syntax, eliminating verbose StructType definitions for nested data.
Dynamic schema UDTFs adapt output columns to match input data at runtime, enabling flexible pivot tables and aggregations where column structure depends on data values.

For comprehensive coverage of core PySpark SQL functionality, see the Complete Guide to PySpark SQL.


From Pandas UDFs to Arrow UDFs: Next-Gen Performance
The pandas_udf function requires converting Arrow data to Pandas format and back again for each operation. This serialization cost becomes significant when processing large datasets.
PySpark 3.5+ introduces Arrow-optimized UDFs via the useArrow=True parameter. These UDFs operate directly on Arrow data structures, skipping the Pandas conversion entirely and improving performance.
Let’s compare the performance with a weighted sum calculation across multiple columns on 100,000 rows:
import pandas as pd
import pyarrow.compute as pc
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("UDFComparison").getOrCreate()

# Create test data with multiple numeric columns
data = [(float(i), float(i*2), float(i*3)) for i in range(100000)]
df = spark.createDataFrame(data, ["val1", "val2", "val3"])

Create a timing decorator to measure the execution time of the functions:
import time
from functools import wraps

# Timing decorator
def timer(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        elapsed = time.time() - start
        print(f"{func.__name__}: {elapsed:.2f}s")
        wrapper.elapsed_time = elapsed
        return result

    return wrapper

Use the timing decorator to measure the execution time of the pandas_udf function:
@pandas_udf(DoubleType())
def weighted_sum_pandas(v1: pd.Series, v2: pd.Series, v3: pd.Series) -> pd.Series:
    return v1 * 0.5 + v2 * 0.3 + v3 * 0.2

@timer
def run_pandas_udf():
    result = df.select(
        weighted_sum_pandas(df.val1, df.val2, df.val3).alias("weighted")
    )
    result.count()  # Trigger computation
    return result

result_pandas = run_pandas_udf()
pandas_time = run_pandas_udf.elapsed_time

run_pandas_udf: 1.33s

Use the timing decorator to measure the execution time of the Arrow-optimized UDF using useArrow:
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

@udf(DoubleType(), useArrow=True)
def weighted_sum_arrow(v1, v2, v3):
    term1 = pc.multiply(v1, 0.5)
    term2 = pc.multiply(v2, 0.3)
    term3 = pc.multiply(v3, 0.2)
    return pc.add(pc.add(term1, term2), term3)

@timer
def run_arrow_udf():
    result = df.select(
        weighted_sum_arrow(df.val1, df.val2, df.val3).alias("weighted")
    )
    result.count()  # Trigger computation
    return result

result_arrow = run_arrow_udf()
arrow_time = run_arrow_udf.elapsed_time

run_arrow_udf: 0.43s

Measure the speedup:
speedup = pandas_time / arrow_time
print(f"Speedup: {speedup:.2f}x faster")

Speedup: 3.06x faster

The output shows that the Arrow-optimized version is 3.06x faster than the pandas_udf version!
The performance gain comes from avoiding serialization. Arrow-optimized UDFs use PyArrow compute functions like pc.multiply() and pc.add() directly on Arrow data, while pandas_udf must convert each column to Pandas and back.
Trade-off: the 3.06x improvement comes at the cost of using PyArrow’s less familiar compute API instead of Pandas operations. The gain becomes increasingly valuable as dataset size and column count grow.
Native Data Visualization (PySpark 4.0+)
Visualizing PySpark DataFrames traditionally requires converting to Pandas first, then using external libraries like matplotlib or plotly. This adds memory overhead and extra processing steps.
PySpark 4.0 introduces a native plotting API powered by Plotly, enabling direct visualization from PySpark DataFrames without any conversion.
Let’s visualize sales data across product categories:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Visualization").getOrCreate()

# Create sample sales data
sales_data = [
    ("Electronics", 5000, 1200),
    ("Electronics", 7000, 1800),
    ("Clothing", 3000, 800),
    ("Clothing", 4500, 1100),
    ("Furniture", 6000, 1500),
    ("Furniture", 8000, 2000),
]

sales_df = spark.createDataFrame(sales_data, ["category", "sales", "profit"])
sales_df.show()

+-----------+-----+------+
|   category|sales|profit|
+-----------+-----+------+
|Electronics| 5000|  1200|
|Electronics| 7000|  1800|
|   Clothing| 3000|   800|
|   Clothing| 4500|  1100|
|  Furniture| 6000|  1500|
|  Furniture| 8000|  2000|
+-----------+-----+------+

Create a scatter plot directly from the PySpark DataFrame using the .plot() method:
# Direct plotting without conversion
sales_df.plot(kind="scatter", x="sales", y="profit", color="category")

You can also use shorthand methods such as plot.scatter() and plot.bar() for specific chart types:
# Scatter plot with shorthand
sales_df.plot.scatter(x="sales", y="profit", color="category")

# Bar chart by category
category_totals = sales_df.groupBy("category").agg({"sales": "sum"}).withColumnRenamed("sum(sales)", "total_sales")
category_totals.plot.bar(x="category", y="total_sales")

The native plotting API supports 8 chart types:

scatter: Scatter plots with color grouping
bar: Bar charts for categorical comparisons
line: Line plots for time series
area: Area charts for cumulative values
pie: Pie charts for proportions
box: Box plots for distributions
histogram: Histograms for frequency analysis
kde/density: Density plots for probability distributions

By default, PySpark visualizes up to 1,000 rows. For larger datasets, configure the limit:
# Increase visualization row limit
spark.conf.set("spark.sql.pyspark.plotting.max_rows", 5000)

Schema-Free JSON Handling with Variant Type (PySpark 4.0+)
Extracting nested JSON in PySpark requires defining StructType schemas that mirror your data structure. This creates verbose code that breaks whenever your JSON changes.
Let’s extract data from a 3-level nested JSON structure using the traditional approach:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("json_schema").getOrCreate()

# 3 levels of nested StructType – verbose and hard to maintain
schema = StructType([
    StructField("user", StructType([
        StructField("name", StringType()),
        StructField("profile", StructType([
            StructField("settings", StructType([
                StructField("theme", StringType())
            ]))
        ]))
    ]))
])

json_data = [
    '{"user": {"name": "Alice", "profile": {"settings": {"theme": "dark"}}}}',
    '{"user": {"name": "Bob", "profile": {"settings": {"theme": "light"}}}}'
]

rdd = spark.sparkContext.parallelize(json_data)
df = spark.read.schema(schema).json(rdd)
df.select("user.name", "user.profile.settings.theme").show()

+-----+-----+
| name|theme|
+-----+-----+
|Alice| dark|
|  Bob|light|
+-----+-----+

PySpark 4.0 introduces the Variant type, which lets you skip schema definitions entirely. To work with the Variant type:

Use parse_json() to load JSON data
Use variant_get() to extract fields with JSONPath syntax

from pyspark.sql import SparkSession
from pyspark.sql.functions import parse_json, variant_get

spark = SparkSession.builder.appName("json_variant").getOrCreate()

json_data = [
    ('{"user": {"name": "Alice", "profile": {"settings": {"theme": "dark"}}}}',),
    ('{"user": {"name": "Bob", "profile": {"settings": {"theme": "light"}}}}',)
]

df = spark.createDataFrame(json_data, ["json_str"])
df_variant = df.select(parse_json("json_str").alias("data"))

# No schema needed - just use JSONPath
result = df_variant.select(
    variant_get("data", "$.user.name", "string").alias("name"),
    variant_get("data", "$.user.profile.settings.theme", "string").alias("theme")
)
result.show()

+-----+-----+
| name|theme|
+-----+-----+
|Alice| dark|
|  Bob|light|
+-----+-----+

The Variant type provides several advantages:

No upfront schema definition: Handle any JSON structure without verbose StructType definitions
JSONPath syntax: Access nested paths using $.path.to.field notation regardless of depth
Schema flexibility: JSON structure changes don’t break your code
Type safety: variant_get() lets you specify the expected type when extracting fields
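Conceptually, the path lookup that variant_get() performs resembles a walk through nested dictionaries. The sketch below illustrates the idea in plain Python with a toy get_path helper; this is a hypothetical illustration, not PySpark's implementation:

```python
import json

def get_path(doc: dict, path: str):
    """Toy illustration of a JSONPath-style lookup like '$.user.name'.

    This mimics the idea behind variant_get(); it is NOT PySpark's code.
    """
    node = doc
    # Strip the leading "$." and descend one key at a time
    for key in path.lstrip("$.").split("."):
        node = node[key]
    return node

doc = json.loads(
    '{"user": {"name": "Alice", "profile": {"settings": {"theme": "dark"}}}}'
)
print(get_path(doc, "$.user.name"))                    # Alice
print(get_path(doc, "$.user.profile.settings.theme"))  # dark
```

The key property is that the path is just a string, so a deeper or reshuffled JSON structure only changes the path argument, never a schema definition.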

Dynamic Schema Generation with UDTF analyze() (PySpark 4.0+)
Python UDTFs (User-Defined Table Functions) generate multiple rows from a single input row, but they come with a critical limitation: you must define the output schema upfront. When your output columns depend on the input data itself (like creating pivot tables or dynamic aggregations where column names come from data values), this rigid schema requirement becomes a problem.
For example, a word-counting UDTF requires you to specify all output columns upfront, even though the words themselves are unknown until runtime.
from pyspark.sql.functions import udtf, lit
from pyspark.sql.types import StructType, StructField, IntegerType

# Schema must be defined upfront with fixed column names
@udtf(returnType=StructType([
    StructField("hello", IntegerType()),
    StructField("world", IntegerType()),
    StructField("spark", IntegerType())
]))
class StaticWordCountUDTF:
    def eval(self, text: str):
        words = text.split(" ")
        yield tuple(words.count(word) for word in ["hello", "world", "spark"])

# Only works for exactly these three words
result = StaticWordCountUDTF(lit("hello world hello spark"))
result.show()

+-----+-----+-----+
|hello|world|spark|
+-----+-----+-----+
|    2|    1|    1|
+-----+-----+-----+

If the input text contains words outside the hardcoded set, the output still reports only the three predefined columns, silently ignoring the new words:
result = StaticWordCountUDTF(lit("hi world hello spark"))
result.show()

+-----+-----+-----+
|hello|world|spark|
+-----+-----+-----+
|    1|    1|    1|
+-----+-----+-----+

PySpark 4.0 introduces the analyze() method for UDTFs, enabling dynamic schema determination based on input data. Instead of hardcoding your output schema, analyze() inspects the input and generates the appropriate columns at runtime.
from pyspark.sql.functions import udtf, lit
from pyspark.sql.types import StructType, IntegerType
from pyspark.sql.udtf import AnalyzeArgument, AnalyzeResult

@udtf
class DynamicWordCountUDTF:
    @staticmethod
    def analyze(text: AnalyzeArgument) -> AnalyzeResult:
        """Dynamically create schema based on input text"""
        schema = StructType()
        # Create one column per unique word in the input
        for word in sorted(set(text.value.split(" "))):
            schema = schema.add(word, IntegerType())
        return AnalyzeResult(schema=schema)

    def eval(self, text: str):
        """Generate counts for each word"""
        words = text.split(" ")
        # Use the same logic as analyze() to determine column order
        unique_words = sorted(set(words))
        yield tuple(words.count(word) for word in unique_words)

# Schema adapts to any input text
result = DynamicWordCountUDTF(lit("hello world hello spark"))
result.show()

+-----+-----+-----+
|hello|spark|world|
+-----+-----+-----+
|    2|    1|    1|
+-----+-----+-----+

Now try with completely different words:
# Different words – schema adapts automatically
result2 = DynamicWordCountUDTF(lit("python data science"))
result2.show()

+----+------+-------+
|data|python|science|
+----+------+-------+
|   1|     1|      1|
+----+------+-------+

The columns change from hello, spark, world to data, python, science without any code modifications.
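Since analyze() and eval() must agree on column order, it helps that the shared counting logic is plain Python and can be checked without a Spark session. A minimal mirror of that logic (the word_counts helper below is illustrative, not part of the article's UDTF):

```python
def word_counts(text: str):
    """Mirror the UDTF logic: sorted unique words become the columns,
    and the row holds each word's count in matching order."""
    words = text.split(" ")
    unique_words = sorted(set(words))
    return unique_words, tuple(words.count(w) for w in unique_words)

cols, row = word_counts("hello world hello spark")
print(cols)  # ['hello', 'spark', 'world']
print(row)   # (2, 1, 1)
```

Keeping this logic in one testable function is a reasonable way to guarantee analyze() and eval() never drift apart.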
Conclusion
PySpark 4.0 makes distributed computing faster and easier to use. Arrow-optimized UDFs speed up custom transformations, the Variant type simplifies JSON handling, native visualization removes conversion steps, and dynamic UDTFs handle flexible data structures.
These improvements address real bottlenecks without requiring major code changes, making PySpark more practical for everyday data engineering tasks.

📚 Want to go deeper? Learning new techniques is the easy part. Knowing how to structure, test, and deploy them is what separates side projects from real work. My book shows you how to build data science projects that actually make it to production. Get the book →

5 Essential Itertools for Data Science

Table of Contents

Introduction
Feature Interactions: From Nested Loops to Combinations
Custom Code
With Itertools

Polynomial Features: From Manual Assignments to Automated Generation
Custom Code
With Itertools

Sequence Patterns: From Manual Tracking to Permutations
Custom Code
With Itertools

Cartesian Products: From Nested Loops to Product
Custom Code
With Itertools

Efficient Sampling: From Full Data Loading to Islice
Custom Code
With Itertools

Final Thoughts

Introduction
Imagine you write nested loops for combinatorial features and they work great initially. However, as your feature engineering scales, this custom code becomes buggy and nearly impossible to debug or extend.
for i in range(len(numerical_features)):
    for j in range(i + 1, len(numerical_features)):
        df[f"{numerical_features[i]}_x_{numerical_features[j]}"] = (
            df[numerical_features[i]] * df[numerical_features[j]]
        )

Itertools provides battle-tested, efficient functions that make data science code faster and more reliable. Here are the five most useful functions for data science projects:

combinations() – Generate unique pairs from lists without repetition
combinations_with_replacement() – Generate combinations including self-pairs
permutations() – Generate all possible orderings
product() – Create all possible pairings across multiple lists
islice() – Extract slices from iterators without loading full datasets


Key Takeaways
Here’s what you’ll learn:

Replace complex nested loops with battle-tested itertools functions for feature interactions
Generate polynomial features systematically using combinations_with_replacement
Create sequence patterns and categorical combinations without manual index management
Sample large datasets efficiently with islice to avoid memory waste
Eliminate feature engineering bugs with mathematically precise combinatorial functions

Setup
Before we dive into the examples, let’s set up the sample dataset.
import pandas as pd
from itertools import (
    combinations,
    combinations_with_replacement,
    product,
    islice,
    permutations,
)
import numpy as np

# Create simple sample dataset
np.random.seed(42)
data = {
    "age": np.random.randint(20, 65, 20),
    "income": np.random.randint(30000, 120000, 20),
    "experience": np.random.randint(0, 40, 20),
    "education_years": np.random.randint(12, 20, 20),
}
df = pd.DataFrame(data)
numerical_features = ["age", "income", "experience", "education_years"]
print(df.head())

   age  income  experience  education_years
0   58   94925           2               15
1   48   97969          36               13
2   34   35311           6               19
3   62  113104          20               15
4   27   83707           8               13

Feature Interactions: From Nested Loops to Combinations
Custom Code
Creating feature interactions manually requires careful index management to avoid duplicates and self-interactions. While possible, this approach becomes error-prone and complex as the number of features grows.
# Manual approach - proper nested loops with index management
df_manual = df.copy()

for i in range(len(numerical_features)):
    for j in range(i + 1, len(numerical_features)):
        feature1, feature2 = numerical_features[i], numerical_features[j]
        interaction_name = f"{feature1}_x_{feature2}"
        df_manual[interaction_name] = df_manual[feature1] * df_manual[feature2]

print(f"First few: {list(df_manual.columns[4:7])}")

First few: ['age_x_income', 'age_x_experience', 'age_x_education_years']

With Itertools
Use combinations() to generate unique pairs from a list without repetition or order dependency.
For example, combinations(['A','B','C'], 2) yields (A,B), (A,C), (B,C).

Let’s apply this to feature interactions:
# Automated approach with itertools.combinations
df_itertools = df.copy()

for feature1, feature2 in combinations(numerical_features, 2):
    interaction_name = f"{feature1}_x_{feature2}"
    df_itertools[interaction_name] = df_itertools[feature1] * df_itertools[feature2]

print(f"First few: {list(df_itertools.columns[4:7])}")

First few: ['age_x_income', 'age_x_experience', 'age_x_education_years']
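As a sanity check, combinations() should produce exactly C(n, 2) = n(n-1)/2 pairs, which math.comb confirms for the four features here:

```python
import math
from itertools import combinations

numerical_features = ["age", "income", "experience", "education_years"]

n_pairs = len(list(combinations(numerical_features, 2)))
print(n_pairs)                                 # 6
print(math.comb(len(numerical_features), 2))   # 6 = 4 * 3 / 2
```

This kind of count check is a cheap guard against off-by-one bugs when migrating away from hand-rolled loops.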

📚 For comprehensive production practices in data science, check out Production-Ready Data Science.

Polynomial Features: From Manual Assignments to Automated Generation
Custom Code
Creating polynomial features manually requires separate handling of squared terms and interaction terms, involving complex logic to generate all degree-2 polynomial combinations.
df_manual_poly = df.copy()

# Create squared features
for feature in numerical_features:
    df_manual_poly[f"{feature}_squared"] = df_manual_poly[feature] ** 2

# Create interaction features with list slicing
for i, feature1 in enumerate(numerical_features):
    for feature2 in numerical_features[i + 1:]:
        df_manual_poly[f"{feature1}_x_{feature2}"] = (
            df_manual_poly[feature1] * df_manual_poly[feature2]
        )

# Show polynomial features created
polynomial_features = list(df_manual_poly.columns[4:])
print(f"First few polynomial features: {polynomial_features[:6]}")

First few polynomial features: ['age_squared', 'income_squared', 'experience_squared', 'education_years_squared', 'age_x_income', 'age_x_experience']

With Itertools
Use combinations_with_replacement() to generate combinations where items can repeat.
For example, combinations_with_replacement(['A','B','C'], 2) yields (A,A), (A,B), (A,C), (B,B), (B,C), (C,C).

With this method, we can eliminate separate logic for squared terms and interaction terms in polynomial feature generation.
# Automated approach with combinations_with_replacement
df_poly = df.copy()

# Create features using the same combinations logic
for feature1, feature2 in combinations_with_replacement(numerical_features, 2):
    if feature1 == feature2:
        # Squared feature
        df_poly[f"{feature1}_squared"] = df_poly[feature1] ** 2
    else:
        # Interaction feature
        df_poly[f"{feature1}_x_{feature2}"] = df_poly[feature1] * df_poly[feature2]

# Show polynomial features created
polynomial_features = list(df_poly.columns[4:])
print(f"First few polynomial features: {polynomial_features[:6]}")

First few polynomial features: ['age_squared', 'age_x_income', 'age_x_experience', 'age_x_education_years', 'income_squared', 'income_x_experience']
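The number of degree-2 terms is also predictable: combinations_with_replacement() yields C(n+1, 2) items, i.e. n squared terms plus n(n-1)/2 interactions:

```python
import math
from itertools import combinations_with_replacement

numerical_features = ["age", "income", "experience", "education_years"]

n_terms = len(list(combinations_with_replacement(numerical_features, 2)))
print(n_terms)                                     # 10 = 4 squared + 6 interactions
print(math.comb(len(numerical_features) + 1, 2))   # 10
```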

Sequence Patterns: From Manual Tracking to Permutations
Custom Code
Creating features from ordered sequences requires manual permutation logic with nested loops. This becomes complex and error-prone as sequences grow larger.
# Manual approach - implementing permutation logic
actions = ['login', 'browse', 'purchase']
sequence_patterns = []

# Manual permutation generation for 3 items
for i in range(len(actions)):
    for j in range(len(actions)):
        if i != j:  # Ensure different first and second
            for k in range(len(actions)):
                if k != i and k != j:  # Ensure all different
                    pattern = f"{actions[i]}_{actions[j]}_{actions[k]}"
                    sequence_patterns.append(pattern)

print(f"First few: {sequence_patterns[:3]}")

First few: ['login_browse_purchase', 'login_purchase_browse', 'browse_login_purchase']

With Itertools
Use permutations() to generate all possible orderings where sequence matters.
For example, permutations(['A','B','C']) yields (A,B,C), (A,C,B), (B,A,C), (B,C,A), (C,A,B), (C,B,A).

Let’s apply this to user behavior sequences:
# Automated approach with itertools.permutations
actions = ['login', 'browse', 'purchase']

# Generate all sequence permutations
sequence_patterns = ['_'.join(perm) for perm in permutations(actions)]

print(f"Permutations created: {len(sequence_patterns)} patterns")
print(f"First few: {sequence_patterns[:3]}")

Permutations created: 6 patterns
First few: ['login_browse_purchase', 'login_purchase_browse', 'browse_login_purchase']
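The pattern count grows factorially, which is exactly why the manual triple loop stops scaling:

```python
import math
from itertools import permutations

actions = ["login", "browse", "purchase"]
patterns = ["_".join(p) for p in permutations(actions)]

print(len(patterns))                  # 6
print(math.factorial(len(actions)))   # 3! = 6
# With 5 actions this jumps to 120 patterns; manual nesting would need 5 loops
```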

Cartesian Products: From Nested Loops to Product
Custom Code
Creating combinations between multiple categorical variables requires nested loops for each variable. The code complexity grows exponentially as more variables are added.
# Manual approach - nested loops for categorical combinations
education_levels = ['bachelor', 'master', 'phd']
locations = ['urban', 'suburban', 'rural']
age_groups = ['young', 'middle', 'senior']

categorical_combinations = []
for edu in education_levels:
    for loc in locations:
        for age in age_groups:
            combination = f"{edu}_{loc}_{age}"
            categorical_combinations.append(combination)

print(f"Manual nested loops created {len(categorical_combinations)} combinations")
print(f"First few: {categorical_combinations[:3]}")

Manual nested loops created 27 combinations
First few: ['bachelor_urban_young', 'bachelor_urban_middle', 'bachelor_urban_senior']

With Itertools
Use product() to generate all possible combinations across multiple lists.
For example, product(['A','B'], ['1','2']) yields (A,1), (A,2), (B,1), (B,2).

Let’s apply this to categorical features:
# Automated approach with itertools.product
education_levels = ['bachelor', 'master', 'phd']
locations = ['urban', 'suburban', 'rural']
age_groups = ['young', 'middle', 'senior']

# Generate all combinations
combinations_list = list(product(education_levels, locations, age_groups))
categorical_features = [f"{edu}_{loc}_{age}" for edu, loc, age in combinations_list]

print(f"Product created {len(categorical_features)} combinations")
print(f"First few: {categorical_features[:3]}")

Product created 27 combinations
First few: ['bachelor_urban_young', 'bachelor_urban_middle', 'bachelor_urban_senior']
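product() also accepts a repeat argument, which is convenient when the same options fill several slots, such as enumerating on/off grids. The feature-flag framing below is illustrative:

```python
from itertools import product

# All 2^3 on/off settings for three hypothetical feature flags
flag_states = list(product([False, True], repeat=3))

print(len(flag_states))   # 8
print(flag_states[0])     # (False, False, False)
```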

Efficient Sampling: From Full Data Loading to Islice
Custom Code
Sampling data for prototyping typically requires loading the entire dataset into memory first. This wastes memory and time when you only need a small subset for initial feature exploration.
import sys

# Load entire dataset into memory
large_dataset = list(range(1_000_000))

# Calculate memory usage
dataset_mb = sys.getsizeof(large_dataset) / 1024 / 1024

# Sample only what we need
sample_data = large_dataset[:10]

# Print results
print(f"Loaded dataset: {len(large_dataset)} items ({dataset_mb:.1f} MB)")
print(f"Sample data: {sample_data}")

Loaded dataset: 1000000 items (7.6 MB)
Sample data: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

With Itertools
Use islice() to extract a slice from an iterator without loading the full dataset.
For example, islice(large_data, 5, 10) yields items 5-9.

Let’s apply this to dataset sampling:
# Process only what you need
sample_data = list(islice(large_dataset, 10))

# Calculate memory usage
sample_kb = sys.getsizeof(sample_data) / 1024

# Print results
print(f"Processed dataset: {len(sample_data)} items ({sample_kb:.2f} KB)")
print(f"Sample data: {sample_data}")

Processed dataset: 10 items (0.18 KB)
Sample data: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
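Because islice() works on any iterator, it also applies to generators, where the full dataset never exists in memory at all. The record_stream generator below is an illustrative stand-in for a real data source:

```python
from itertools import islice

def record_stream():
    """Simulate an unbounded data source, yielding one record at a time."""
    i = 0
    while True:
        yield {"id": i, "value": i * 2}
        i += 1

# Only 5 records are ever created; the stream itself is never materialized
sample = list(islice(record_stream(), 5))
print([r["id"] for r in sample])  # [0, 1, 2, 3, 4]
```

The same pattern works with file handles and database cursors, since both are iterators.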

For even greater memory efficiency with large-scale feature engineering, consider Polars vs. Pandas: A Fast, Multi-Core Alternative for DataFrames.
Final Thoughts
Manual feature engineering requires complex index management and becomes error-prone as feature sets grow. These five itertools methods provide cleaner alternatives that express mathematical intent clearly:

combinations() generates feature interactions without nested loops
combinations_with_replacement() creates polynomial features systematically
permutations() creates sequence-based features from ordered data
product() builds categorical feature crosses efficiently
islice() samples large datasets without memory waste

Master combinations() and combinations_with_replacement() first; they solve the most common feature engineering challenges. The other three methods handle specialized tasks that become essential as your workflows grow more sophisticated.
For scaling feature engineering beyond memory limits with SQL-based approaches, explore A Deep Dive into DuckDB for Data Scientists.

