5 Essential Itertools for Data Science

Table of Contents

Introduction
Feature Interactions: From Nested Loops to Combinations

Custom Code
With Itertools

Polynomial Features: From Manual Assignments to Automated Generation

Custom Code
With Itertools

Sequence Patterns: From Manual Tracking to Permutations

Custom Code
With Itertools

Cartesian Products: From Nested Loops to Product

Custom Code
With Itertools

Efficient Sampling: From Full Data Loading to Islice

Custom Code
With Itertools

Final Thoughts

Introduction
Imagine you write nested loops for combinatorial features and they work great initially. However, as your feature engineering scales, this custom code becomes buggy and nearly impossible to debug or extend.
for i in range(len(numerical_features)):
    for j in range(i + 1, len(numerical_features)):
        df[f"{numerical_features[i]}_x_{numerical_features[j]}"] = (
            df[numerical_features[i]] * df[numerical_features[j]]
        )

Itertools provides battle-tested, efficient functions that make data science code faster and more reliable. Here are the five most useful functions for data science projects:

combinations() – Generate unique pairs from lists without repetition
combinations_with_replacement() – Generate combinations including self-pairs
permutations() – Generate all possible orderings
product() – Create all possible pairings across multiple lists
islice() – Extract slices from iterators without loading full datasets

Key Takeaways
Here’s what you’ll learn:

Replace complex nested loops with battle-tested itertools functions for feature interactions
Generate polynomial features systematically using combinations_with_replacement
Create sequence patterns and categorical combinations without manual index management
Sample large datasets efficiently with islice to avoid memory waste
Eliminate feature engineering bugs with mathematically precise combinatorial functions

Setup
Before we dive into the examples, let’s set up the sample dataset.
import pandas as pd
import numpy as np
from itertools import (
    combinations,
    combinations_with_replacement,
    product,
    islice,
    permutations,
)

# Create simple sample dataset
np.random.seed(42)
data = {
    "age": np.random.randint(20, 65, 20),
    "income": np.random.randint(30000, 120000, 20),
    "experience": np.random.randint(0, 40, 20),
    "education_years": np.random.randint(12, 20, 20),
}
df = pd.DataFrame(data)
numerical_features = ["age", "income", "experience", "education_years"]
print(df.head())

   age  income  experience  education_years
0   58   94925           2               15
1   48   97969          36               13
2   34   35311           6               19
3   62  113104          20               15
4   27   83707           8               13

Feature Interactions: From Nested Loops to Combinations
Custom Code
Creating feature interactions manually requires careful index management to avoid duplicates and self-interactions. While possible, this approach becomes error-prone and complex as the number of features grows.
# Manual approach: nested loops with index management
df_manual = df.copy()

for i in range(len(numerical_features)):
    for j in range(i + 1, len(numerical_features)):
        feature1, feature2 = numerical_features[i], numerical_features[j]
        interaction_name = f"{feature1}_x_{feature2}"
        df_manual[interaction_name] = df_manual[feature1] * df_manual[feature2]

print(f"First few: {list(df_manual.columns[4:7])}")

First few: ['age_x_income', 'age_x_experience', 'age_x_education_years']

With Itertools
Use combinations() to generate unique pairs from a list without repetition or order dependency.
For example, combinations(['A','B','C'], 2) yields (A,B), (A,C), (B,C).
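As a quick standalone check, you can materialize the pairs directly:

```python
from itertools import combinations

# Each pair appears once, in the order elements occur in the input
print(list(combinations(['A', 'B', 'C'], 2)))
# → [('A', 'B'), ('A', 'C'), ('B', 'C')]
```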

Let’s apply this to feature interactions:
# Automated approach with itertools.combinations
df_itertools = df.copy()

for feature1, feature2 in combinations(numerical_features, 2):
    interaction_name = f"{feature1}_x_{feature2}"
    df_itertools[interaction_name] = df_itertools[feature1] * df_itertools[feature2]

print(f"First few: {list(df_itertools.columns[4:7])}")

First few: ['age_x_income', 'age_x_experience', 'age_x_education_years']

📚 For comprehensive production practices in data science, check out Production-Ready Data Science.

Polynomial Features: From Manual Assignments to Automated Generation
Custom Code
Creating polynomial features manually requires separate handling of squared terms and interaction terms, involving complex logic to generate all degree-2 polynomial combinations.
df_manual_poly = df.copy()

# Create squared features
for feature in numerical_features:
    df_manual_poly[f"{feature}_squared"] = df_manual_poly[feature] ** 2

# Create interaction features with list slicing
for i, feature1 in enumerate(numerical_features):
    for feature2 in numerical_features[i + 1:]:
        df_manual_poly[f"{feature1}_x_{feature2}"] = (
            df_manual_poly[feature1] * df_manual_poly[feature2]
        )

# Show polynomial features created
polynomial_features = list(df_manual_poly.columns[4:])
print(f"First few polynomial features: {polynomial_features[:6]}")

First few polynomial features: ['age_squared', 'income_squared', 'experience_squared', 'education_years_squared', 'age_x_income', 'age_x_experience']

With Itertools
Use combinations_with_replacement() to generate combinations where items can repeat.
For example, combinations_with_replacement(['A','B','C'], 2) yields (A,A), (A,B), (A,C), (B,B), (B,C), (C,C).
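A quick standalone check of that yield order:

```python
from itertools import combinations_with_replacement

# Self-pairs like ('A', 'A') are included alongside regular pairs
print(list(combinations_with_replacement(['A', 'B', 'C'], 2)))
# → [('A', 'A'), ('A', 'B'), ('A', 'C'), ('B', 'B'), ('B', 'C'), ('C', 'C')]
```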

With this method, we can eliminate separate logic for squared terms and interaction terms in polynomial feature generation.
# Automated approach with combinations_with_replacement
df_poly = df.copy()

# Create features using the same combinations logic
for feature1, feature2 in combinations_with_replacement(numerical_features, 2):
    if feature1 == feature2:
        # Squared feature
        df_poly[f"{feature1}_squared"] = df_poly[feature1] ** 2
    else:
        # Interaction feature
        df_poly[f"{feature1}_x_{feature2}"] = df_poly[feature1] * df_poly[feature2]

# Show polynomial features created
polynomial_features = list(df_poly.columns[4:])
print(f"First few polynomial features: {polynomial_features[:6]}")

First few polynomial features: ['age_squared', 'age_x_income', 'age_x_experience', 'age_x_education_years', 'income_squared', 'income_x_experience']

Sequence Patterns: From Manual Tracking to Permutations
Custom Code
Creating features from ordered sequences requires manual permutation logic with nested loops. This becomes complex and error-prone as sequences grow larger.
# Manual approach: implementing permutation logic
actions = ['login', 'browse', 'purchase']
sequence_patterns = []

# Manual permutation generation for 3 items
for i in range(len(actions)):
    for j in range(len(actions)):
        if i != j:  # Ensure different first and second
            for k in range(len(actions)):
                if k != i and k != j:  # Ensure all different
                    pattern = f"{actions[i]}_{actions[j]}_{actions[k]}"
                    sequence_patterns.append(pattern)

print(f"First few: {sequence_patterns[:3]}")

First few: ['login_browse_purchase', 'login_purchase_browse', 'browse_login_purchase']

With Itertools
Use permutations() to generate all possible orderings where sequence matters.
For example, permutations(['A','B','C']) yields (A,B,C), (A,C,B), (B,A,C), (B,C,A), (C,A,B), (C,B,A).
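A quick standalone check of all six orderings:

```python
from itertools import permutations

# All 3! = 6 orderings of three items
print(list(permutations(['A', 'B', 'C'])))
# → [('A', 'B', 'C'), ('A', 'C', 'B'), ('B', 'A', 'C'),
#    ('B', 'C', 'A'), ('C', 'A', 'B'), ('C', 'B', 'A')]
```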

Let’s apply this to user behavior sequences:
# Automated approach with itertools.permutations
actions = ['login', 'browse', 'purchase']

# Generate all sequence permutations
sequence_patterns = ['_'.join(perm) for perm in permutations(actions)]

print(f"Permutations created: {len(sequence_patterns)} patterns")
print(f"First few: {sequence_patterns[:3]}")

Permutations created: 6 patterns
First few: ['login_browse_purchase', 'login_purchase_browse', 'browse_login_purchase']

Cartesian Products: From Nested Loops to Product
Custom Code
Creating combinations between multiple categorical variables requires nested loops for each variable. The code complexity grows exponentially as more variables are added.
# Manual approach: nested loops for categorical combinations
education_levels = ['bachelor', 'master', 'phd']
locations = ['urban', 'suburban', 'rural']
age_groups = ['young', 'middle', 'senior']

categorical_combinations = []
for edu in education_levels:
    for loc in locations:
        for age in age_groups:
            combination = f"{edu}_{loc}_{age}"
            categorical_combinations.append(combination)

print(f"Manual nested loops created {len(categorical_combinations)} combinations")
print(f"First few: {categorical_combinations[:3]}")

Manual nested loops created 27 combinations
First few: ['bachelor_urban_young', 'bachelor_urban_middle', 'bachelor_urban_senior']

With Itertools
Use product() to generate all possible combinations across multiple lists.
For example, product(['A','B'], ['1','2']) yields (A,1), (A,2), (B,1), (B,2).
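A quick standalone check of that pairing order:

```python
from itertools import product

# Every element of the first list paired with every element of the second
print(list(product(['A', 'B'], ['1', '2'])))
# → [('A', '1'), ('A', '2'), ('B', '1'), ('B', '2')]
```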

Let’s apply this to categorical features:
# Automated approach with itertools.product
education_levels = ['bachelor', 'master', 'phd']
locations = ['urban', 'suburban', 'rural']
age_groups = ['young', 'middle', 'senior']

# Generate all combinations
combinations_list = list(product(education_levels, locations, age_groups))
categorical_features = [f"{edu}_{loc}_{age}" for edu, loc, age in combinations_list]

print(f"Product created {len(categorical_features)} combinations")
print(f"First few: {categorical_features[:3]}")

Product created 27 combinations
First few: ['bachelor_urban_young', 'bachelor_urban_middle', 'bachelor_urban_senior']

Efficient Sampling: From Full Data Loading to Islice
Custom Code
Sampling data for prototyping typically requires loading the entire dataset into memory first. This wastes memory and time when you only need a small subset for initial feature exploration.
import sys

# Load entire dataset into memory
large_dataset = list(range(1_000_000))

# Calculate memory usage
dataset_mb = sys.getsizeof(large_dataset) / 1024 / 1024

# Sample only what we need
sample_data = large_dataset[:10]

# Print results
print(f"Loaded dataset: {len(large_dataset)} items ({dataset_mb:.1f} MB)")
print(f"Sample data: {sample_data}")

Loaded dataset: 1000000 items (7.6 MB)
Sample data: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

With Itertools
Use islice() to extract a slice from an iterator without loading the full dataset.
For example, islice(large_data, 5, 10) yields items 5-9.
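A quick standalone check, using a lazy `range` so no full list is ever built:

```python
from itertools import islice

# range() is lazy, so islice pulls only the items it needs
print(list(islice(range(1_000_000), 5, 10)))
# → [5, 6, 7, 8, 9]
```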

Let’s apply this to dataset sampling:
# Process only what you need from a lazy iterator
data_stream = iter(range(1_000_000))  # no full list materialized in memory
sample_data = list(islice(data_stream, 10))

# Calculate memory usage
sample_kb = sys.getsizeof(sample_data) / 1024

# Print results
print(f"Processed dataset: {len(sample_data)} items ({sample_kb:.2f} KB)")
print(f"Sample data: {sample_data}")

Processed dataset: 10 items (0.18 KB)
Sample data: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

For even greater memory efficiency with large-scale feature engineering, consider Polars vs. Pandas: A Fast, Multi-Core Alternative for DataFrames.
Final Thoughts
Manual feature engineering requires complex index management and becomes error-prone as feature sets grow. These five itertools methods provide cleaner alternatives that express mathematical intent clearly:

combinations() generates feature interactions without nested loops
combinations_with_replacement() creates polynomial features systematically
permutations() creates sequence-based features from ordered data
product() builds categorical feature crosses efficiently
islice() samples large datasets without memory waste

Master combinations() and combinations_with_replacement() first – they solve the most common feature engineering challenges. The other three methods handle specialized tasks that become essential as your workflows grow more sophisticated.
For scaling feature engineering beyond memory limits with SQL-based approaches, explore A Deep Dive into DuckDB for Data Scientists.

Managing Shared Data Science Code with Git Submodules

Table of Contents

What Are Git Submodules?
Using Git Submodules in Practice
Team Collaboration
Managing Submodules Through VS Code
Submodules vs Python Packaging
Conclusion

Data science teams often develop reusable code for preprocessing, feature engineering, and model utilities that multiple projects need to share. Without proper management, these shared dependencies become a source of inconsistency and wasted effort.
Consider a fintech company with three ML teams in separate repositories due to different security clearances and deployment pipelines:

Fraud Detection repo: High-security environment, quarterly releases
Credit Scoring repo: Regulatory compliance, monthly releases
Trading Algorithm repo: Real-time trading, daily releases

All three teams need the same calculate_risk_score() utility, but they can’t merge repositories due to security policies and different release cycles. Copying the utility creates version drift:
Week 1: All teams copy the same utility
Fraud Detection: calculate_risk_score() v1.0
Credit Scoring: calculate_risk_score() v1.0
Trading Algorithm: calculate_risk_score() v1.0

Week 3: Trading team fixes a critical bug but others don't know
Fraud Detection: calculate_risk_score() v1.0 (✗ still broken)
Credit Scoring: calculate_risk_score() v1.0 (✗ still broken)
Trading Algorithm: calculate_risk_score() v1.1 (✓ bug fixed)

Week 5: Each team has different versions
Fraud Detection: calculate_risk_score() v1.2 (≠ different optimization)
Credit Scoring: calculate_risk_score() v1.0 (✗ original broken version)
Trading Algorithm: calculate_risk_score() v1.3 (≠ completely different approach)

Git submodules provide the solution to this version drift problem.
Key Takeaways
Here’s what you’ll learn:

Eliminate version drift across ML projects by referencing specific code commits
Share utilities like risk calculation functions across teams without code duplication
Maintain separate repositories with different security clearances and release cycles
Update shared code versions precisely using Git submodule commands
Implement reproducible ML workflows with consistent dependency versions

What Are Git Submodules?
Git submodules let you embed one Git repository inside another as a subdirectory. Instead of copying code between projects, you reference a specific commit from a shared repository, ensuring all projects use identical code versions.
your-project/
├── main.py
└── shared-utils/     # ← Git submodule
    └── features.py

This ensures every team member gets the same shared code version, preventing the version drift shown in the example above.

📚 For comprehensive Git fundamentals and production-ready workflows that complement Git submodule techniques, check out Production-Ready Data Science.

Using Git Submodules in Practice
Consider our fintech company with fraud detection, credit scoring, and trading projects that all need shared ML utilities for risk calculation and feature engineering.
The shared ml-utils repository contains common ML functions:
ml-utils/
├── __init__.py
├── features.py
└── README.md

# features.py
import pandas as pd


def calculate_risk_score(data):
    return data['income'] / max(data['debt'], 1)


def extract_time_features(df, time_col):
    df['hour'] = pd.to_datetime(df[time_col]).dt.hour
    df['is_weekend'] = pd.to_datetime(df[time_col]).dt.dayofweek.isin([5, 6])
    return df


def calculate_velocity(df, user_col, time_col):
    df = df.copy()
    df['transaction_count_1h'] = df.groupby(user_col)[time_col].transform('count')
    return df
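To see these utilities in action, here is a small self-contained sketch; the sample income, debt, and timestamp values below are invented for illustration and are not from the original post:

```python
import pandas as pd


def calculate_risk_score(data):
    return data['income'] / max(data['debt'], 1)


def extract_time_features(df, time_col):
    df['hour'] = pd.to_datetime(df[time_col]).dt.hour
    df['is_weekend'] = pd.to_datetime(df[time_col]).dt.dayofweek.isin([5, 6])
    return df


# Hypothetical sample values, purely for demonstration
print(calculate_risk_score({'income': 80000, 'debt': 20000}))  # → 4.0

transactions = pd.DataFrame(
    {'transaction_time': ['2024-01-06 14:30:00', '2024-01-08 09:15:00']}
)
transactions = extract_time_features(transactions, 'transaction_time')
print(transactions[['hour', 'is_weekend']])
# hour: 14 and 9; is_weekend: True (Saturday) and False (Monday)
```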

Imagine your fraud detection project looks like this:
fraud-detection/
├── main.py
└── README.md

To add the shared utilities to your fraud detection project, you can run:
git submodule add https://github.com/khuyentran1401/ml-utils.git ml-utils

This will transform the structure of your project to:
fraud-detection/
├── main.py
├── README.md
├── .gitmodules
└── ml-utils/          # ← Submodule directory
    ├── features.py    # Shared ML functions
    ├── __init__.py
    └── README.md

The .gitmodules file tracks the submodule configuration:
[submodule "ml-utils"]
    path = ml-utils
    url = https://github.com/khuyentran1401/ml-utils.git

Now you can use the shared utilities in your fraud detection pipeline:
# fraud_detection/train_model.py
from ml_utils.features import extract_time_features, calculate_velocity


def prepare_fraud_features(transactions_df):
    # Extract time-based features for fraud detection
    df = extract_time_features(transactions_df, 'transaction_time')

    # Calculate transaction velocity features
    df = calculate_velocity(df, 'user_id', 'transaction_time')

    return df


# Fraud detection model uses consistent utilities
fraud_features = prepare_fraud_features(raw_transactions)

Team Collaboration
When a new team member joins the fraud detection team, they get the complete setup including shared ML utilities:
# Clone the fraud detection project with all ML utilities
git clone --recurse-submodules https://github.com/khuyentran1401/fraud-detection.git
cd fraud-detection

Alternatively, initialize submodules after cloning:
git clone https://github.com/khuyentran1401/fraud-detection.git
cd fraud-detection
git submodule update --init --recursive

When the code of the shared utilities is updated, you can update the submodule to the latest version:
# Update to latest ML utilities
git submodule update --remote ml-utils

This updates your local copy but doesn’t record which version your project uses. Commit this change so teammates get the same utilities version:
# Commit the submodule update
git add ml-utils
git commit -m "Update ML utilities: improved risk calculation accuracy"

For comprehensive version control of both code and data in ML projects, see our DVC guide.
Managing Submodules Through VS Code
To simplify this workflow, you can manage submodules through VS Code’s Source Control panel:

Open your main project folder in VS Code
Navigate to Source Control panel (Ctrl+Shift+G)
You’ll see separate sections for main project and each submodule
Stage and commit changes in the submodule first
Then commit the submodule update in the main project

The screenshot shows VS Code’s independent submodule management:

ml-utils submodule (top): Has staged changes ready to commit with its own message
fraud-detection main project (bottom): Shows submodule as changed, waits for submodule commit

Submodules vs Python Packaging
Python packaging lets you distribute shared utilities as installable packages:
pip install company-ml-utils==1.2.3

This works well for stable libraries with infrequent changes. However, for internal ML utilities that evolve rapidly, packaging creates bottlenecks:

Requires build/publish workflow for every change
Slower iteration during active development
Package contents are hidden – can’t debug into utility functions
Stuck with released versions – can’t access latest bug fixes until next release

Git submodules work differently by making the source code directly accessible in your project for immediate access, full debugging visibility, and precise version control.
Conclusion
Git submodules provide an effective solution for managing shared ML code across data science projects, enabling source-level access while maintaining reproducible workflows.
Use submodules when you need direct access to shared utility source code, frequent iterations on internal libraries, or full control over dependency versions. For stable, external dependencies, traditional Python packaging remains the better choice.