Formulaic: Write Clear Feature Engineering Code

Motivation

Feature engineering for statistical modeling often involves manually building interaction terms, polynomial transformations, and encodings of categorical variables. Written by hand with libraries like pandas or NumPy, this code becomes verbose, repetitive, and error-prone, especially for larger datasets or more complex models.

For example, manually encoding categorical variables and generating polynomial features might look like this:

import pandas as pd
import numpy as np

# Sample dataset
df = pd.DataFrame({
    "x": ["A", "B", "C"],
    "z": [0.3, 0.1, 0.2],
    "y": [0, 1, 2]
})

# Manual feature engineering
df["x_B"] = (df["x"] == "B").astype(int)
df["x_C"] = (df["x"] == "C").astype(int)
df["z_squared"] = df["z"] ** 2

print(df)

Output:

   x    z  y  x_B  x_C  z_squared
0  A  0.3  0    0    0       0.09
1  B  0.1  1    1    0       0.01
2  C  0.2  2    0    1       0.04

This approach becomes increasingly cumbersome as the complexity of transformations grows.
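
To make that growth concrete, here is a minimal sketch of the same manual approach once an interaction between x and z is also required. The interaction requirement and the _times_z column suffix are hypothetical additions for illustration; they are not part of the example above.

```python
import pandas as pd

# Same sample dataset as above
df = pd.DataFrame({
    "x": ["A", "B", "C"],
    "z": [0.3, 0.1, 0.2],
    "y": [0, 1, 2],
})

# Dummy-encode x, dropping the first level ("A") as the reference
dummies = pd.get_dummies(df["x"], prefix="x", drop_first=True).astype(int)

# Hypothetical extra requirement: multiply each dummy column by z
interactions = dummies.mul(df["z"], axis=0).add_suffix("_times_z")

# Assemble the feature table by hand
features = pd.concat([dummies, df[["z"]], interactions], axis=1)
print(features.columns.tolist())
```

Every new level of x or every additional interaction means more hand-written columns and more names to keep consistent.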

Introduction to Formulaic

Formulaic is a high-performance Python library that implements Wilkinson formulas, the "y ~ x + z" notation familiar from R and patsy. You describe the response, the predictors, and their transformations in a single formula string, and Formulaic builds the corresponding matrices.

You can install it via pip:

pip install formulaic

In this post, we will explore how Formulaic addresses feature engineering challenges.

Feature Engineering with Formulaic

In this example, we will use Formulaic to simplify feature engineering for the small dataset df defined above.

from formulaic import Formula

# Define a formula for feature engineering
formula = "y ~ x + I(z**2)"

# Apply the formula to get the response and design matrices
y, X = Formula(formula).get_model_matrix(df)

print("Response (y):")
print(y)

print("\nDesign Matrix (X):")
print(X)

Here’s how Formulaic simplifies the process:

  • "y ~ x + I(z**2)" defines the relationships:
    • y is the response variable.
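    • x is a string column, so it is treated as a categorical predictor and dummy-encoded (treatment coding): the first level, A, is the reference, and B and C each get an indicator column.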
    • I(z**2) computes the square of z (polynomial transformation).
  • Formula.get_model_matrix() automatically generates:
    • A response matrix (y).
    • A design matrix (X) containing the intercept, encoded categorical variables, and specified transformations.

The above code produces:

Response (y):
   y
0  0
1  1
2  2

Design Matrix (X):
   Intercept  x[T.B]  x[T.C]  I(z ** 2)
0        1.0     0.0     0.0   0.090000
1        1.0     1.0     0.0   0.010000
2        1.0     0.0     1.0   0.040000

Formulaic automatically encoded x as indicator columns (x[T.B] and x[T.C], with A as the dropped reference level), created an intercept term, and computed the square of z (I(z ** 2)), all from the single formula string.

Conclusion

Formulaic offers a high-performance, user-friendly approach to feature engineering built on Wilkinson formulas. It handles encoding, transformations, and interaction terms from a single formula string, so data scientists can focus on model development and analysis.

Link to Formulaic.

    Work with Khuyen Tran