Motivation
Feature engineering for statistical modeling often involves manual steps such as building interaction terms, applying polynomial transformations, and encoding categorical variables. Written directly in pandas or NumPy, these steps become verbose, repetitive, and error-prone, especially as the number of variables and transformations grows.
For example, manually encoding categorical variables and generating polynomial features might look like this:
import pandas as pd
import numpy as np
# Sample dataset
df = pd.DataFrame({
    "x": ["A", "B", "C"],
    "z": [0.3, 0.1, 0.2],
    "y": [0, 1, 2],
})
# Manual feature engineering
df["x_B"] = (df["x"] == "B").astype(int)
df["x_C"] = (df["x"] == "C").astype(int)
df["z_squared"] = df["z"] ** 2
print(df)
Output:
   x    z  y  x_B  x_C  z_squared
0  A  0.3  0    0    0       0.09
1  B  0.1  1    1    0       0.01
2  C  0.2  2    0    1       0.04
This approach becomes increasingly cumbersome as the complexity of transformations grows.
Introduction to Formulaic
Formulaic is a high-performance Python library that simplifies feature engineering by implementing Wilkinson formulas, the compact formula notation familiar from R and patsy. A single formula string declares the response, the predictors, and any transformations to apply to them.
You can install it with pip:
pip install formulaic
In this post, we will explore how Formulaic addresses feature engineering challenges.
Feature Engineering with Formulaic
In this example, we will use Formulaic to reproduce the feature engineering above, reusing the df defined earlier.
from formulaic import Formula
# Define a formula for feature engineering
formula = "y ~ x + I(z**2)"
# Apply the formula to get the response and design matrices
y, X = Formula(formula).get_model_matrix(df)
print("Response (y):")
print(y)
print("\nDesign Matrix (X):")
print(X)
Here’s how Formulaic simplifies the process:
- "y ~ x + I(z**2)" defines the relationships:
  - y is the response variable.
  - x is treated as a categorical predictor and is dummy-encoded (treatment coding), with its first level, A, serving as the reference.
  - I(z**2) computes the square of z (polynomial transformation).
- Formula.get_model_matrix() automatically generates:
  - A response matrix (y).
  - A design matrix (X) containing the intercept, encoded categorical variables, and specified transformations.
The above code produces:
Response (y):
   y
0  0
1  1
2  2

Design Matrix (X):
   Intercept  x[T.B]  x[T.C]  I(z ** 2)
0        1.0     0.0     0.0   0.090000
1        1.0     1.0     0.0   0.010000
2        1.0     0.0     1.0   0.040000
Formulaic automatically encoded x as dummy variables (x[T.B] and x[T.C], with level A as the reference absorbed by the intercept), created an intercept term, and computed the square of z (I(z ** 2)), all based on the provided formula.
Conclusion
Formulaic provides a high-performance, user-friendly approach to feature engineering built on Wilkinson formulas. It handles encoding, transformations, and interaction terms from a single formula string, allowing data scientists to focus on model development and analysis.