Formulaic: Write Clear Feature Engineering Code

Motivation

Feature engineering for statistical modeling often involves manually creating interaction terms, polynomial transformations, and encodings of categorical variables. Written directly with libraries like pandas or NumPy, this code becomes verbose, repetitive, and error-prone, especially for larger datasets or more complex transformations.

For example, manually encoding categorical variables and generating polynomial features might look like this:

import pandas as pd

# Sample dataset
df = pd.DataFrame({
    "x": ["A", "B", "C"],
    "z": [0.3, 0.1, 0.2],
    "y": [0, 1, 2]
})

# Manual feature engineering
df["x_B"] = (df["x"] == "B").astype(int)
df["x_C"] = (df["x"] == "C").astype(int)
df["z_squared"] = df["z"] ** 2

print(df)

Output:

   x    z  y  x_B  x_C  z_squared
0  A  0.3  0    0    0       0.09
1  B  0.1  1    1    0       0.01
2  C  0.2  2    0    1       0.04

This approach becomes increasingly cumbersome as the complexity of transformations grows.
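To see how quickly the manual version grows, here is a minimal sketch that adds an interaction between x and z to the same sample data. The column names (x_B:z, x_C:z) and the features variable are illustrative choices for this sketch, not something a library produces.

```python
import pandas as pd

# Same sample dataset as above
df = pd.DataFrame({
    "x": ["A", "B", "C"],
    "z": [0.3, 0.1, 0.2],
    "y": [0, 1, 2]
})

# Dummy-encode x, dropping the first level ("A") as the reference
dummies = pd.get_dummies(df["x"], prefix="x", drop_first=True).astype(int)

# One interaction column per dummy: each dummy multiplied row-wise by z
interactions = dummies.mul(df["z"], axis=0).add_suffix(":z")

# Assemble the feature table by hand (names here are illustrative)
features = pd.concat([dummies, df[["z"]], interactions], axis=1)
features["z_squared"] = df["z"] ** 2

print(features)
```

Every extra category level or transformation adds more hand-written columns, and the naming and ordering conventions have to be kept consistent manually.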

Introduction to Formulaic

Formulaic is a high-performance Python library that simplifies feature engineering by implementing Wilkinson formulas, the compact "y ~ x + z" notation familiar from R and patsy. A single formula string declares the response, the predictors, and the transformations to apply, and Formulaic builds the corresponding matrices.

Install it with pip:

pip install formulaic

In this post, we will explore how Formulaic addresses feature engineering challenges.

Feature Engineering with Formulaic

In this example, we will use Formulaic to rebuild the same features for the sample dataset df defined above.

from formulaic import Formula

# Define a formula for feature engineering
formula = "y ~ x + I(z**2)"

# Apply the formula to get the response and design matrices
y, X = Formula(formula).get_model_matrix(df)

print("Response (y):")
print(y)

print("\nDesign Matrix (X):")
print(X)

Here’s how Formulaic simplifies the process:

  • "y ~ x + I(z**2)" defines the relationships:
    • y is the response variable.
    • x is treated as a categorical predictor and is dummy (treatment) encoded, with the first level A as the reference.
    • I(z**2) computes the square of z (polynomial transformation).
  • Formula.get_model_matrix() automatically generates:
    • A response matrix (y).
    • A design matrix (X) containing the intercept, encoded categorical variables, and specified transformations.

The above code produces:

Response (y):
   y
0  0
1  1
2  2

Design Matrix (X):
   Intercept  x[T.B]  x[T.C]    I(z ** 2)
0        1.0     0.0     0.0  0.090000
1        1.0     1.0     0.0  0.010000
2        1.0     0.0     1.0  0.040000

Formulaic automatically encoded x into dummy variables (x[T.B] and x[T.C], with level A serving as the reference captured by the intercept), created an intercept term, and computed the square of z (I(z ** 2)), all based on the provided formula.

Conclusion

Formulaic provides a high-performance, user-friendly approach to feature engineering by leveraging the power of Wilkinson formulas. It abstracts away the complexities of encoding, transformation, and interaction term creation, allowing data scientists to focus on model development and analysis.

Link to Formulaic.

Work with Khuyen Tran
