Formulaic: Write Clear Feature Engineering Code

Motivation

Feature engineering for statistical modeling often means building interaction terms, polynomial transformations, and categorical encodings by hand. Written directly in pandas or NumPy, this code becomes verbose, repetitive, and error-prone, especially as datasets and models grow more complex.

For example, manually encoding categorical variables and generating polynomial features might look like this:

import pandas as pd
import numpy as np

# Sample dataset
df = pd.DataFrame({
    "x": ["A", "B", "C"],
    "z": [0.3, 0.1, 0.2],
    "y": [0, 1, 2]
})

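# Manual feature engineering
# Dummy-encode x by hand, treating "A" as the reference level
df["x_B"] = (df["x"] == "B").astype(int)
df["x_C"] = (df["x"] == "C").astype(int)
# Add a polynomial (squared) term for z
df["z_squared"] = df["z"] ** 2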

print(df)

Output:

   x    z  y  x_B  x_C  z_squared
0  A  0.3  0    0    0       0.09
1  B  0.1  1    1    0       0.01
2  C  0.2  2    0    1       0.04

Every additional category level or transformation needs another hand-written line, so this approach quickly becomes cumbersome as the transformations grow more complex.

Introduction to Formulaic

Formulaic is a high-performance Python implementation of Wilkinson formulas, the compact model notation familiar from R and patsy. You describe the response, the predictors, and their transformations in a single formula string, and Formulaic builds the corresponding model matrices.

Installation is straightforward. You can install it via pip:

pip install formulaic

In this post, we will explore how Formulaic addresses feature engineering challenges.

Feature Engineering with Formulaic

In this example, we will use Formulaic to simplify feature engineering for the same small df defined in the Motivation section.

from formulaic import Formula

# Define a formula for feature engineering
formula = "y ~ x + I(z**2)"

# Apply the formula to get the response and design matrices
y, X = Formula(formula).get_model_matrix(df)

print("Response (y):")
print(y)

print("\nDesign Matrix (X):")
print(X)

Here’s how Formulaic simplifies the process:

The formula "y ~ x + I(z**2)" defines the relationships:
- y is the response variable.
- x is treated as a categorical predictor and is dummy (treatment) encoded: the first level, A, becomes the reference, and each remaining level gets its own indicator column.
- I(z**2) computes the square of z (polynomial transformation).

Formula.get_model_matrix() automatically generates:
- A response matrix (y).
- A design matrix (X) containing the intercept, encoded categorical variables, and specified transformations.

The above code produces:

Response (y):
   y
0  0
1  1
2  2

Design Matrix (X):
   Intercept  x[T.B]  x[T.C]    I(z ** 2)
0        1.0     0.0     0.0  0.090000
1        1.0     1.0     0.0  0.010000
2        1.0     0.0     1.0  0.040000

Formulaic automatically encoded x as dummy variables (x[T.B] and x[T.C], with A as the reference level), created an intercept term, and computed the square of z (I(z ** 2)), all based on the provided formula.

Conclusion

Formulaic provides a high-performance, user-friendly approach to feature engineering built on Wilkinson formulas. It abstracts away the details of encoding, transformation, and interaction-term creation, letting data scientists focus on model development and analysis.

