Motivation
Preparing tabular datasets for machine learning often involves handling missing values, encoding categorical features, and normalizing continuous variables. This process can become cumbersome when using traditional tools like scikit-learn pipelines, as it requires manually defining and chaining preprocessing steps.
For instance, using scikit-learn pipelines for preprocessing might look like this:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
# Sample dataset
data = {
"age": [25, 30, None, 22, 35],
"salary": [50000, 60000, 45000, None, 80000],
"job": ["engineer", "doctor", "nurse", "engineer", None],
"target": [1, 0, 1, 0, 1],
}
df = pd.DataFrame(data)
# Define preprocessing steps for columns
numerical_features = ["age", "salary"]
categorical_features = ["job"]
numerical_transformer = Pipeline(
steps=[("imputer", SimpleImputer(strategy="mean")), ("scaler", StandardScaler())]
)
categorical_transformer = Pipeline(
steps=[
("imputer", SimpleImputer(strategy="constant", fill_value="unknown")),
("onehot", OneHotEncoder(handle_unknown="ignore")),
]
)
# Combine preprocessing steps
preprocessor = ColumnTransformer(
transformers=[
("num", numerical_transformer, numerical_features),
("cat", categorical_transformer, categorical_features),
]
)
# Apply preprocessing to dataset
X = df.drop(columns=["target"])
y = df["target"]
X_preprocessed = pd.DataFrame(preprocessor.fit_transform(X))
print(X_preprocessed)
Output:
          0         1    2    3    4    5
0 -0.677631 -0.729800  0.0  1.0  0.0  0.0
1  0.451754  0.104257  1.0  0.0  0.0  0.0
2  0.000000 -1.146829  0.0  0.0  1.0  0.0
3 -1.355262  0.000000  0.0  1.0  0.0  0.0
4  1.581139  1.772373  0.0  0.0  0.0  1.0
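Once fitted, the same preprocessor can also be reused on unseen rows. Below is a minimal sketch using a hypothetical new_df; the unseen "teacher" category is encoded as all zeros because of handle_unknown="ignore".
# Hypothetical new rows to demonstrate reusing the fitted preprocessor
new_df = pd.DataFrame(
    {
        "age": [28, None],
        "salary": [None, 52000],
        "job": ["teacher", "doctor"],
    }
)
# transform() reapplies the imputers, scaler, and one-hot encoder fitted above
print(pd.DataFrame(preprocessor.transform(new_df)))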
While functional, this approach requires manually defining and wiring together separate transformers for each feature type, which quickly becomes tedious for larger or more complex datasets.
Introduction to fastai
fastai is a deep learning library built on PyTorch, designed to be both high-level and flexible. One of its components, TabularPandas, simplifies the preparation of tabular datasets by automating common preprocessing tasks such as handling missing values, encoding categorical variables, and normalizing numerical features.
fastai can be installed via pip as follows:
pip install fastai
TabularPandas provides a streamlined interface for working with tabular data, reducing the time and effort required for preprocessing while maintaining flexibility.
Simplify Dataset Preparation with TabularPandas
Using TabularPandas, you can handle these preprocessing tasks in a single step. Here's how:
from fastai.tabular.all import (
Categorify,
FillMissing,
Normalize,
RandomSplitter,
TabularPandas,
range_of,
)
# Define preprocessing steps
procs = [FillMissing, Categorify, Normalize]
# Define categorical and continuous variables
cat_names = ["job"]
cont_names = ["age", "salary"]
y_names = "target"
# Create a TabularPandas object
to = TabularPandas(
df,
procs=procs,
cat_names=cat_names,
cont_names=cont_names,
y_names=y_names,
splits=RandomSplitter(valid_pct=0.2)(range_of(df)),
)
# Access the processed training and validation datasets
train_df = to.train.xs
valid_df = to.valid.xs
print(train_df)
Output:
   job  age_na  salary_na       age    salary
3    2       1          2 -1.556802 -0.100504
4    0       1          1  1.234705  1.507557
1    1       1          1  0.161048 -0.100504
2    3       2          1  0.161048 -1.306549
In this example:
- FillMissing fills missing values in continuous variables and adds boolean indicator columns (age_na, salary_na in the output above).
- Categorify encodes categorical variables as numeric labels.
- Normalize standardizes continuous variables to improve model performance.
- RandomSplitter splits the dataset into training and validation sets.
The output is a fully preprocessed dataset, ready for training without requiring manual configuration of preprocessing pipelines.
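Because to.train.xs and to.train.ys are plain DataFrames, the processed data can be fed into fastai's own dataloaders() and tabular_learner, or into any other estimator. Here is a minimal sketch with a scikit-learn classifier; the model choice is illustrative, not part of the original example.
from sklearn.ensemble import RandomForestClassifier
# Processed features and targets from the TabularPandas object above
xs, ys = to.train.xs, to.train.ys.values.ravel()
valid_xs, valid_ys = to.valid.xs, to.valid.ys.values.ravel()
# Fit a downstream model on the already-encoded, already-normalized data
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(xs, ys)
print(model.score(valid_xs, valid_ys))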
Conclusion
TabularPandas from fastai streamlines tabular data preparation by automating common preprocessing steps. Compared to manually assembled scikit-learn pipelines, it reduces the complexity and effort required for dataset preparation, making it an excellent choice for working with tabular data.