Simplify Tabular Dataset Preparation with TabularPandas

Motivation

Preparing tabular datasets for machine learning often involves handling missing values, encoding categorical features, and normalizing continuous variables. This process can become cumbersome when using traditional tools like scikit-learn pipelines, as it requires manually defining and chaining preprocessing steps.

For instance, using scikit-learn pipelines for preprocessing might look like this:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Sample dataset
data = {
    "age": [25, 30, None, 22, 35],
    "salary": [50000, 60000, 45000, None, 80000],
    "job": ["engineer", "doctor", "nurse", "engineer", None],
    "target": [1, 0, 1, 0, 1],
}
df = pd.DataFrame(data)

# Define preprocessing steps for columns
numerical_features = ["age", "salary"]
categorical_features = ["job"]

numerical_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="mean")), ("scaler", StandardScaler())]
)

categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="constant", fill_value="unknown")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numerical_transformer, numerical_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

# Apply preprocessing to dataset
X = df.drop(columns=["target"])
y = df["target"]

X_preprocessed = pd.DataFrame(preprocessor.fit_transform(X))

print(X_preprocessed)

Output:

          0         1    2    3    4    5
0 -0.677631 -0.729800  0.0  1.0  0.0  0.0
1  0.451754  0.104257  1.0  0.0  0.0  0.0
2  0.000000 -1.146829  0.0  0.0  1.0  0.0
3 -1.355262  0.000000  0.0  1.0  0.0  0.0
4  1.581139  1.772373  0.0  0.0  0.0  1.0

While functional, this approach requires a lot of manual effort to define the preprocessing steps for different feature types, making it less efficient for complex datasets.

Introduction to fastai

fastai is a deep learning library built on PyTorch, designed to be both high-level and flexible. One of its components, TabularPandas, simplifies the preparation of tabular datasets by automating common preprocessing tasks, such as handling missing values, encoding categorical variables, and normalizing numerical features.

fastai can be installed via pip as follows:

pip install fastai

TabularPandas provides a streamlined interface for working with tabular data, reducing the time and effort required for preprocessing while maintaining flexibility.

Simplify Dataset Preparation with TabularPandas

Using TabularPandas, you can handle preprocessing tasks in a single step. Here’s how:

from fastai.tabular.all import (
    Categorify,
    FillMissing,
    Normalize,
    RandomSplitter,
    TabularPandas,
    range_of,
)

# Define preprocessing steps
procs = [FillMissing, Categorify, Normalize]

# Define categorical and continuous variables
cat_names = ["job"]
cont_names = ["age", "salary"]
y_names = "target"

# Create a TabularPandas object
to = TabularPandas(
    df,
    procs=procs,
    cat_names=cat_names,
    cont_names=cont_names,
    y_names=y_names,
    splits=RandomSplitter(valid_pct=0.2)(range_of(df)),
)

# Access the processed training and validation datasets
train_df = to.train.xs
valid_df = to.valid.xs

print(train_df)

Output:

   job  age_na  salary_na       age    salary
3    2       1          2 -1.556802 -0.100504
4    0       1          1  1.234705  1.507557
1    1       1          1  0.161048 -0.100504
2    3       2          1  0.161048 -1.306549

In this example:

  • FillMissing fills missing values in continuous variables (using the median by default) and adds indicator columns such as age_na and salary_na.
  • Categorify encodes categorical variables into numeric labels.
  • Normalize standardizes continuous variables to zero mean and unit variance, which helps model training.
  • RandomSplitter splits the dataset into training and validation sets.

The output is a fully preprocessed dataset, ready for training without requiring manual configuration of preprocessing pipelines.
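To make the comparison concrete, here is roughly what those three procs do, sketched in plain pandas. This is an illustrative approximation, not fastai's actual implementation; among other things, fastai also stores the fitted statistics so the same transforms can be reapplied to validation and test data:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 30, None, 22, 35],
    "salary": [50000, 60000, 45000, None, 80000],
    "job": ["engineer", "doctor", "nurse", "engineer", None],
    "target": [1, 0, 1, 0, 1],
})
cont_names, cat_names = ["age", "salary"], ["job"]

# FillMissing: add *_na indicator columns, then fill with the median
for col in cont_names:
    df[f"{col}_na"] = df[col].isna()
    df[col] = df[col].fillna(df[col].median())

# Categorify: map each category to an integer code
# (pandas uses -1 for missing; shifting by +1 reserves 0 for missing,
# mirroring fastai's #na# convention)
for col in cat_names:
    df[col] = pd.Categorical(df[col]).codes + 1

# Normalize: standardize continuous columns to zero mean, unit variance
for col in cont_names:
    df[col] = (df[col] - df[col].mean()) / df[col].std()

print(df)
```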

Conclusion

TabularPandas from fastai streamlines tabular data preparation by automating common preprocessing steps. Compared to traditional tools like scikit-learn, it reduces the complexity and effort required for dataset preparation, making it an excellent choice for working with tabular data.

Link to fastai
