Feature Engineer Archives

Handling Imbalanced Datasets with imbalanced-learn

A classifier trained on an imbalanced dataset may perform well on the majority class but poorly on the minority class.

To deal with an imbalanced dataset, we can use imbalanced-learn to generate new samples for the under-represented classes.

The image above shows an imbalanced dataset before and after using RandomOverSampler from imbalanced-learn.


Automated Misspelling Correction in Datasets Using skrub

Feature Engineer / Khuyen Tran

Real-world datasets often contain misspellings, particularly in manually entered categorical variables.

To merge multiple variants of the same category, use skrub’s deduplicate function, which:

🔹 Measures the distance between strings and groups similar strings together
🔹 Replaces the strings in each group with the most common string


Avoiding Data Leakage in Time Series Analysis with TimeSeriesSplit

Feature Engineer, Time Series / Khuyen Tran

Time series data is unique because it has a temporal order. This means that data from the future shouldn’t influence predictions about the past. However, standard cross-validation techniques like K-Fold ignore this order: some folds train on later observations and test on earlier ones (and shuffling, when enabled, mixes time points further), so future information can be used to predict past events.

scikit-learn’s TimeSeriesSplit is a specialized cross-validator for time series data. It respects the temporal order of our data, ensuring that we always train on past data and test on future data.

Let’s explore how to use TimeSeriesSplit with a simple example:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4, 5, 6])

tscv = TimeSeriesSplit(n_splits=3)

for i, (train_index, test_index) in enumerate(tscv.split(X)):
    print(f"Fold {i}:")
    print(f" Train: index={train_index}")
    print(f" Test: index={test_index}")

Fold 0:
Train: index=[0 1 2]
Test: index=[3]
Fold 1:
Train: index=[0 1 2 3]
Test: index=[4]
Fold 2:
Train: index=[0 1 2 3 4]
Test: index=[5]

From the outputs, we can see that:

Temporal Integrity: The split always respects the original order of the data.

Growing Training Set: With each fold, the training set expands to include more historical data.

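Forward-Moving Test Set: The test set always comes after the training set and moves forward with each fold. Here it is a single sample because, by default, each test set contains n_samples // (n_splits + 1) = 6 // 4 = 1 sample.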

No Data Leakage: Future information is never used to predict past events.

This approach mimics real-world forecasting scenarios, where models use historical data to predict future outcomes.
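To make the leakage point concrete, here is a small sketch that checks whether any fold trains on an observation later than its earliest test observation. The helper leaks_future and the 12-row toy array are illustrative names and data, not part of scikit-learn; test_size and gap are optional TimeSeriesSplit parameters:

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

# 12 observations, assumed to be sorted by time
X = np.arange(12).reshape(-1, 1)


def leaks_future(cv, X):
    """Return True if any fold trains on an index later than its first test index."""
    return any(train.max() > test.min() for train, test in cv.split(X))


print("KFold leaks future data:          ", leaks_future(KFold(n_splits=3), X))
print("TimeSeriesSplit leaks future data:", leaks_future(TimeSeriesSplit(n_splits=3), X))

# Two test samples per fold, with one sample skipped between train and test
tscv = TimeSeriesSplit(n_splits=3, test_size=2, gap=1)
folds = [(train.tolist(), test.tolist()) for train, test in tscv.split(X)]
for train, test in folds:
    print(f"Train: {train} Test: {test}")
```

A gap can be useful when features are built from lagged or rolling values, so that the last training rows do not overlap with the first test rows.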


Upgini: Transform Raw Text into Enriched Numeric Features

Feature Engineer, Natural Language Processing / Khuyen Tran

Raw text data can lack the necessary context and factual details required for robust machine learning models.

Upgini can automatically enrich any text fields with relevant facts from external data sources and generate ready-to-use numeric features from these enriched representations.

Link to Upgini.


Mirascope: Extract Structured Data from LLM Outputs

Feature Engineer, LLM Tools / Khuyen Tran

Large Language Models (LLMs) are powerful at producing human-like text, but their outputs lack structure, which can limit their usefulness in many practical applications that require organized data.

Mirascope offers a solution by enabling reliable extraction of structured information from LLM outputs.

The following code uses Mirascope to extract meeting details such as topic, date, time, and participants.

Link to Mirascope.


MICEforest: An Iterative Predictive Modeling Approach to Missing Data Imputation

Feature Engineer, Machine Learning Tools / Khuyen Tran

The miceforest library is a Python tool for imputing missing data using an iterative series of predictive models. In each iteration, every variable with missing values is imputed using the other variables. The iterations continue until the imputations appear to have converged.

In the example above, the correlation between A and B is brought much closer to the original data after imputing A using B and C, and then imputing B using A and C.

Link to miceforest.


FunctionTransformer: Build Robust Preprocessing Pipelines with Custom Transformations

Feature Engineer / Khuyen Tran

If you want to construct a transformer from an arbitrary callable, use the FunctionTransformer class in scikit-learn.

The FunctionTransformer enables integrating your custom function seamlessly into scikit-learn’s pipeline framework, making it easier to build complex preprocessing workflows and ensure consistent application of transformations across different datasets.
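As a rough illustration, the sketch below wraps NumPy's log1p in a FunctionTransformer and chains it with a scaler. The column names and values are invented for the example:

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Hypothetical skewed features
df = pd.DataFrame(
    {"income": [20_000.0, 50_000.0, 1_000_000.0], "spend": [100.0, 400.0, 9_000.0]}
)

# Wrap an ordinary function so it exposes fit/transform like any other transformer
log_transformer = FunctionTransformer(np.log1p)

# The custom step now composes with built-in transformers
pipeline = Pipeline([("log", log_transformer), ("scale", StandardScaler())])
X_scaled = pipeline.fit_transform(df)

print(X_scaled)
```

Since the function lives inside the pipeline, the same transformation is applied when calling pipeline.transform on new data.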


Galactic: Clean and Analyze Massive Text Datasets

Analyze Data, Feature Engineer, Natural Language Processing / Khuyen Tran

If you want to clean, gain insights from, and create embeddings for massive unstructured text datasets, use Galactic.

Link to Galactic.


Enhancing Data Handling with scikit-learn’s DataFrame Support

Feature Engineer, Pandas / Khuyen Tran

By default, scikit-learn transformers return a NumPy array. This can pose a challenge if a pandas DataFrame is required for subsequent data processing steps.

Luckily, since scikit-learn version 1.2, you can use the set_output method to obtain the results as a pandas DataFrame.

This method is not limited to individual transformers but can also be applied within a scikit-learn pipeline.

