
Feature Engineer

Handling Imbalanced Datasets with imbalanced-learn

A classifier trained on an imbalanced dataset may perform well on the majority class but poorly on the minority class.

To deal with an imbalanced dataset, we can use imbalanced-learn to generate new samples in the classes which are under-represented. 

The image above shows an imbalanced dataset before and after using RandomOverSampler from imbalanced-learn.
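To make the idea concrete, here is a minimal standard-library sketch of what random oversampling does. It is not imbalanced-learn's implementation; the `random_oversample` helper and the toy data are assumptions made for illustration. In imbalanced-learn itself, the same operation is exposed as `RandomOverSampler().fit_resample(X, y)`.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of smaller classes until every class
    has as many rows as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_res, y_res = list(X), list(y)
    for label, count in counts.items():
        rows = [x for x, lbl in zip(X, y) if lbl == label]
        # Sample with replacement from the existing rows of this class.
        extra = rng.choices(rows, k=target - count)
        X_res.extend(extra)
        y_res.extend([label] * len(extra))
    return X_res, y_res


# Toy data: 8 majority-class rows and 2 minority-class rows.
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [0.7], [0.8], [5.0], [5.5]]
y = [0] * 8 + [1] * 2

X_res, y_res = random_oversample(X, y)
print(Counter(y), "->", Counter(y_res))
```

Because the new rows are exact copies of existing minority rows, oversampling should only be applied to the training split, never to the data used for evaluation.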


Automated Misspelling Correction in Datasets Using skrub

Real-world datasets often contain misspellings, particularly in manually entered categorical variables.

To merge multiple variants of the same category, use skrub’s deduplicate function, which:

🔹 Measures the distance between strings and groups similar strings together
🔹 Replaces the strings in each group with the most common string
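The two steps above can be sketched with the standard library alone. This is a simplified stand-in and not skrub's algorithm: `dedupe_strings`, the greedy grouping, the `difflib` similarity ratio, and the 0.8 threshold are all assumptions chosen for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher


def dedupe_strings(values, threshold=0.8):
    """Group similar strings and replace each one with the most common
    string in its group (hypothetical helper)."""
    counts = Counter(values)
    canonical = []  # one representative per group, most frequent first
    mapping = {}
    for word, _ in counts.most_common():
        # Attach the word to the first representative that is similar enough.
        match = next(
            (c for c in canonical
             if SequenceMatcher(None, word, c).ratio() >= threshold),
            None,
        )
        if match is None:
            canonical.append(word)
            match = word
        mapping[word] = match
    return [mapping[v] for v in values]


# Toy column with a few manually entered misspellings.
cities = (["london"] * 5 + ["londn"] + ["paris"] * 4 + ["pariss"]
          + ["berlin"] * 3 + ["berlim"])
cleaned = dedupe_strings(cities)
print(Counter(cleaned))
```

Visiting strings from most to least frequent means a rare misspelling is always mapped onto a more common spelling rather than the other way around.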


Avoiding Data Leakage in Time Series Analysis with TimeSeriesSplit

Time series data is unique because it has a temporal order. This means that data from the future shouldn’t influence predictions about the past. However, standard cross-validation techniques like K-Fold ignore this order: whether or not the data is shuffled, some folds train on observations that come after the test set, so future information ends up being used to predict past events.

scikit-learn’s TimeSeriesSplit is a specialized cross-validator for time series data. It respects the temporal order of our data, ensuring that we always train on past data and test on future data.

Let’s explore how to use TimeSeriesSplit with a simple example:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4, 5, 6])

tscv = TimeSeriesSplit(n_splits=3)

# Print the train and test indices of each fold
for i, (train_index, test_index) in enumerate(tscv.split(X)):
    print(f"Fold {i}:")
    print(f" Train: index={train_index}")
    print(f" Test: index={test_index}")

Fold 0:
 Train: index=[0 1 2]
 Test: index=[3]
Fold 1:
 Train: index=[0 1 2 3]
 Test: index=[4]
Fold 2:
 Train: index=[0 1 2 3 4]
 Test: index=[5]

From the outputs, we can see that:

Temporal Integrity: The split always respects the original order of the data.

Growing Training Set: With each fold, the training set expands to include more historical data.

Forward-Moving Test Set: In this example, each test set is a single future sample (6 samples split across 3 folds), and it moves forward with each fold.

No Data Leakage: Future information is never used to predict past events.

This approach mimics real-world forecasting scenarios, where models use historical data to predict future outcomes.
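As a further sketch, the snippet below contrasts K-Fold with TimeSeriesSplit on a hypothetical series of 12 ordered observations. It also uses two optional TimeSeriesSplit parameters, `test_size` and `gap`, to get larger test sets and to leave a one-step buffer between training and test data. The `leaks_future` helper is an illustrative assumption, not part of scikit-learn.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 observations in time order


def leaks_future(train_idx, test_idx):
    """True if any training index comes after the earliest test index."""
    return bool(train_idx.max() > test_idx.min())


# Plain K-Fold (no shuffling): early test folds are trained on later data.
kfold_leaks = [leaks_future(tr, te) for tr, te in KFold(n_splits=3).split(X)]

# TimeSeriesSplit with 2 test samples per fold and a gap of 1 sample
# between the end of the training set and the start of the test set.
tscv = TimeSeriesSplit(n_splits=3, test_size=2, gap=1)
ts_folds = [(tr.tolist(), te.tolist()) for tr, te in tscv.split(X)]
ts_leaks = [max(tr) > min(te) for tr, te in ts_folds]

print("KFold leaks per fold:          ", kfold_leaks)
print("TimeSeriesSplit leaks per fold:", ts_leaks)
for tr, te in ts_folds:
    print(f"train={tr} test={te}")
```

A gap is useful when features are built from lagged or rolling values, since the samples right before the test window may otherwise share information with it.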

Upgini: Transform Raw Text into Enriched Numeric Features

Raw text data can lack the necessary context and factual details required for robust machine learning models. 

Upgini can automatically enrich any text fields with relevant facts from external data sources and generate ready-to-use numeric features from these enriched representations.

Link to Upgini.

Mirascope: Structured Data Extraction from LLM Outputs

Large Language Models (LLMs) are powerful at producing human-like text, but their outputs are unstructured by default, which limits their usefulness in practical applications that require organized data.

Mirascope offers a solution by enabling reliable extraction of structured information from LLM outputs.

The following code uses Mirascope to extract meeting details such as topic, date, time, and participants.
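The Mirascope code itself is not reproduced on this page. As a library-agnostic sketch of the target structure only, the snippet below parses a JSON string into a typed record with the four fields mentioned above. It does not use Mirascope's API: the `MeetingDetails` class, the `parse_meeting` helper, and the hard-coded `llm_output` string (standing in for a model response) are all assumptions.

```python
import json
from dataclasses import dataclass


@dataclass
class MeetingDetails:
    """Hypothetical schema for the extracted meeting details."""
    topic: str
    date: str
    time: str
    participants: list[str]


def parse_meeting(raw: str) -> MeetingDetails:
    """Parse a JSON string into MeetingDetails, failing loudly on bad input."""
    data = json.loads(raw)
    # Raises TypeError if a field is missing or an unexpected field is present.
    details = MeetingDetails(**data)
    if not isinstance(details.participants, list):
        raise ValueError("participants must be a list")
    return details


# Stand-in for an LLM response that was asked to answer in JSON.
llm_output = (
    '{"topic": "Q3 roadmap", "date": "2024-07-15", '
    '"time": "14:00", "participants": ["Ana", "Ben"]}'
)

meeting = parse_meeting(llm_output)
print(meeting)
```

The value of a structured-extraction library is that it handles the prompting, schema definition, and validation for you, instead of leaving them to hand-written parsing like this.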

Link to Mirascope.

MICEforest: An Iterative Predictive Modeling Approach to Missing Data Imputation

The miceforest library is a Python tool for imputing missing data in a dataset using an iterative series of predictive models. In each iteration, every variable with missing values is imputed using the other variables. Iterations continue until the imputed values appear to have converged.

In the example above, the correlation between A and B is brought much closer to the original data after imputing A using B and C, and then imputing B using A and C.
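The same iterative idea can be sketched with scikit-learn's `IterativeImputer`, a related chained-imputation tool. This is a swapped-in technique, not miceforest's API, and the synthetic columns A, B, and C below are assumptions made for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
n = 200

# Synthetic correlated columns: A depends on C, and B depends on A.
c = rng.normal(size=n)
a = c + rng.normal(scale=0.3, size=n)
b = a + rng.normal(scale=0.3, size=n)
X_full = np.column_stack([a, b, c])

# Hide roughly 20% of the values in A and B; C stays fully observed.
mask = rng.random(X_full.shape) < 0.2
mask[:, 2] = False
X_missing = X_full.copy()
X_missing[mask] = np.nan

# Each round models every incomplete column from the other columns.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X_missing)

print("corr(A, B) original:", np.corrcoef(X_full[:, 0], X_full[:, 1])[0, 1])
print("corr(A, B) imputed: ", np.corrcoef(X_imputed[:, 0], X_imputed[:, 1])[0, 1])
```

Comparing the correlation before and after imputation is a quick sanity check that the imputed values preserve the relationships between variables.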

Link to miceforest.

FunctionTransformer: Build Robust Preprocessing Pipelines with Custom Transformations

If you want to construct a transformer from an arbitrary callable, use the FunctionTransformer class in scikit-learn.

The FunctionTransformer enables integrating your custom function seamlessly into scikit-learn’s pipeline framework, making it easier to build complex preprocessing workflows and ensure consistent application of transformations across different datasets.
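As a minimal sketch, the snippet below wraps NumPy's `log1p` in a FunctionTransformer and places it ahead of a StandardScaler in a pipeline. The choice of `log1p`, the step names, and the toy array are assumptions made for illustration.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Wrap a plain function as a transformer; inverse_func is optional.
log_transformer = FunctionTransformer(np.log1p, inverse_func=np.expm1)

# The wrapped function now behaves like any other pipeline step.
pipeline = Pipeline([
    ("log", log_transformer),
    ("scale", StandardScaler()),
])

X = np.array([[0.0, 10.0], [1.0, 100.0], [3.0, 1000.0]])

X_log = log_transformer.fit_transform(X)
X_scaled = pipeline.fit_transform(X)

print(X_log)
print(X_scaled)
```

Because the transformation lives inside the pipeline, the same function is applied at both fit time and predict time, which avoids preprocessing drift between training and new data.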

Enhancing Data Handling with scikit-learn’s DataFrame Support

By default, scikit-learn transformers return a NumPy array. This can pose a challenge if a pandas DataFrame is required for subsequent data processing steps.

Luckily, since scikit-learn version 1.2, you can use the set_output method to obtain the results as a pandas DataFrame.

This method is not limited to individual transformers but can also be applied within a scikit-learn pipeline.



    Work with Khuyen Tran
