Feature Engineer Archives

Encode Rare Labels with Feature-engine

Leave a Comment / Feature Engineer / Khuyen Tran

When dealing with features with high cardinality, you might want to mark the rare categories as “Other”. Feature-engine’s RareLabelEncoder makes it easy for you to do so.

For example, RareLabelEncoder can replace the categories in the column “education” whose frequency is below 0.05 with “Other”.

Link to Feature-engine.

My previous tips on feature engineering.

Split Data in a Stratified Fashion in scikit-learn

When using scikit-learn’s train_test_split, if you want to keep the proportion of classes in the sample the same as the proportion of classes in the entire dataset, use stratify=y.
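
A short sketch of the stratified split described above, assuming scikit-learn and NumPy are installed. The imbalanced labels (80 zeros, 20 ones), the test_size, and the random_state are illustrative choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced dataset: 80% class 0, 20% class 1.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Share of class 1 in the full data and in each split.
print(y.mean(), y_train.mean(), y_test.mean())
```

With stratify=y, the share of class 1 in both the training and the test split matches the share in the full dataset.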

My previous tips on feature engineering in Python.

Return a DataFrame When Using a scikit-learn’s Transformer

Applying a scikit-learn transformer to your DataFrame returns a NumPy array by default. If you want a pandas DataFrame instead, wrap the scikit-learn transformer in SklearnTransformerWrapper.

SklearnTransformerWrapper is a class from Feature-engine.

My previous tips on tools for feature engineering.

yarl: Build a URL Using Python

If you want to use Python to quickly build a URL from components such as scheme, host, port, path, query, and fragment, try yarl.

yarl also makes it easy to replace one part with another, such as a query.
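
A small sketch of both steps, assuming yarl is installed. The host, port, path, and query values are made up for illustration.

```python
from yarl import URL

# Build a URL from its individual components.
url = URL.build(
    scheme="https",
    host="example.com",
    port=8080,
    path="/search",
    query={"q": "python", "page": "1"},
    fragment="results",
)

# Replace the query; URL objects are immutable, so this returns a new URL.
updated = url.with_query({"q": "pandas", "page": "2"})

print(url)
print(updated)
```

Because with_query returns a new object, the original url is left unchanged and both versions can be used side by side.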

Link to yarl.

My previous tips on feature extraction.

Similarity Encoding for Dirty Categories Using dirty_cat

When encoding categorical variables, you might want to capture the similarities among these categories such as ‘Master Police Officer’ and ‘Police Officer III’. If so, use dirty-cat.

dirty-cat’s SimilarityEncoder encodes the titles while capturing their similarities.

The correlation matrix of the encoded values shows how similar two labels are. In this example, the similarity between ‘Master Police Officer’ and ‘Police Officer III’ is 0.86.
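
To give a feel for why such titles end up close together, here is a simplified, standard-library-only illustration of character n-gram similarity. It is not dirty-cat’s SimilarityEncoder and will not reproduce the 0.86 figure; the helper names ngrams and ngram_similarity and the third title are hypothetical.

```python
def ngrams(text, n=3):
    """Return the set of character n-grams of a padded, lowercased string."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


titles = ["Master Police Officer", "Police Officer III", "Librarian"]

# Pairwise similarity matrix: each row compares one title with all titles.
matrix = [[ngram_similarity(a, b) for b in titles] for a in titles]

for title, row in zip(titles, matrix):
    print(f"{title:<22}", [round(value, 2) for value in row])
```

The two police titles share many 3-grams and therefore score much higher with each other than either does with “Librarian”.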

Link to dirty-cat.

Link to my full article about dirty-cat.

Feature-engine: Drop Correlated Features

If you want to remove the correlated variables from a dataframe, use feature_engine.DropCorrelatedFeatures.

For example, setting threshold=0.8 drops the variables with a correlation above 0.8.

Link to feature-engine.

Google Colab notebook of the code snippet above.

Datacommons: Get Statistics about a Location in One Line of Code

If you want to get some interesting statistics about a location in one line of code, try Datacommons. Datacommons provides publicly available data from open sources (census.gov, cdc.gov, data.gov, etc.). For example, you can use Datacommons to get the median income in California over time.

Find other interesting statistics using Datacommons here.

Link to Datacommons.

squared=False: Get RMSE from Sklearn’s mean_squared_error method

If you want to get the root mean squared error using sklearn, pass squared=False to sklearn’s mean_squared_error function.

Link to the source code.
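
A short sketch of this tip, assuming scikit-learn and NumPy are installed; the sample values are illustrative. Newer scikit-learn releases deprecate the squared argument in favor of a separate root_mean_squared_error function, so the sketch also computes the RMSE as the square root of the MSE, which works regardless of version.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

# Version-independent route: take the square root of the MSE.
rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))

# The route from the tip; releases that no longer accept `squared`
# raise a TypeError, which is caught here.
try:
    rmse_flag = float(mean_squared_error(y_true, y_pred, squared=False))
except TypeError:
    rmse_flag = None

print(rmse, rmse_flag)
```

Where squared=False is still accepted, both values agree.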

fastai’s cont_cat_split: Get a DataFrame’s Continuous and Categorical Variables Based on Their Cardinality

To get a DataFrame’s continuous and categorical variables based on their cardinality, use fastai’s cont_cat_split function.

If a column consists of integers but its cardinality does not exceed the max_card parameter, it is treated as a categorical variable.

Link to the source code.

Link to the documentation.
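
The rule can be illustrated with a plain-pandas sketch. This is not fastai’s code: split_cont_cat is a hypothetical helper that mirrors the behavior described above (float columns and high-cardinality integer columns are continuous, everything else is categorical), and the toy data is invented.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype


def split_cont_cat(df, max_card=20):
    """Split column names into (continuous, categorical) by dtype and cardinality."""
    cont_names, cat_names = [], []
    for col in df.columns:
        high_card_int = is_integer_dtype(df[col]) and df[col].nunique() > max_card
        if high_card_int or is_float_dtype(df[col]):
            cont_names.append(col)
        else:
            cat_names.append(col)
    return cont_names, cat_names


df = pd.DataFrame(
    {
        "rating": [1, 2, 3, 1, 2, 3],           # integers, 3 distinct values
        "user_id": [10, 11, 12, 13, 14, 15],    # integers, 6 distinct values
        "price": [9.5, 3.2, 7.1, 4.4, 8.8, 1.0],
        "color": ["red", "blue", "red", "green", "blue", "red"],
    }
)

cont, cat = split_cont_cat(df, max_card=3)
print(cont)
print(cat)
```

With max_card=3, “rating” is treated as categorical because it has only 3 distinct values, while “user_id” has 6 and is treated as continuous.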

fastai’s df_shrink: Shrink DataFrame’s Memory Usage in One Line of Code

Changing the data types of DataFrame columns to smaller data types can significantly reduce the memory usage of the DataFrame.

Instead of manually choosing smaller data types, is there a way to change data types automatically in one line of code?

That is when fastai’s df_shrink function comes in handy. In the author’s example, the memory usage of the DataFrame decreases from 200 bytes to 146 bytes.

Learn more about df_shrink here.

Link to the source code.
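
To show the idea behind shrinking data types, here is a plain-pandas sketch that downcasts numeric columns. It is not fastai’s df_shrink: shrink_numeric is a hypothetical helper that covers numeric columns only, and the toy data is invented.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype


def shrink_numeric(df):
    """Return a copy of df with numeric columns downcast to smaller dtypes."""
    out = df.copy()
    for col in out.columns:
        if is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out


df = pd.DataFrame(
    {
        "count": np.arange(100, dtype="int64"),   # values fit in a small integer type
        "ratio": np.linspace(0.0, 1.0, 100),      # float64 by default
    }
)

small = shrink_numeric(df)
before = int(df.memory_usage(deep=True).sum())
after = int(small.memory_usage(deep=True).sum())

print(df.dtypes.to_dict(), before)
print(small.dtypes.to_dict(), after)
```

The values are unchanged up to floating-point precision, but each column now uses a smaller data type and the DataFrame takes less memory.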
