Similarity Encoding for Dirty Categories Using dirty_cat

When encoding categorical variables, you might want to capture the similarity between categories that are nearly the same, such as ‘Master Police Officer’ and ‘Police Officer III’. If so, use dirty-cat.

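I use dirty-cat’s SimilarityEncoder to encode the titles while capturing their similarities: each title becomes a vector of its string similarities to the categories seen during fitting, so near-identical titles end up with near-identical encodings.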
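To make the idea concrete, here is a minimal pure-Python sketch of similarity encoding. It is not dirty-cat’s implementation: it assumes a plain Jaccard similarity over character 3-grams, and the helper names (`char_ngrams`, `ngram_similarity`, `similarity_encode`) and the ‘Fire Fighter’ title are made up for illustration. The scores it produces will therefore not necessarily match the library’s.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings (0 to 1)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        return 0.0
    return len(grams_a & grams_b) / len(union)


def similarity_encode(values, categories, n=3):
    """Encode each value as its similarities to every reference category."""
    return [[ngram_similarity(v, c, n) for c in categories] for v in values]


# 'Fire Fighter' is a hypothetical extra title, added only for contrast.
titles = ["Master Police Officer", "Police Officer III", "Fire Fighter"]
encoded = similarity_encode(titles, titles)

# Print one encoded row per title; each column is the similarity to one category.
for title, row in zip(titles, encoded):
    print(f"{title:<22}", [round(s, 2) for s in row])
```

With dirty-cat itself, the equivalent step would be fitting `SimilarityEncoder` on the title column (passed as a 2D array, as with any scikit-learn transformer) and calling `fit_transform`.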

The correlation matrix shows how similar each pair of labels is, based on the encoded values. For example, the similarity between ‘Master Police Officer’ and ‘Police Officer III’ is 0.86.

Link to dirty-cat.

Link to my full article about dirty-cat.
