When working with datasets from different sources, it’s common to encounter non-exact matching entries. For instance, a country name might be listed as “Yemen*” in one table and “Yemen, Rep.” in another. In such cases, using a traditional join operation might not yield the desired results.
Let’s consider two tables: the Happiness Report 2022 dataset and the World Bank’s GDP per capita dataset. We want to join these tables in the country name column, but we noticed that the country names don’t match exactly.
import pandas as pd
df = pd.read_csv(
"https://raw.githubusercontent.com/skrub-data/datasets/master/data/Happiness_report_2022.csv",
thousands=",",
)
df = df[["Country", "Happiness score"]]
from skrub.datasets import fetch_world_bank_indicator
gdppc = fetch_world_bank_indicator(indicator_id="NY.GDP.PCAP.CD").X
Let’s take a look at the last few rows of each table to see the non-exact matching entries:
print(df.sort_values(by="Country").tail(6))
Country Happiness score
107 Venezuela 4925.0
76 Vietnam 5485.0
131 Yemen* 4197.0
136 Zambia 3760.0
143 Zimbabwe 2995.0
146 xx NaN
print(gdppc.sort_values(by="Country Name").tail(5))
Country Name GDP per capita (current US$)
193 West Bank and Gaza 3789.327966
255 World 12647.480789
258 Yemen, Rep. 701.714878
260 Zambia 1487.907764
261 Zimbabwe 1266.996031
Introducing Skrub’s fuzzy_join()
Skrub’s fuzzy_join()
function is designed to handle non-exact matching entries by performing a fuzzy match between the join key columns. This allows you to join tables even when the join keys don’t match exactly.
Now, let’s use Skrub’s fuzzy_join()
function to join the two tables on the country name column:
from skrub import fuzzy_join
df1 = fuzzy_join(
df, # our table to join
gdppc, # the table to join with
left_on="Country", # the first join key column
right_on="Country Name", # the second join key column
return_score=True,
)
print(df1[['Country', 'Country Name', "matching_score"]].tail(20))
Country Country Name matching_score
127 Madagascar* Madagascar 0.795045
128 Egypt Egypt, Arab Rep. 0.654033
129 Chad* Chad 0.683373
130 Ethiopia Ethiopia 1.000000
131 Yemen* Yemen, Rep. 0.653668
132 Mauritania* Mauritania 0.810736
133 Jordan Jordan 1.000000
134 Togo Togo 1.000000
135 India India 1.000000
136 Zambia Zambia 1.000000
137 Malawi Malawi 1.000000
138 Tanzania Tanzania 1.000000
139 Sierra Leone Sierra Leone 1.000000
140 Lesotho* Lesotho 0.755238
141 Botswana* Botswana 0.795825
142 Rwanda* Rwanda 0.754604
143 Zimbabwe Zimbabwe 1.000000
144 Lebanon Lebanon 1.000000
145 Afghanistan Afghanistan 1.000000
146 xx East Asia & Pacific (excluding high income) 0.500000
The resulting table shows the joined data with a matching score indicating the similarity between the join key values.
By using Skrub’s fuzzy_join()
function, we can effectively join tables with non-exact matching entries and perform analysis on the combined data.