Splink: Fast and Accurate Probabilistic Record Linkage

Khuyen Tran

Matching and deduplicating records across multiple datasets without unique identifiers is a time-consuming and error-prone process.

Splink solves this problem with probabilistic record linkage, enabling you to deduplicate and link records quickly and accurately.

Key benefits include:

Fast processing: Link 1 million records on a laptop in about a minute
High accuracy: Advanced term frequency adjustments and customizable fuzzy matching logic
Unsupervised learning: No training data required
Interactive outputs: Explore and diagnose linkage issues with intuitive visualizations

Historical People: Quick and Dirty Record Linkage Example

This example demonstrates how to obtain initial record linkage results as quickly as possible using the Splink library.

Importing Libraries and Loading Data

from splink.datasets import splink_datasets
from splink import block_on, SettingsCreator
import splink.comparison_library as cl
from splink import Linker, DuckDBAPI

# Load the historical 50k dataset
df = splink_datasets.historical_50k
df.head(5)

unique_id	full_name	first_and_surname	first_name	surname	dob	birth_place	postcode_fake	occupation
Q2296770-1	thomas clifford, 1st baron clifford of chudleigh	thomas chudleigh	thomas	chudleigh	1630-08-01	devon	tq13 8df	politician
Q2296770-2	thomas of chudleigh	thomas chudleigh	thomas	chudleigh	1630-08-01	devon	tq13 8df	politician
Q2296770-3	tom 1st baron clifford of chudleigh	tom chudleigh	tom	chudleigh	1630-08-01	devon	tq13 8df	politician
Q2296770-4	thomas 1st chudleigh	thomas chudleigh	thomas	chudleigh	1630-08-01	devon	tq13 8hu	politician
Q2296770-5	thomas clifford, 1st baron chudleigh	thomas chudleigh	thomas	chudleigh	1630-08-01	devon	tq13 8df	politician

Defining Settings

# Define the settings for the record linkage model
settings = SettingsCreator(
    # Deduplicate records within a single dataset
    link_type="dedupe_only",
    # Only record pairs that satisfy at least one of these rules are scored
    blocking_rules_to_generate_predictions=[
        block_on("full_name"),
        block_on("substr(full_name,1,6)", "dob", "birth_place"),
        block_on("dob", "birth_place"),
        block_on("postcode_fake"),
    ],
    # One comparison per field, each defining how similarity is assessed
    comparisons=[
        cl.ForenameSurnameComparison(
            "first_name",
            "surname",
            forename_surname_concat_col_name="first_and_surname",
        ),
        cl.DateOfBirthComparison("dob", input_is_string=True),
        cl.LevenshteinAtThresholds("postcode_fake", 2),
        cl.JaroWinklerAtThresholds("birth_place", 0.9).configure(
            term_frequency_adjustments=True
        ),
        cl.ExactMatch("occupation").configure(term_frequency_adjustments=True),
    ],
)
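
To make these settings more concrete, here is a minimal pure-Python sketch of two ideas they rely on: blocking (only pairs that share a blocking key are compared) and the similarity levels behind a comparison such as cl.LevenshteinAtThresholds("postcode_fake", 2). It is an illustration under assumptions, not Splink's implementation: Splink runs these steps as SQL in the backend, the helper names (candidate_pairs, levenshtein, postcode_level) are made up for this sketch, and the record X-1 is a hypothetical addition to three of the sample rows.

```python
from itertools import combinations

# Three rows from the sample table, plus one hypothetical record (X-1)
records = [
    {"unique_id": "Q2296770-1", "dob": "1630-08-01", "birth_place": "devon",
     "full_name": "thomas clifford, 1st baron clifford of chudleigh",
     "postcode_fake": "tq13 8df"},
    {"unique_id": "Q2296770-3", "dob": "1630-08-01", "birth_place": "devon",
     "full_name": "tom 1st baron clifford of chudleigh",
     "postcode_fake": "tq13 8df"},
    {"unique_id": "Q2296770-4", "dob": "1630-08-01", "birth_place": "devon",
     "full_name": "thomas 1st chudleigh", "postcode_fake": "tq13 8hu"},
    {"unique_id": "X-1", "dob": "1801-02-03", "birth_place": "kent",
     "full_name": "jane example", "postcode_fake": "ct1 2ab"},  # hypothetical
]

# Each function maps a record to a blocking key, mirroring the block_on rules
blocking_rules = [
    lambda r: (r["full_name"],),
    lambda r: (r["full_name"][:6], r["dob"], r["birth_place"]),  # substr(full_name,1,6)
    lambda r: (r["dob"], r["birth_place"]),
    lambda r: (r["postcode_fake"],),
]


def candidate_pairs(records, rules):
    """Return the set of ID pairs that agree on at least one blocking key."""
    pairs = set()
    for rule in rules:
        groups = {}
        for rec in records:
            groups.setdefault(rule(rec), []).append(rec["unique_id"])
        for ids in groups.values():
            pairs.update(combinations(sorted(ids), 2))
    return pairs


def levenshtein(a, b):
    """Edit distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def postcode_level(left, right, threshold=2):
    """Assign a pair of postcodes to a similarity level (null handling omitted)."""
    if left == right:
        return "exact match"
    if levenshtein(left, right) <= threshold:
        return f"levenshtein <= {threshold}"
    return "all other comparisons"


pairs = candidate_pairs(records, blocking_rules)
total = len(records) * (len(records) - 1) // 2
print(f"{len(pairs)} of {total} possible pairs pass blocking")
print(postcode_level("tq13 8df", "tq13 8hu"))
```

In this toy data the three Chudleigh rows are paired with one another, while the hypothetical X-1 record shares no blocking key with them and is never compared. The two sample postcodes differ by two substitutions, so they land in the middle similarity level rather than being treated as a plain mismatch.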

Creating a Linker and Estimating Probabilities

# Create a linker object
linker = Linker(df, settings, db_api=DuckDBAPI(), set_up_basic_logging=False)

# Define deterministic rules
deterministic_rules = [
    "l.full_name = r.full_name",
    "l.postcode_fake = r.postcode_fake and l.dob = r.dob",
]

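# Estimate the probability that two random records match, assuming the
# deterministic rules above recover about 60% of true matches (recall=0.6)
linker.training.estimate_probability_two_random_records_match(
    deterministic_rules, recall=0.6
)

# Estimate the u probabilities (chance agreement among non-matching records)
# from up to 2 million randomly sampled record pairs
linker.training.estimate_u_using_random_sampling(max_pairs=2e6)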
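
The arithmetic behind these two training steps can be sketched in plain Python. This is an illustration of the idea rather than Splink's code: the function names are invented here, the count of 60,000 pairs matched by the deterministic rules is a made-up number (not a result from this dataset), and the birth_place values are toy data.

```python
import random


def prior_from_deterministic_rules(n_records, n_rule_matches, recall):
    """Estimate P(two random records match) for a deduplication job."""
    total_pairs = n_records * (n_records - 1) // 2    # all pairwise comparisons
    estimated_true_matches = n_rule_matches / recall  # scale up for missed matches
    return estimated_true_matches / total_pairs


def estimate_u(values, max_pairs, seed=0):
    """Share of randomly sampled record pairs that agree on a column.

    Random pairs are almost always non-matches, so this approximates u.
    """
    rng = random.Random(seed)
    agreements = 0
    for _ in range(max_pairs):
        left, right = rng.sample(range(len(values)), 2)  # two distinct records
        agreements += values[left] == values[right]
    return agreements / max_pairs


# Hypothetical: the rules match 60,000 pairs among 50,000 records
prior = prior_from_deterministic_rules(50_000, 60_000, recall=0.6)
print(f"prior = {prior:.2e}")

# Toy birth_place column (made-up values)
birth_places = ["devon", "london", "kent", "london", "devon", "york", "london", "kent"]
u_birth_place = estimate_u(birth_places, max_pairs=10_000)
print(f"u for an exact match on birth_place is roughly {u_birth_place:.3f}")
```

With these invented inputs, the prior works out to roughly 8 matches per 100,000 random pairs. A lower recall guess scales the estimated number of true matches up, which raises the prior.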

Predicting Matches

# Predict matches with a threshold match probability of 0.9
results = linker.inference.predict(threshold_match_probability=0.9)

Displaying Results

# Display the results as a pandas dataframe
results.as_pandas_dataframe(limit=5)

This returns the first 5 rows of the predictions as a pandas DataFrame. Each row is a scored pair of records, with the match weight, the match probability, and the values of the compared columns from both records.
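
The match weight and match probability are two views of the same score. As a rough sketch of the Fellegi-Sunter arithmetic that this kind of model is based on: each comparison contributes a Bayes factor m/u, the match weight is the log2 of the prior odds plus the log2 of each Bayes factor, and the match probability is that weight converted back from odds. The function names, the prior, and the m and u values below are invented for illustration; they are not parameters estimated from this dataset.

```python
import math


def probability_to_weight(p):
    """Convert a probability to a match weight (log2 of the odds)."""
    return math.log2(p / (1 - p))


def weight_to_probability(w):
    """Convert a match weight back to a probability."""
    odds = 2 ** w
    return odds / (1 + odds)


def match_weight(prior, m_u_pairs):
    """Prior weight plus the log2 Bayes factor (m/u) of each observed level."""
    weight = probability_to_weight(prior)
    for m, u in m_u_pairs:
        weight += math.log2(m / u)
    return weight


# Hypothetical prior and (m, u) values for three observed comparison levels
prior = 1e-4
observed_levels = [(0.9, 0.001), (0.95, 0.0005), (0.8, 0.01)]

w = match_weight(prior, observed_levels)
print(f"match weight = {w:.2f}, match probability = {weight_to_probability(w):.4f}")
print(f"a 0.9 probability threshold as a weight: {probability_to_weight(0.9):.2f}")
```

A weight of 0 corresponds to a probability of 0.5, and each additional unit of weight doubles the odds of a match, which is why a few strongly agreeing fields can overcome a very small prior.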

Link to Splink.
