Splink: Fast and Accurate Probabilistic Record Linkage

Matching and deduplicating records across multiple datasets without unique identifiers is a time-consuming and error-prone process.

Splink solves this problem with probabilistic record linkage, enabling you to deduplicate and link records quickly and accurately.

Key benefits include:

  • Fast processing: Capable of linking 1 million records on a laptop in about a minute
  • High accuracy: Advanced term frequency adjustments and customizable fuzzy matching logic
  • Unsupervised learning: No training data required
  • Interactive outputs: Explore and diagnose linkage issues with intuitive visualizations
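To make the term frequency idea concrete, here is a minimal pure-Python sketch that does not use Splink itself. It assumes the standard Fellegi-Sunter view, in which a comparison contributes a match weight of log2(m/u), and it approximates the adjustment by replacing u with the relative frequency of the agreeing value. The `birth_places` list and the `m_exact` value are made up for illustration and are not taken from the Splink dataset.

```python
import math
from collections import Counter


def match_weight(m: float, u: float) -> float:
    """Partial match weight: log2 of the Bayes factor m / u."""
    return math.log2(m / u)


# Hypothetical column values, chosen only to show common vs. rare terms
birth_places = ["london"] * 80 + ["devon"] * 15 + ["chudleigh"] * 5
counts = Counter(birth_places)
total = len(birth_places)

# Assumed P(values agree | the two records are a true match)
m_exact = 0.9

# Simplified term frequency adjustment: use each value's relative
# frequency as its chance of agreeing between two non-matching records
weights = {
    place: match_weight(m_exact, counts[place] / total)
    for place in counts
}

for place, weight in sorted(weights.items(), key=lambda kv: kv[1]):
    freq = counts[place] / total
    print(f"{place:10s} frequency={freq:.2f} match_weight={weight:.2f}")
```

Agreement on a rare value such as "chudleigh" earns a much larger weight than agreement on a common value such as "london", which is the intuition behind the adjustment.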

Historical People: Quick and Dirty Record Linkage Example

This example demonstrates how to obtain initial record linkage results as quickly as possible using the Splink library.

Importing Libraries and Loading Data

from splink.datasets import splink_datasets
from splink import block_on, SettingsCreator
import splink.comparison_library as cl
from splink import Linker, DuckDBAPI

# Load the historical 50k dataset
df = splink_datasets.historical_50k
df.head(5)
| unique_id | full_name | first_and_surname | first_name | surname | dob | birth_place | postcode_fake | occupation |
|---|---|---|---|---|---|---|---|---|
| Q2296770-1 | thomas clifford, 1st baron clifford of chudleigh | thomas chudleigh | thomas | chudleigh | 1630-08-01 | devon | tq13 8df | politician |
| Q2296770-2 | thomas of chudleigh | thomas chudleigh | thomas | chudleigh | 1630-08-01 | devon | tq13 8df | politician |
| Q2296770-3 | tom 1st baron clifford of chudleigh | tom chudleigh | tom | chudleigh | 1630-08-01 | devon | tq13 8df | politician |
| Q2296770-4 | thomas 1st chudleigh | thomas chudleigh | thomas | chudleigh | 1630-08-01 | devon | tq13 8hu | politician |
| Q2296770-5 | thomas clifford, 1st baron chudleigh | thomas chudleigh | thomas | chudleigh | 1630-08-01 | devon | tq13 8df | politician |

Defining Settings

# Define the settings for the record linkage model
settings = SettingsCreator(
    link_type="dedupe_only",
    blocking_rules_to_generate_predictions=[
        block_on("full_name"),
        block_on("substr(full_name,1,6)", "dob", "birth_place"),
        block_on("dob", "birth_place"),
        block_on("postcode_fake"),
    ],
    comparisons=[
        cl.ForenameSurnameComparison(
            "first_name",
            "surname",
            forename_surname_concat_col_name="first_and_surname",
        ),
        cl.DateOfBirthComparison("dob", input_is_string=True),
        cl.LevenshteinAtThresholds("postcode_fake", 2),
        cl.JaroWinklerAtThresholds("birth_place", 0.9).configure(
            term_frequency_adjustments=True
        ),
        cl.ExactMatch("occupation").configure(term_frequency_adjustments=True),
    ],
)
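The blocking rules in these settings limit which record pairs get compared at all. The sketch below illustrates that idea in plain Python rather than through Splink: records are grouped by the blocking columns, and only records in the same group form candidate pairs. The `records` list and the `candidate_pairs` helper are hypothetical and exist only for this illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records, loosely modelled on the columns of the dataset
records = [
    {"unique_id": 1, "dob": "1630-08-01", "birth_place": "devon"},
    {"unique_id": 2, "dob": "1630-08-01", "birth_place": "devon"},
    {"unique_id": 3, "dob": "1630-08-01", "birth_place": "devon"},
    {"unique_id": 4, "dob": "1701-02-11", "birth_place": "london"},
    {"unique_id": 5, "dob": "1701-02-11", "birth_place": "london"},
    {"unique_id": 6, "dob": "1812-05-30", "birth_place": "york"},
]


def candidate_pairs(rows, *columns):
    """Return the id pairs whose values agree on every blocking column."""
    blocks = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in columns)
        blocks[key].append(row["unique_id"])
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Without blocking, every record is compared with every other record
all_pairs = len(list(combinations(records, 2)))

# Rough equivalent of block_on("dob", "birth_place")
blocked = candidate_pairs(records, "dob", "birth_place")

print(f"all pairs: {all_pairs}, pairs after blocking: {len(blocked)}")
```

When several rules are listed, a pair is kept if it satisfies any one of them, so a typo in one column does not necessarily hide a true match.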

Creating a Linker and Estimating Probabilities

# Create a linker object
linker = Linker(df, settings, db_api=DuckDBAPI(), set_up_basic_logging=False)

# Define deterministic rules
deterministic_rules = [
    "l.full_name = r.full_name",
    "l.postcode_fake = r.postcode_fake and l.dob = r.dob",
]

# Estimate the probability of two random records matching
linker.training.estimate_probability_two_random_records_match(
    deterministic_rules, recall=0.6
)

# Estimate the u probabilities (how often values agree among non-matching records) using random sampling
linker.training.estimate_u_using_random_sampling(max_pairs=2e6)
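The `recall` argument is easier to follow with the arithmetic written out. This sketch assumes the prior is estimated as the number of pairs found by the deterministic rules, scaled up by the assumed recall, then divided by the number of possible pairs in a deduplication job. The function name and the pair count of 60,000 are hypothetical, not values produced by Splink on this dataset.

```python
def prior_from_deterministic_rules(
    n_records: int, n_deterministic_matches: int, recall: float
) -> float:
    """Estimate P(two random records match) from deterministic-rule hits."""
    # Number of distinct pairs when deduplicating a single table
    total_pairs = n_records * (n_records - 1) // 2
    # The rules are assumed to find only a fraction `recall` of true matches
    estimated_true_matches = n_deterministic_matches / recall
    return estimated_true_matches / total_pairs


# Hypothetical count of pairs satisfying the deterministic rules
prior = prior_from_deterministic_rules(
    n_records=50_000, n_deterministic_matches=60_000, recall=0.6
)
print(f"estimated prior: {prior:.2e}")
```

A lower assumed recall implies more true matches than the rules found, which raises the estimated prior.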

Predicting Matches

# Predict matches with a threshold match probability of 0.9
results = linker.inference.predict(threshold_match_probability=0.9)

Displaying Results

# Display the results as a pandas dataframe
results.as_pandas_dataframe(limit=5)

This returns the first 5 rows of the predictions as a pandas DataFrame. Each row is a pair of records that scored above the threshold, along with its match weight, its match probability, and the values of the compared columns.
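The match weight and match probability columns are two views of the same score. As a minimal sketch, assuming the match weight is the base-2 logarithm of the overall odds of a match, the conversion in each direction looks like this. The helper names are made up for illustration.

```python
import math


def weight_to_probability(match_weight: float) -> float:
    """Convert a log2-odds match weight into a probability."""
    bayes_factor = 2.0 ** match_weight
    return bayes_factor / (1.0 + bayes_factor)


def probability_to_weight(probability: float) -> float:
    """Convert a probability into a log2-odds match weight."""
    return math.log2(probability / (1.0 - probability))


# The 0.9 probability threshold used above, expressed as a match weight
threshold_weight = probability_to_weight(0.9)
print(f"threshold as match weight: {threshold_weight:.2f}")

for w in (-5, 0, 5, 10):
    print(f"match_weight={w:>3} -> probability={weight_to_probability(w):.4f}")
```

A weight of 0 corresponds to even odds, and each additional unit of weight doubles the odds that the pair is a match.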

Link to Splink.
