
Choose the Right Text Pattern Tool: Regex, Pregex, or Pyparsing

Table of Contents

Introduction
Dataset Generation
Simple Regex: Basic Pattern Extraction
pregex: Build Readable Patterns
pyparsing: Parse Structured Ticket Headers
Conclusion

Introduction
Imagine you’re analyzing customer support tickets to extract contact information and error details. The tickets contain customer messages with email addresses in various formats and phone numbers with inconsistent formatting: some (555) 123-4567, others 555-123-4567.

ticket_id
message

0
Contact me at nichole70@kemp.com or (798)034-3254 to resolve this issue.

1
You can reach me by phone (970-295-1452) or email (russellbrandon@simon-rogers.org) anytime.

2
My contact details: ehamilton@silva.io and 242 844 7293.

3
Feel free to call 901.794.1337 or email ogarcia@howell-chavez.net for assistance.

How do you extract the email addresses and phone numbers from the tickets?
This article shows three approaches to text pattern matching: regex, pregex, and pyparsing.


Key Takeaways
Here’s what you’ll learn:

Understand when regex patterns are sufficient and when they fall short
Write maintainable text extraction code using pregex’s readable components
Parse structured text with inconsistent formatting using pyparsing

💻 Get the Code: The complete source code and Jupyter notebook for this tutorial are available on GitHub. Clone it to follow along!

Dataset Generation
Let’s create sample datasets that will be used throughout the article. We’ll generate customer support ticket data using the Faker library:
Install Faker:
pip install faker

First, let’s generate customer support tickets with simple contact information:
from faker import Faker
import csv
import pandas as pd

fake = Faker()
Faker.seed(40)

# Define phone patterns
phone_patterns = ["(###)###-####", "###-###-####", "### ### ####", "###.###.####"]

# Define email TLDs
email_tlds = [".com", ".org", ".io", ".net"]

# Generate phone numbers and emails
phones = []
emails = []

for i in range(4):
    # Generate phone with specific pattern
    phone = fake.numerify(text=phone_patterns[i])
    phones.append(phone)

    # Generate email with specific TLD
    email = fake.user_name() + "@" + fake.domain_word() + email_tlds[i]
    emails.append(email)

# Define sentence structures
sentence_structures = [
    lambda p, e: f"Contact me at {e} or {p} to resolve this issue.",
    lambda p, e: f"You can reach me by phone ({p}) or email ({e}) anytime.",
    lambda p, e: f"My contact details: {e} and {p}.",
    lambda p, e: f"Feel free to call {p} or email {e} for assistance.",
]

# Create CSV with 4 rows
with open("data/tickets.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["ticket_id", "message"])

    for i in range(4):
        message = sentence_structures[i](phones[i], emails[i])
        writer.writerow([i, message])

Set the display option to show the full width of the columns:
pd.set_option("display.max_colwidth", None)

Load and preview the tickets dataset:
df_tickets = pd.read_csv("data/tickets.csv")
df_tickets.head()

ticket_id
message

0
Contact me at nichole70@kemp.com or (798)034-3254 to resolve this issue.

1
You can reach me by phone (970-295-1452) or email (russellbrandon@simon-rogers.org) anytime.

2
My contact details: ehamilton@silva.io and 242 844 7293.

3
Feel free to call 901.794.1337 or email ogarcia@howell-chavez.net for assistance.

Simple Regex: Basic Pattern Extraction
Regular expressions (regex) are patterns that match text based on rules. They excel at finding structured data like emails, phone numbers, and dates in unstructured text.
Extract Email Addresses
Start with a simple pattern that matches basic email formats, including:

Username: [a-z]+ – One or more lowercase letters (e.g. maria95)
Separator: @ – Literal @ symbol
Domain: [a-z]+ – One or more lowercase letters (e.g. gmail or outlook)
Dot: \. – Literal dot (escaped)
Extension: (?:org|net|com|io) – Match specific extensions (e.g. .com, .org, .io, .net)

import re

# Match basic email format: letters@domain.extension
email_pattern = r'[a-z]+@[a-z]+\.(?:org|net|com|io)'

df_tickets['emails'] = df_tickets['message'].apply(
    lambda x: re.findall(email_pattern, x)
)

df_tickets[['message', 'emails']].head()

message
emails

0
Contact me at nichole70@kemp.com or (798)034-3254 to resolve this issue.
[]

1
You can reach me by phone (970-295-1452) or email (russellbrandon@simon-rogers.org) anytime.
[]

2
My contact details: ehamilton@silva.io and 242 844 7293.
[ehamilton@silva.io]

3
Feel free to call 901.794.1337 or email ogarcia@howell-chavez.net for assistance.
[]

This pattern works for simple emails but misses variations with:

Other characters in the username such as numbers, dots, underscores, plus signs, or hyphens
Other characters in the domain such as numbers, dots, or hyphens
Other extensions that are not .com, .org, .io, or .net

Let’s expand the pattern to handle more formats:
# Handle emails with numbers, dots, underscores, hyphens, plus signs
improved_email = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

df_tickets['emails_improved'] = df_tickets['message'].apply(
    lambda x: re.findall(improved_email, x)
)

df_tickets[['message', 'emails_improved']].head()

message
emails_improved

0
Contact me at nichole70@kemp.com or (798)034-3254 to resolve this issue.
[nichole70@kemp.com]

1
You can reach me by phone (970-295-1452) or email (russellbrandon@simon-rogers.org) anytime.
[russellbrandon@simon-rogers.org]

2
My contact details: ehamilton@silva.io and 242 844 7293.
[ehamilton@silva.io]

3
Feel free to call 901.794.1337 or email ogarcia@howell-chavez.net for assistance.
[ogarcia@howell-chavez.net]

The improved pattern successfully extracts all emails from the tickets! Let’s move on to extracting phone numbers.
Extract Phone Numbers
Common phone number formats are:

(XXX)XXX-XXXX – With parentheses
XXX-XXX-XXXX – Without parentheses
XXX XXX XXXX – With spaces
XXX.XXX.XXXX – With dots

To handle all four phone formats, we can use the following pattern:

\(? – Optional opening parenthesis
\d{3} – Exactly 3 digits (area code)
\)? – Optional closing parenthesis
[-.\s]? – Optional hyphen, dot, or space
\d{3} – Exactly 3 digits (prefix)
[-.\s] – Hyphen, dot, or space
\d{4} – Exactly 4 digits (line number)

# Define phone pattern
phone_pattern = r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]\d{4}'

df_tickets['phones'] = df_tickets['message'].apply(
    lambda x: re.findall(phone_pattern, x)
)

df_tickets[['message', 'phones']].head()

message
phones

0
Contact me at nichole70@kemp.com or (798)034-3254 to resolve this issue.
[(798)034-3254]

1
You can reach me by phone (970-295-1452) or email (russellbrandon@simon-rogers.org) anytime.
[(970-295-1452]

2
My contact details: ehamilton@silva.io and 242 844 7293.
[242 844 7293]

3
Feel free to call 901.794.1337 or email ogarcia@howell-chavez.net for assistance.
[901.794.1337]

Awesome! We are able to extract all phone numbers from the tickets!
While these patterns work, they are difficult to understand and modify for anyone unfamiliar with regex.
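One standard-library middle ground is re.VERBOSE, which lets you annotate the same pattern inline with comments. A minimal sketch (the sample string below is illustrative, not from the dataset):

```python
import re

# Same phone pattern as above, annotated inline with re.VERBOSE
phone_pattern = re.compile(
    r"""
    \(?          # optional opening parenthesis
    \d{3}        # area code
    \)?          # optional closing parenthesis
    [-.\s]?      # optional hyphen, dot, or space
    \d{3}        # prefix
    [-.\s]       # hyphen, dot, or space
    \d{4}        # line number
    """,
    re.VERBOSE,
)

print(phone_pattern.findall("Call 901.794.1337 or (798)034-3254."))
```

This documents each piece while keeping the exact symbols, though readers still need to know regex syntax.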

📖 Readable code reduces maintenance burden and improves team productivity. Check out Production-Ready Data Science for detailed guidance on writing production-quality code.

In the next section, we will use pregex to build more readable patterns.
pregex: Build Readable Patterns
pregex is a Python library that lets you build regex patterns using readable Python syntax instead of regex symbols. It breaks complex patterns into self-documenting components that clearly express validation logic.
Install pregex:
pip install pregex

Extract Email Addresses
Let’s extract emails using pregex’s readable components.
In the code, we will use the following components:

Username: OneOrMore(AnyButWhitespace()) – One or more non-whitespace characters (e.g. maria95)
Separator: @ – Literal @ symbol
Domain name: OneOrMore(AnyButWhitespace()) – One or more non-whitespace characters (e.g. gmail or outlook)
Extension: Either(".com", ".org", ".io", ".net") – Match specific extensions (.com, .org, .io, .net)

from pregex.core.classes import AnyButWhitespace
from pregex.core.quantifiers import OneOrMore
from pregex.core.operators import Either

username = OneOrMore(AnyButWhitespace())
at_symbol = "@"
domain_name = OneOrMore(AnyButWhitespace())
extension = Either(".com", ".org", ".io", ".net")

email_pattern = username + at_symbol + domain_name + extension

# Extract emails
df_tickets["emails_pregex"] = df_tickets["message"].apply(
    lambda x: email_pattern.get_matches(x)
)

df_tickets[["message", "emails_pregex"]].head()

message
emails_pregex

0
Contact me at nichole70@kemp.com or (798)034-3254 to resolve this issue.
[nichole70@kemp.com]

1
You can reach me by phone (970-295-1452) or email (russellbrandon@simon-rogers.org) anytime.
[(russellbrandon@simon-rogers.org]

2
My contact details: ehamilton@silva.io and 242 844 7293.
[ehamilton@silva.io]

3
Feel free to call 901.794.1337 or email ogarcia@howell-chavez.net for assistance.
[ogarcia@howell-chavez.net]

The output shows that we can extract the emails from the tickets. One rough edge: in row 1, the match includes the leading parenthesis, because AnyButWhitespace() matches ( as well.
pregex transforms pattern matching from symbol decoding into readable code. OneOrMore(username_chars) communicates intent more clearly than [a-zA-Z0-9._%+-]+, reducing the time teammates spend understanding and modifying validation logic.
Extract Phone Numbers
Now extract phone numbers with multiple components:

First three digits: Optional("(") + Exactly(AnyDigit(), 3) + Optional(")")
Separator: Either(" ", "-", ".")
Second three digits: Exactly(AnyDigit(), 3)
Last four digits: Exactly(AnyDigit(), 4)

from pregex.core.classes import AnyDigit
from pregex.core.quantifiers import Optional, Exactly
from pregex.core.operators import Either

# Build phone pattern using pregex
first_three = Optional("(") + Exactly(AnyDigit(), 3) + Optional(")")
separator = Either(" ", "-", ".")
second_three = Exactly(AnyDigit(), 3)
last_four = Exactly(AnyDigit(), 4)

phone_pattern = first_three + Optional(separator) + second_three + separator + last_four

# Extract phone numbers
df_tickets['phones_pregex'] = df_tickets['message'].apply(
    lambda x: phone_pattern.get_matches(x)
)

df_tickets[['message', 'phones_pregex']].head()

message
phones_pregex

0
Contact me at nichole70@kemp.com or (798)034-3254 to resolve this issue.
[(798)034-3254]

1
You can reach me by phone (970-295-1452) or email (russellbrandon@simon-rogers.org) anytime.
[(970-295-1452]

2
My contact details: ehamilton@silva.io and 242 844 7293.
[242 844 7293]

3
Feel free to call 901.794.1337 or email ogarcia@howell-chavez.net for assistance.
[901.794.1337]

If your system requires the raw regex pattern, you can get it with get_compiled_pattern():
print("Compiled email pattern:", email_pattern.get_compiled_pattern().pattern)
print("Compiled phone pattern:", phone_pattern.get_compiled_pattern().pattern)

Compiled email pattern: \S+@\S+(?:\.com|\.org|\.io|\.net)
Compiled phone pattern: \(?\d{3}\)?(?: |-|\.)?\d{3}(?: |-|\.)\d{4}
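Because these are standard patterns, the strings printed above can be dropped into plain re code on systems where pregex is not installed. A small sketch reusing the compiled patterns (the sample message comes from the dataset above):

```python
import re

# Raw patterns emitted by pregex, copied from the output above
email_re = re.compile(r"\S+@\S+(?:\.com|\.org|\.io|\.net)")
phone_re = re.compile(r"\(?\d{3}\)?(?: |-|\.)?\d{3}(?: |-|\.)\d{4}")

msg = "My contact details: ehamilton@silva.io and 242 844 7293."
print(email_re.findall(msg))
print(phone_re.findall(msg))
```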

For more pregex examples including URLs and time patterns, see PRegEx: Write Human-Readable Regular Expressions in Python.

Parse Structured Ticket Headers
Now let’s tackle a more complex task: parsing structured ticket headers that contain multiple fields:
Ticket: 1000 | Priority: High | Assigned: John Doe # escalated

We will use Capture to extract just the values we need from each ticket:
from pregex.core.quantifiers import OneOrMore
from pregex.core.classes import AnyDigit, AnyLetter, AnyWhitespace
from pregex.core.groups import Capture

sample_ticket = "Ticket: 1000 | Priority: High | Assigned: John Doe # escalated"

# Define patterns with Capture to extract just the values
whitespace = AnyWhitespace()
ticket_id_pattern = "Ticket:" + whitespace + Capture(OneOrMore(AnyDigit()))
priority_pattern = "Priority:" + whitespace + Capture(OneOrMore(AnyLetter()))
name_pattern = (
    "Assigned:"
    + whitespace
    + Capture(OneOrMore(AnyLetter()) + " " + OneOrMore(AnyLetter()))
)

# Define separator pattern (whitespace around pipe)
separator = whitespace + "|" + whitespace

# Combine all patterns with separators
ticket_pattern = (
    ticket_id_pattern
    + separator
    + priority_pattern
    + separator
    + name_pattern
)

Next, define a function to extract the ticket components from the captured groups:

def get_ticket_components(ticket_string, ticket_pattern):
    """Extract ticket components from a ticket string."""
    try:
        captures = ticket_pattern.get_captures(ticket_string)[0]
        return pd.Series(
            {
                "ticket_id": captures[0],
                "priority": captures[1],
                "assigned": captures[2],
            }
        )
    except IndexError:
        return pd.Series(
            {"ticket_id": None, "priority": None, "assigned": None}
        )

Apply the function with the pattern defined above to the sample ticket.

components = get_ticket_components(sample_ticket, ticket_pattern)
print(components.to_dict())

{'ticket_id': '1000', 'priority': 'High', 'assigned': 'John Doe'}

This looks good! Let’s apply it to ticket headers with inconsistent whitespace around the separators. Start by creating the dataset:
import pandas as pd

# Create tickets with embedded comments and variable whitespace
tickets = [
    "Ticket: 1000 | Priority: High | Assigned: John Doe # escalated",
    "Ticket: 1001 | Priority: Medium | Assigned: Maria Garcia # team lead",
    "Ticket:1002| Priority:Low |Assigned:Alice Smith # non-urgent",
    "Ticket: 1003 | Priority: High | Assigned: Bob Johnson # on-call",
]

df_tickets = pd.DataFrame({'ticket': tickets})
df_tickets.head()

ticket

0
Ticket: 1000 | Priority: High | Assigned: John Doe # escalated

1
Ticket: 1001 | Priority: Medium | Assigned: Maria Garcia # team lead

2
Ticket:1002| Priority:Low |Assigned:Alice Smith # non-urgent

3
Ticket: 1003 | Priority: High | Assigned: Bob Johnson # on-call

# Extract individual components using the function
df_pregex = df_tickets.copy()
components_df = df_pregex["ticket"].apply(get_ticket_components, ticket_pattern=ticket_pattern)

df_pregex = df_pregex.assign(**components_df)

df_pregex[["ticket_id", "priority", "assigned"]].head()

ticket_id
priority
assigned

0
1000
High
John Doe

1
None
None
None

2
None
None
None

3
1003
High
Bob Johnson

We can see that pregex misses Tickets 1 and 2 because AnyWhitespace() matches exactly one whitespace character, while those rows use zero or multiple spaces around the separators.
Making pregex patterns flexible enough for variable formatting requires adding optional quantifiers to the whitespace pattern so that it can match zero or more spaces around the separators.
As these fixes accumulate, pregex’s readability advantage diminishes, and you end up with code that’s as hard to understand as raw regex but more verbose.
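For comparison, the raw-regex version of that fix is a flexible \s* around every separator, which works but buries the field structure in symbols. A hedged sketch (this pattern is an illustration, not code from the article):

```python
import re

# Allow any amount of whitespace around pipes and after colons
header_re = re.compile(
    r"Ticket:\s*(\d+)\s*\|\s*"
    r"Priority:\s*([A-Za-z]+)\s*\|\s*"
    r"Assigned:\s*([A-Za-z]+ [A-Za-z]+)"
)

m = header_re.search("Ticket:1002| Priority:Low |Assigned:Alice Smith # non-urgent")
print(m.groups())
```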
When parsing structured data with consistent patterns but varying details, pyparsing provides more robust handling than regex.
pyparsing: Parse Structured Ticket Headers
Unlike regex’s pattern matching approach, pyparsing lets you define grammar rules using Python classes, making parsing logic explicit and maintainable.
Install pyparsing:
pip install pyparsing

Let’s parse the complete structure with pyparsing, including:

Ticket ID: Word(nums) – One or more digits (e.g. 1000)
Priority: Word(alphas) – One or more letters (e.g. High)
Name: Word(alphas) + Word(alphas) – First and last name (e.g. John Doe)

We will also use pythonStyleComment to ignore Python-style comments throughout parsing.
from pyparsing import Word, alphas, nums, Literal, pythonStyleComment

# Define grammar components
ticket_num = Word(nums)
priority = Word(alphas)
name = Word(alphas) + Word(alphas)

# Define complete structure
ticket_grammar = (
    "Ticket:"
    + ticket_num
    + "|"
    + "Priority:"
    + priority
    + "|"
    + "Assigned:"
    + name
)

# Automatically ignore Python-style comments throughout parsing
ticket_grammar.ignore(pythonStyleComment)

sample_ticket = "Ticket: 1000 | Priority: High | Assigned: John Doe # escalated"
sample_result = ticket_grammar.parse_string(sample_ticket)
print(sample_result)

['Ticket:', '1000', '|', 'Priority:', 'High', '|', 'Assigned:', 'John', 'Doe']

Awesome! We are able to extract the ticket components from the ticket with a much simpler pattern!
Compare this to the pregex implementation:
ticket_pattern = (
    "Ticket:" + whitespace + Capture(OneOrMore(AnyDigit()))
    + whitespace + "|" + whitespace
    + "Priority:" + whitespace + Capture(OneOrMore(AnyLetter()))
    + whitespace + "|" + whitespace
    + "Assigned:"
    + whitespace
    + Capture(OneOrMore(AnyLetter()) + " " + OneOrMore(AnyLetter()))
)

We can see that pyparsing handles structured data better than pregex for the following reasons:

No whitespace boilerplate: pyparsing handles spacing automatically while pregex requires + whitespace + between every component
Self-documenting: Word(alphas) clearly means “letters” while pregex’s nested Capture(OneOrMore(AnyLetter())) is less readable

To extract ticket components, assign names using () syntax and access them via dot notation:
# Define complete structure
ticket_grammar = (
    "Ticket:"
    + ticket_num("ticket_id")
    + "|"
    + "Priority:"
    + priority("priority")
    + "|"
    + "Assigned:"
    + name("assigned")
)

# Automatically ignore Python-style comments throughout parsing
ticket_grammar.ignore(pythonStyleComment)

sample_ticket = "Ticket: 1000 | Priority: High | Assigned: John Doe # escalated"
sample_result = ticket_grammar.parse_string(sample_ticket)

# Access the components by name
print(
    f"Ticket ID: {sample_result.ticket_id}",
    f"Priority: {sample_result.priority}",
    f"Assigned: {' '.join(sample_result.assigned)}",
)

Ticket ID: 1000 Priority: High Assigned: John Doe

Let’s apply this to the entire dataset.
# Parse all tickets and create columns
def parse_ticket(ticket, ticket_grammar):
    result = ticket_grammar.parse_string(ticket)
    return pd.Series(
        {
            "ticket_id": result.ticket_id,
            "priority": result.priority,
            "assigned": " ".join(result.assigned),
        }
    )

df_pyparsing = df_tickets.copy()
components_df_pyparsing = df_pyparsing["ticket"].apply(parse_ticket, ticket_grammar=ticket_grammar)
df_pyparsing = df_pyparsing.assign(**components_df_pyparsing)

df_pyparsing[["ticket_id", "priority", "assigned"]].head()

ticket_id
priority
assigned

0
1000
High
John Doe

1
1001
Medium
Maria Garcia

2
1002
Low
Alice Smith

3
1003
High
Bob Johnson

The output looks good!
Let’s try to parse some more structured data with pyparsing.
Extract Code Blocks from Markdown
Use SkipTo to extract Python code between code block markers without complex regex patterns like r'```python(.*?)```':
from pyparsing import Literal, SkipTo

code_start = Literal("```python")
code_end = Literal("```")

code_block = code_start + SkipTo(code_end)("code") + code_end

markdown = """```python
def hello():
    print("world")
```"""

result = code_block.parse_string(markdown)
print(result.code)

def hello():
    print("world")
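For reference, the raw-regex equivalent of this extraction needs re.DOTALL so that .*? can match across newlines, a detail SkipTo handles automatically. A minimal sketch (the fence string is built up to keep this example self-contained):

```python
import re

fence = "`" * 3  # a literal triple backtick

markdown = fence + 'python\ndef hello():\n    print("world")\n' + fence

# Non-greedy capture between the fence markers; DOTALL lets . match newlines
match = re.search(fence + r"python(.*?)" + fence, markdown, re.DOTALL)
print(match.group(1).strip())
```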

Parse Nested Structures
nested_expr handles arbitrary nesting depth, which regex fundamentally cannot parse:
from pyparsing import nested_expr

# Default: parentheses
nested_list = nested_expr()
result = nested_list.parse_string("((2 + 3) * (4 - 1))")
print(result.as_list())

[[['2', '+', '3'], '*', ['4', '-', '1']]]

Conclusion
How do you know when to use each tool? Choose based on your needs:
Use simple regex when:

Extracting simple, well-defined patterns (emails, phone numbers with consistent format)
Pattern won’t need frequent modifications

Use pregex when:

Pattern has multiple variations (different phone number formats)
Need to document pattern logic through readable code

Use pyparsing when:

Need to extract multiple fields from structured text (ticket headers, configuration files)
Must handle variable formatting (inconsistent whitespace, embedded comments)

In summary, start with simple regex, adopt pregex when readability matters, and switch to pyparsing when structure becomes complex.
Related Tutorials
Here are some related text processing tools:

Text similarity matching: 4 Text Similarity Tools: When Regex Isn’t Enough compares regex preprocessing, difflib, RapidFuzz, and Sentence Transformers for matching product names and handling data variations
Business entity extraction: langextract vs spaCy: AI-Powered vs Rule-Based Entity Extraction evaluates regex, spaCy, GLiNER, and langextract for extracting structured information from financial documents

📚 Want to go deeper? Learning new techniques is the easy part. Knowing how to structure, test, and deploy them is what separates side projects from real work. My book shows you how to build data science projects that actually make it to production. Get the book →


4 Text Similarity Tools: When Regex Isn’t Enough

Table of Contents

Introduction
Text Preprocessing with regex
difflib: Python’s Built-in Sequence Matching
RapidFuzz: High-Performance Fuzzy String Matching
Sentence Transformers: AI-Powered Semantic Similarity
When to Use Each Tool
Final Thoughts

Introduction
Text similarity is a fundamental challenge in data science. Whether you’re detecting duplicates, clustering content, or building search systems, the core question remains: how do you determine when different text strings represent the same concept?
Traditional exact matching fails with real-world data. Consider these common text similarity challenges:

Formatting variations: “iPhone® 14 Pro Max” vs “IPHONE 14 pro max” – identical products with different capitalization and symbols.
Missing spaces: “iPhone14ProMax” vs “iPhone 14 Pro Max” – same product name, completely different character sequences.
Extra information: “Apple iPhone 14 Pro Max 256GB” vs “iPhone 14 Pro Max” – additional details that obscure the core product.
Semantic equivalence: “wireless headphones” vs “bluetooth earbuds” – different words describing similar concepts.

These challenges require different approaches:

Regex preprocessing cleans formatting inconsistencies
difflib provides character-level similarity scoring
RapidFuzz handles fuzzy matching at scale
Sentence Transformers understands semantic relationships


Key Takeaways
Here’s what you’ll learn:

Handle 90% of text variations with regex preprocessing and RapidFuzz matching
Achieve 5× faster fuzzy matching compared to difflib with production-grade algorithms
Unlock semantic understanding with Sentence Transformers for conceptual similarity
Navigate decision trees from simple string matching to AI-powered text analysis
Implement scalable text similarity pipelines for real-world data challenges

Text Preprocessing with regex
Raw text data contains special characters, inconsistent capitalization, and formatting variations. Regular expressions provide the first line of defense by normalizing text.
These pattern-matching tools, accessed through Python’s re module, excel at finding and replacing text patterns like symbols, whitespace, and formatting inconsistencies.
Let’s start with a realistic dataset that demonstrates common text similarity challenges:
import re

# Sample messy text data
messy_products = [
    "iPhone® 14 Pro Max",
    "IPHONE 14 pro max",
    "Apple iPhone 14 Pro Max 256GB",
    "iPhone14ProMax",
    "i-Phone 14 Pro Max",
    "Samsung Galaxy S23 Ultra",
    "SAMSUNG Galaxy S23 Ultra 5G",
    "Galaxy S23 Ultra (512GB)",
    "Samsung S23 Ultra",
    "wireless headphones",
    "bluetooth earbuds",
    "Sony WH-1000XM4 Headphones",
    "WH-1000XM4 Wireless Headphones",
]

With our test data established, we can build a comprehensive preprocessing function to handle these variations:
def preprocess_product_name(text):
    """Clean product names for better similarity matching."""
    # Convert to lowercase
    text = text.lower()

    # Remove size/capacity info in parentheses
    # (must run before the symbol cleanup below, which would erase the parentheses)
    text = re.sub(r"\([^)]*\)", "", text)

    # Remove special characters and symbols
    text = re.sub(r"[®™©]", "", text)
    text = re.sub(r"[^\w\s-]", " ", text)

    # Normalize spaces and hyphens
    text = re.sub(r"[-_]+", " ", text)
    text = re.sub(r"\s+", " ", text)

    return text.strip()

📖 Related: These regex patterns use traditional syntax for maximum compatibility. For more readable pattern construction, explore PRegEx for human-friendly regex syntax (https://codecut.ai/pregex-write-human-readable-regular-expressions-in-python-2/).

# Apply preprocessing to sample data
print("Before and after preprocessing:")
print("-" * 50)
for product in messy_products[:8]:
    cleaned = preprocess_product_name(product)
    print(f"Original: {product}")
    print(f"Cleaned: {cleaned}")
    print()

Output:
Before and after preprocessing:
--------------------------------------------------
Original: iPhone® 14 Pro Max
Cleaned: iphone 14 pro max

Original: IPHONE 14 pro max
Cleaned: iphone 14 pro max

Original: Apple iPhone 14 Pro Max 256GB
Cleaned: apple iphone 14 pro max 256gb

Original: iPhone14ProMax
Cleaned: iphone14promax

Original: i-Phone 14 Pro Max
Cleaned: i phone 14 pro max

Original: Samsung Galaxy S23 Ultra
Cleaned: samsung galaxy s23 ultra

Original: SAMSUNG Galaxy S23 Ultra 5G
Cleaned: samsung galaxy s23 ultra 5g

Original: Galaxy S23 Ultra (512GB)
Cleaned: galaxy s23 ultra

Perfect matches emerge after cleaning formatting inconsistencies. Products 1 and 2 now match exactly, demonstrating regex’s power for standardization.
However, regex preprocessing fails with critical variations. Let’s test exact matching after preprocessing:
# Test exact matching after regex preprocessing
test_cases = [
    ("iPhone® 14 Pro Max", "IPHONE 14 pro max", "Case + symbols"),
    ("iPhone® 14 Pro Max", "Apple iPhone 14 Pro Max 256GB", "Extra words"),
    ("iPhone® 14 Pro Max", "iPhone14ProMax", "Missing spaces"),
    ("Apple iPhone 14 Pro Max", "iPhone 14 Pro Max Apple", "Word order"),
    ("wireless headphones", "bluetooth earbuds", "Semantic gap"),
]

# Test each case
for product1, product2, issue_type in test_cases:
    cleaned1 = preprocess_product_name(product1)
    cleaned2 = preprocess_product_name(product2)
    is_match = cleaned1 == cleaned2
    result = "✓" if is_match else "✗"
    print(f"{result} {issue_type}: {is_match}")

Output:
✓ Case + symbols: True
✗ Extra words: False
✗ Missing spaces: False
✗ Word order: False
✗ Semantic gap: False

Regex achieves only 1/5 exact matches despite preprocessing. Success: case and symbol standardization. Failures:

Extra words: “apple iphone” vs “iphone” remain different
Missing spaces: “iphone14promax” vs “iphone 14 pro max” fail matching
Word reordering: Different arrangements of identical words don’t match
Semantic gaps: No shared text patterns between conceptually similar products

These limitations require character-level similarity measurement instead of exact matching. Python’s built-in difflib module provides the solution by analyzing character sequences and calculating similarity ratios.
difflib: Python’s Built-in Sequence Matching
difflib is a built-in Python module that compares character sequences and returns a similarity ratio between two strings.
from difflib import SequenceMatcher

def calculate_similarity(text1, text2):
    """Calculate similarity ratio between two strings."""
    return SequenceMatcher(None, text1, text2).ratio()

# Test difflib on key similarity challenges
test_cases = [
    ("iphone 14 pro max", "iphone 14 pro max", "Exact match"),
    ("iphone 14 pro max", "i phone 14 pro max", "Spacing variation"),
    ("iphone 14 pro max", "apple iphone 14 pro max 256gb", "Extra words"),
    ("iphone 14 pro max", "iphone14promax", "Missing spaces"),
    ("iphone 14 pro max", "iphone 14 prro max", "Typo"),
    ("apple iphone 14 pro max", "iphone 14 pro max apple", "Word order"),
    ("wireless headphones", "bluetooth earbuds", "Semantic gap"),
]

for text1, text2, test_type in test_cases:
    score = calculate_similarity(text1, text2)
    result = "✓" if score >= 0.85 else "✗"
    print(f"{result} {test_type}: {score:.3f}")

Output:
✓ Exact match: 1.000
✓ Spacing variation: 0.971
✗ Extra words: 0.739
✓ Missing spaces: 0.903
✓ Typo: 0.971
✗ Word order: 0.739
✗ Semantic gap: 0.333

difflib achieves 4/7 successful matches (≥0.85 threshold). Successes: exact matches, spacing variations, typos, and missing spaces. Failures:

Word reordering: “Apple iPhone” vs “iPhone Apple” drops to 0.739
Extra content: Additional words reduce scores to 0.739
Semantic gaps: Different words for same concept score only 0.333

These results highlight difflib’s core limitation: sensitivity to word order and poor handling of extra content. RapidFuzz tackles word reordering and extra content issues with sophisticated matching algorithms that understand token relationships beyond simple character comparison.
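To see why token-aware matching helps, here is a minimal sketch of the token-sort idea behind scorers like RapidFuzz's `token_sort_ratio`, built only on difflib. This is an illustration of the concept, not RapidFuzz's actual implementation:

```python
from difflib import SequenceMatcher

def token_sort_similarity(text1, text2):
    """Sort words alphabetically before comparing, so word order stops mattering."""
    sorted1 = " ".join(sorted(text1.split()))
    sorted2 = " ".join(sorted(text2.split()))
    return SequenceMatcher(None, sorted1, sorted2).ratio()

# The word-order pair that plain difflib scored at 0.739 now matches perfectly,
# because both strings sort to the same token sequence
print(token_sort_similarity("apple iphone 14 pro max", "iphone 14 pro max apple"))  # 1.0
```

Sorting tokens fixes reordering but still cannot bridge semantic gaps, which is why RapidFuzz's results below improve on difflib everywhere except the last two cases.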
RapidFuzz: High-Performance Fuzzy String Matching
RapidFuzz is a high-performance fuzzy string matching library with C++ optimization. It addresses word reordering and complex text variations that difflib cannot handle effectively.
To install RapidFuzz, run:
pip install rapidfuzz

Let’s test RapidFuzz on the same test cases:
from rapidfuzz import fuzz

# Test RapidFuzz using WRatio algorithm
test_cases = [
    ("iphone 14 pro max", "iphone 14 pro max", "Exact match"),
    ("iphone 14 pro max", "i phone 14 pro max", "Spacing variation"),
    ("iphone 14 pro max", "apple iphone 14 pro max 256gb", "Extra words"),
    ("iphone 14 pro max", "iphone14promax", "Missing spaces"),
    ("iphone 14 pro max", "iphone 14 prro max", "Typo"),
    ("apple iphone 14 pro max", "iphone 14 pro max apple", "Word order"),
    ("wireless headphones", "bluetooth earbuds", "Semantic gap"),
    ("macbook pro", "laptop computer", "Conceptual gap"),
]

for text1, text2, test_type in test_cases:
    score = fuzz.WRatio(text1, text2) / 100  # Convert to 0-1 scale
    result = "✓" if score >= 0.85 else "✗"
    print(f"{result} {test_type}: {score:.3f}")

Output:
✓ Exact match: 1.000
✓ Spacing variation: 0.971
✓ Extra words: 0.900
✓ Missing spaces: 0.903
✓ Typo: 0.971
✓ Word order: 0.950
✗ Semantic gap: 0.389
✗ Conceptual gap: 0.385

RapidFuzz achieves 6/8 successful matches (≥0.85 threshold). Successes: exact matches, spacing, extra words, missing spaces, typos, and word order. Failures:

Semantic gaps: “wireless headphones” vs “bluetooth earbuds” scores only 0.389
Conceptual relationships: “macbook pro” vs “laptop computer” achieves just 0.385
Pattern-only matching: Cannot understand that different words describe same products

These failures reveal RapidFuzz’s fundamental limitation: it excels at text-level variations but cannot understand meaning. When products serve identical purposes using different terminology, we need semantic understanding rather than pattern matching.
Sentence Transformers addresses this gap through neural language models that comprehend conceptual relationships.
Sentence Transformers: AI-Powered Semantic Similarity
Surface-level text matching misses semantic relationships. Sentence Transformers, a library built on transformer neural networks, can understand that “wireless headphones” and “bluetooth earbuds” serve identical purposes by analyzing meaning rather than just character patterns.
To install Sentence Transformers, run:
pip install sentence-transformers

Let’s test Sentence Transformers on the same test cases:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Test semantic understanding capabilities
model = SentenceTransformer('all-MiniLM-L6-v2')

test_cases = [
    ("iphone 14 pro max", "iphone 14 pro max", "Exact match"),
    ("iphone 14 pro max", "i phone 14 pro max", "Spacing variation"),
    ("iphone 14 pro max", "apple iphone 14 pro max 256gb", "Extra words"),
    ("apple iphone 14 pro max", "iphone 14 pro max apple", "Word order"),
    ("wireless headphones", "bluetooth earbuds", "Semantic match"),
    ("macbook pro", "laptop computer", "Conceptual match"),
    ("gaming console", "video game system", "Synonym match"),
    ("smartphone", "feature phone", "Related concepts"),
]

for text1, text2, test_type in test_cases:
    embeddings = model.encode([text1, text2])
    score = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
    result = "✓" if score >= 0.65 else "✗"
    print(f"{result} {test_type}: {score:.3f}")

Output:
✓ Exact match: 1.000
✓ Spacing variation: 0.867
✓ Extra words: 0.818
✓ Word order: 0.988
✗ Semantic match: 0.618
✓ Conceptual match: 0.652
✓ Synonym match: 0.651
✗ Related concepts: 0.600

Sentence Transformers achieves 6/8 successful matches (≥0.65 threshold). Successes: all text variations plus the conceptual and synonym relationships. Limitations:

Borderline semantics: “wireless headphones” vs “bluetooth earbuds” (0.618) and “smartphone” vs “feature phone” (0.600) fall just below the threshold
Processing overhead: Neural inference requires significantly more computation than string algorithms
Memory requirements: Models need substantial RAM (100MB+ for basic models, GBs for advanced ones)
Resource scaling: Large datasets may require GPU acceleration for reasonable performance

Sentence Transformers unlocks semantic understanding at computational cost. The decision depends on whether conceptual relationships provide sufficient business value to justify resource overhead.
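Under the hood, `cosine_similarity` compares embedding vectors by the angle between them, not their magnitude. A minimal pure-Python version makes the calculation concrete (the toy 3-dimensional vectors are illustrative; all-MiniLM-L6-v2 actually produces 384-dimensional embeddings):

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Vectors pointing the same direction score near 1; orthogonal vectors score 0
print(round(cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]), 3))  # 1.0
print(round(cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]), 3))  # 0.0
```

Because the score depends only on direction, two embeddings encoding similar meaning land close together regardless of how long or wordy the original strings were.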
For implementing semantic search at production scale, see our pgvector and Ollama integration guide.
When to Use Each Tool
Data Preprocessing (Always Start Here)
Use regex for:

Removing special characters and symbols
Standardizing case and formatting
Cleaning messy product names
Preparing text for similarity analysis

Character-Level Similarity
Use difflib when:

Learning text similarity concepts
Working with small datasets (<1000 records)
External dependencies not allowed
Simple typo detection is sufficient

Production Fuzzy Matching
Use RapidFuzz when:

Processing thousands of records
Need fast approximate matching
Handling abbreviations and variations
Text-level similarity is sufficient

Semantic Understanding
Use Sentence Transformers when:

Conceptual relationships matter
“wireless headphones” should match “bluetooth earbuds”
Building recommendation systems
Multilingual content similarity
Compute resources are available

Performance vs Accuracy Tradeoff

| Requirement | Recommended Tool |
| --- | --- |
| Speed > Accuracy | RapidFuzz |
| Accuracy > Speed | Sentence Transformers |
| No Dependencies | difflib |
| Preprocessing Only | regex |

Decision Tree
When facing a new text similarity project, use this visual guide to navigate from problem requirements to the optimal tool selection:
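The same decision logic can be sketched as a small helper function (the function name, parameters, and thresholds are illustrative choices based on the guidelines above, not part of any library):

```python
def pick_similarity_tool(needs_semantics, record_count, allow_dependencies):
    """Map project constraints to a similarity tool, following the guide above."""
    if needs_semantics:
        # Conceptual relationships require embedding-based comparison
        return "sentence-transformers"
    if not allow_dependencies or record_count < 1000:
        # Built-in difflib handles small datasets with zero installs
        return "difflib"
    # Fast token-aware fuzzy matching for larger text-level workloads
    return "rapidfuzz"

print(pick_similarity_tool(needs_semantics=False, record_count=50_000, allow_dependencies=True))  # rapidfuzz
```

Whichever branch you land on, regex preprocessing still comes first: every tool scores cleaner text more reliably.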

Final Thoughts
When facing complex challenges, start with the most basic solution first, identify where it fails through testing, then strategically upgrade the failing component. This article demonstrates exactly this progression – from simple regex preprocessing to sophisticated semantic understanding.
Build complexity incrementally based on real limitations, not anticipated ones.

📚 For comprehensive production-ready data science practices, check out Production-Ready Data Science.

📚 Want to go deeper? Learning new techniques is the easy part. Knowing how to structure, test, and deploy them is what separates side projects from real work. My book shows you how to build data science projects that actually make it to production. Get the book →

Work with Khuyen Tran