PRegEx: Write Human-Readable Regular Expressions in Python

Motivation

Imagine you are trying to find all URLs inside a text. Each of these URLs must:

  • Start with either http:// or https:// or the domain of the URL
  • Contain either .com or .org, followed by at least one more non-whitespace character (such as /)

You might end up writing a complicated regular expression (RegEx) like the one below:

import re

text = """You can find me through my
website mathdatasimplified.com/ or
GitHub https://github.com/khuyentran1401"""

# Raw string so the backslashes reach the regex engine unchanged
re.findall(
    r"(?:https?://)?[^\s]+(?:\.com|\.org)[^\s]+",
    text
)
"""
[
    'mathdatasimplified.com/',
    'https://github.com/khuyentran1401'
]
"""

This RegEx is difficult to read and create. Is there a way that you can write a more human-readable RegEx with ease?

That is when PRegEx comes in handy.

What is PRegEx?

PRegEx is a Python package that allows you to construct RegEx patterns in a more human-friendly way.

To install PRegEx, type:

pip install pregex

The version of PRegEx that will be used in this article is 2.0.1:

pip install pregex==2.0.1

To learn how to use PRegEx, let’s start with some examples.

Capture URLs

Get a Simple URL

First, let’s try to get a URL in a text using PRegEx.

from pregex.core.classes import AnyButWhitespace
from pregex.core.quantifiers import OneOrMore
from pregex.core.operators import Either

text = "You can find me through GitHub https://github.com/khuyentran1401"

pre = (
    "https://"
    + OneOrMore(AnyButWhitespace())
    + Either(".com", ".org")
    + OneOrMore(AnyButWhitespace())
)
pre.get_matches(text)

Output:

['https://github.com/khuyentran1401']

In the code above, we use:

  • AnyButWhitespace() to match any character except whitespace characters.
  • OneOrMore() to match the provided pattern one or more times.
  • Either() to match any one of the provided patterns.

Specifically,

  • OneOrMore(AnyButWhitespace()) matches one or more characters that are not whitespace characters.
  • Either(".com", ".org") matches either .com or .org.
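
For comparison, here is a minimal sketch of how these building blocks line up with hand-written regex fragments. It uses only the standard re module, not PRegEx, and the fragment strings and variable names are assumptions made for illustration rather than PRegEx's actual output.

```python
import re

# Hand-written fragments assumed to correspond to the PRegEx building blocks.
one_or_more_non_whitespace = r"\S+"   # OneOrMore(AnyButWhitespace())
com_or_org = r"(?:\.com|\.org)"       # Either(".com", ".org")

# Concatenate the fragments in the same order as the PRegEx pattern.
pattern = (
    "https://"
    + one_or_more_non_whitespace
    + com_or_org
    + one_or_more_non_whitespace
)

text = "You can find me through GitHub https://github.com/khuyentran1401"
matches = re.findall(pattern, text)
print(matches)  # ['https://github.com/khuyentran1401']
```

The PRegEx version spells out the same structure with named classes, which is what makes it easier to read at a glance.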

HTTP or HTTPS

Sometimes, a URL might use the scheme http instead of https . Let’s make the character s optional by using Optional() :


from pregex.core.quantifiers import Optional

text = "You can find me through GitHub http://github.com/khuyentran1401"

pre = (
    "http"
    + Optional("s")
    + "://"
    + OneOrMore(AnyButWhitespace())
    + Either(".com", ".org")
    + OneOrMore(AnyButWhitespace())
)
pre.get_matches(text)

Output:

['http://github.com/khuyentran1401']
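
To see the effect of an optional character in isolation, the sketch below applies a hand-written equivalent to both schemes. It relies on the standard re module only. The pattern string is an assumption about what Optional("s") amounts to (the s? quantifier), not output copied from PRegEx.

```python
import re

# "s?" makes the letter s optional, so one pattern covers http and https.
pattern = r"https?://\S+(?:\.com|\.org)\S+"

texts = [
    "You can find me through GitHub http://github.com/khuyentran1401",
    "You can find me through GitHub https://github.com/khuyentran1401",
]

# Collect the matches for each text separately.
results = [re.findall(pattern, t) for t in texts]
for found in results:
    print(found)
```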

Match URL without a Scheme

Some URLs in a text might not include a scheme such as https or http . Let’s make the scheme optional with Optional() .

To make our code more readable, we will assign each PRegEx pattern to a descriptively named variable.

text = "You can find me through my website mathdatasimplified.com/ or GitHub https://github.com/khuyentran1401"

at_least_one_character_except_white_space = OneOrMore(AnyButWhitespace())
optional_scheme = Optional("http" + Optional("s") + "://")
domain_choice = Either(".com", ".org")

pre = (
    optional_scheme
    + at_least_one_character_except_white_space
    + domain_choice
    + at_least_one_character_except_white_space
)
pre.get_matches(text)

Output:

['mathdatasimplified.com/', 'https://github.com/khuyentran1401']

Let’s take a look at the underlying RegEx pattern:

pre.get_pattern()
(?:https?\:\/\/)?[^\s]+(?:\.com|\.org)[^\s]+

We have just avoided creating a complicated pattern with some human-readable lines of code!
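
Because get_pattern() returns an ordinary pattern string, it can be passed to Python's built-in re module. The sketch below copies the pattern shown above by hand and checks that re.findall returns the same matches. The variable names are illustrative.

```python
import re

# Pattern string copied from the get_pattern() output shown above.
generated_pattern = r"(?:https?\:\/\/)?[^\s]+(?:\.com|\.org)[^\s]+"
compiled = re.compile(generated_pattern)

text = (
    "You can find me through my website mathdatasimplified.com/ "
    "or GitHub https://github.com/khuyentran1401"
)

# No capturing groups are present, so findall returns the full matches.
matches = compiled.findall(text)
print(matches)  # ['mathdatasimplified.com/', 'https://github.com/khuyentran1401']
```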

Capture Time

AnyDigit() matches any numeric character. Let’s use this to match a time in a text.

from pregex.core.classes import AnyDigit

text = "It is 6:00 pm now"
pre = AnyDigit() + ":" + AnyDigit()
pre.get_matches(text)
['6:0']

Right now, we only match one digit on either side of : . Let’s make this more general by wrapping OneOrMore() around AnyDigit() :

pre = OneOrMore(AnyDigit()) + ":" + OneOrMore(AnyDigit())
pre.get_matches(text)
['6:00']
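
The difference between one digit and one or more digits can also be checked with the standard re module. This is a sketch with hand-written patterns, assuming \d and \d+ as the plain-regex counterparts of AnyDigit() and OneOrMore(AnyDigit()).

```python
import re

text = "It is 6:00 pm now"

# Exactly one digit on each side of the colon.
single_digit = re.findall(r"\d:\d", text)

# One or more digits on each side of the colon.
one_or_more_digits = re.findall(r"\d+:\d+", text)

print(single_digit)        # ['6:0']
print(one_or_more_digits)  # ['6:00']
```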

Capture Phone Numbers

Common formats for a phone number are:

##########
###-###-####
### ### ####
###.###.####

These formats either have punctuation or nothing between numbers. We can use AnyFrom("-", " ", ".") to match either - , . , or space.

We also use Optional() to make punctuation optional.


from pregex.core.classes import AnyFrom

text = "My phone number is 3452352312 or 345-235-2312 or 345 235 2312 or 345.235.2312"

punctuation = AnyFrom("-", " ", ".")
optional_punctuation = Optional(punctuation)
at_least_one_digit = OneOrMore(AnyDigit())

pre = (
    at_least_one_digit
    + optional_punctuation
    + at_least_one_digit
    + optional_punctuation
    + at_least_one_digit
)
pre.get_matches(text)
['3452352312', '345-235-2312', '345 235 2312', '345.235.2312']
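
One caveat: a pattern built from "one or more digits" will also accept any run of three or more digits, such as a year. The sketch below illustrates this with the standard re module and a stricter variant that fixes the group sizes at 3, 3, and 4 digits. Both pattern strings and the sample sentence are assumptions written by hand for illustration, not PRegEx output.

```python
import re

# Hand-written equivalent of the pattern above: digit runs with optional separators.
loose = r"\d+[-. ]?\d+[-. ]?\d+"

# Stricter variant: groups of exactly 3, 3, and 4 digits.
strict = r"\d{3}[-. ]?\d{3}[-. ]?\d{4}"

phones = "My phone number is 3452352312 or 345-235-2312 or 345 235 2312 or 345.235.2312"
other = "The order shipped in 2024"

loose_on_other = re.findall(loose, other)    # the year is picked up by mistake
strict_on_other = re.findall(strict, other)  # nothing matches
strict_on_phones = re.findall(strict, phones)

print(loose_on_other)    # ['2024']
print(strict_on_other)   # []
print(strict_on_phones)
```

Whether the stricter form is appropriate depends on the phone number formats you expect in your data.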

Capture an Email Address

Now let’s utilize what we have learned so far to capture an email address from a text.

text = "My email is abcd@gmail.com"

pre = (
    OneOrMore(AnyButWhitespace())
    + "@"
    + OneOrMore(AnyButWhitespace())
    + Either(".com", ".org", ".io", ".net")
)

pre.get_matches(text)

Output:

['abcd@gmail.com']

Next Step

This article gives you an overview of how to use PRegEx to match complicated patterns without spending hours on them.

I encourage you to check out PRegEx’s documentation for other useful methods.

