PRegEx: Write Human-Readable Regular Expressions

Motivation

Imagine you are trying to find all URLs inside a text. Each of these URLs must:

Start with either http:// or https:// or the domain of the URL
End with either .com or .org

You might end up writing a long, complicated regular expression (RegEx) to cover all of these cases.

Such a RegEx is difficult to read and to write. Is there a way to write a more human-readable RegEx with ease?

That is where PRegEx comes in handy.

What is PRegEx?

PRegEx is a Python package that allows you to construct RegEx patterns in a more human-friendly way.

To install PRegEx, type:

pip install pregex

The version of PRegEx that will be used in this article is 2.0.1:

pip install pregex==2.0.1

To learn how to use PRegEx, let’s start with some examples.

Capture URLs

Get a Simple URL

First, we will try to get a URL in a text using PRegEx.

from pregex.core.classes import AnyButWhitespace
from pregex.core.quantifiers import OneOrMore
from pregex.core.operators import Either

text = "You can find me through GitHub https://github.com/khuyentran1401"

pre = (
    "https://"
    + OneOrMore(AnyButWhitespace())
    + Either(".com", ".org")
    + OneOrMore(AnyButWhitespace())
)
pre.get_matches(text)

Output:

['https://github.com/khuyentran1401']

In the code above, we use:

AnyButWhitespace() to match any character except for whitespace characters.
OneOrMore() to match the provided pattern one or more times.
Either() to match any one of the provided patterns.

Specifically,

OneOrMore(AnyButWhitespace()) matches one or more characters that are not whitespace characters.
Either(".com", ".org") matches either .com or .org.
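
For comparison, here is a minimal sketch of roughly the same pattern written by hand with Python's built-in re module. The fragment strings and variable names are illustrative assumptions, not output generated by PRegEx; \S is used here as the hand-written counterpart of AnyButWhitespace().

```python
import re

text = "You can find me through GitHub https://github.com/khuyentran1401"

# Hand-written counterparts of the PRegEx building blocks (assumed equivalents)
non_whitespace_run = r"\S+"        # plays the role of OneOrMore(AnyButWhitespace())
com_or_org = r"(?:\.com|\.org)"    # plays the role of Either(".com", ".org")

# Concatenate the fragments in the same order as the PRegEx version
raw_pattern = "https://" + non_whitespace_run + com_or_org + non_whitespace_run

raw_matches = re.findall(raw_pattern, text)
print(raw_matches)
```

Even in this small case, the raw fragments need escaping and grouping that the PRegEx building blocks handle for you.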

HTTP or HTTPS

Sometimes, a URL might use the scheme http instead of https. Let’s make the character s optional by using Optional():

from pregex.core.quantifiers import Optional

text = "You can find me through GitHub http://github.com/khuyentran1401"

pre = (
    "http"
    + Optional("s")
    + "://"
    + OneOrMore(AnyButWhitespace())
    + Either(".com", ".org")
    + OneOrMore(AnyButWhitespace())
)
pre.get_matches(text)

Output:

['http://github.com/khuyentran1401']

Match URL without a Scheme

Some URLs in a text might not include a scheme such as https or http. Let’s make the scheme optional with Optional().

To make our code more readable, we will assign the PRegEx patterns to descriptively named variables.

text = "You can find me through my website codecut.ai/ or GitHub https://github.com/khuyentran1401"

at_least_one_character_except_white_space = OneOrMore(AnyButWhitespace())
optional_scheme = Optional("http" + Optional("s") + "://")
domain_choice = Either(".com", ".org", ".ai")

pre = (
    optional_scheme
    + at_least_one_character_except_white_space
    + domain_choice
    + at_least_one_character_except_white_space
)
pre.get_matches(text)

Output:

['codecut.ai/', 'https://github.com/khuyentran1401']

Let’s take a look at the underlying RegEx pattern:

pre.get_pattern()

(?:https?\:\/\/)?[^\s]+(?:\.com|\.org|\.ai)[^\s]+

We have just avoided creating a complicated pattern with some human-readable lines of code!
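
One detail worth checking is that the final OneOrMore(AnyButWhitespace()) requires at least one character after the domain suffix, so a bare domain with nothing after it is not matched. The sketch below illustrates this with Python's built-in re module, using a hand-written pattern that is assumed to behave like the one built above. The relaxed variant (zero or more trailing characters) is a hypothetical adjustment, not something taken from this article.

```python
import re

# Hand-written pattern assumed to mirror the PRegEx pattern built above
strict_tail = r"(?:https?://)?\S+(?:\.com|\.org|\.ai)\S+"
# Hypothetical variant: allow zero or more characters after the domain suffix
relaxed_tail = r"(?:https?://)?\S+(?:\.com|\.org|\.ai)\S*"

with_slash = "My website is codecut.ai/ today"
without_slash = "My website is codecut.ai today"

# The trailing slash satisfies the "one or more characters" requirement
strict_with_slash = re.findall(strict_tail, with_slash)
# Without the slash, nothing follows ".ai" before the space, so no match
strict_without_slash = re.findall(strict_tail, without_slash)
# The relaxed variant accepts the bare domain
relaxed_without_slash = re.findall(relaxed_tail, without_slash)

print(strict_with_slash, strict_without_slash, relaxed_without_slash)
```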

Capture Time

AnyDigit() matches any numeric character. Let’s use this to match a time in a text.

from pregex.core.classes import AnyDigit

text = "It is 6:00 pm now"
pre = AnyDigit() + ":" + AnyDigit()
pre.get_matches(text)

['6:0']

Right now, we only match one digit on either side of the colon. Let’s make this more general by wrapping OneOrMore() around AnyDigit():

pre = OneOrMore(AnyDigit()) + ":" + OneOrMore(AnyDigit())
pre.get_matches(text)

['6:00']
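
The difference between the two versions can also be seen with plain regular expressions. This is a small sketch using Python's built-in re module, where \d is assumed to stand in for AnyDigit() and + for OneOrMore(); the variable names are illustrative.

```python
import re

text = "It is 6:00 pm now"

# One digit on each side of the colon
single_digit_matches = re.findall(r"\d:\d", text)
# One or more digits on each side of the colon
multi_digit_matches = re.findall(r"\d+:\d+", text)

print(single_digit_matches)
print(multi_digit_matches)
```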

Capture Phone Numbers

Common formats for a phone number are:

##########
###-###-####
### ### ####
###.###.####

These formats have either a punctuation character or nothing between the groups of digits. We can use AnyFrom("-", " ", ".") to match either -, ., or a space.

We also use Optional() to make the punctuation optional.

from pregex.core.classes import AnyFrom

text = "My phone number is 3452352312 or 345-235-2312 or 345 235 2312 or 345.235.2312"

punctuation = AnyFrom("-", " ", ".")
optional_punctuation = Optional(punctuation)
at_least_one_digit = OneOrMore(AnyDigit())

pre = (
    at_least_one_digit
    + optional_punctuation
    + at_least_one_digit
    + optional_punctuation
    + at_least_one_digit
)
pre.get_matches(text)

['3452352312', '345-235-2312', '345 235 2312', '345.235.2312']
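
Because each group only requires one or more digits, this pattern also accepts digit runs that are not phone numbers. The sketch below, written with Python's built-in re module, compares a hand-written pattern assumed to mirror the one above with a hypothetical stricter variant that fixes the group sizes at 3, 3, and 4 digits. Both pattern strings are illustrative assumptions rather than PRegEx output.

```python
import re

# Assumed hand-written counterpart of the OneOrMore-based pattern above
loose = r"\d+[-. ]?\d+[-. ]?\d+"
# Hypothetical stricter variant: groups of exactly 3, 3, and 4 digits
strict = r"\d{3}[-. ]?\d{3}[-. ]?\d{4}"

phones = "My phone number is 3452352312 or 345-235-2312 or 345 235 2312 or 345.235.2312"
not_a_phone = "Order 12345 shipped"

loose_phones = re.findall(loose, phones)
strict_phones = re.findall(strict, phones)

# The loose pattern accepts any run of three or more digits
loose_false_positive = re.findall(loose, not_a_phone)
# The strict pattern needs ten digits in total, so it rejects the order number
strict_false_positive = re.findall(strict, not_a_phone)

print(loose_false_positive, strict_false_positive)
```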

Capture an Email Address

Now let’s utilize what we have learned so far to capture an email address from a text.

text = "My email is abcd@gmail.com"

pre = (
    OneOrMore(AnyButWhitespace())
    + "@"
    + OneOrMore(AnyButWhitespace())
    + Either(".com", ".org", ".io", ".net")
)

pre.get_matches(text)

Output:

['abcd@gmail.com']
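
Since the pattern is built from runs of non-whitespace characters, it should also pick out several addresses in one sentence. Here is a quick sketch with Python's built-in re module; the pattern is a hand-written approximation of the one above, and the second address is a made-up example.

```python
import re

# Hand-written approximation of the PRegEx email pattern above
email_pattern = r"\S+@\S+(?:\.com|\.org|\.io|\.net)"

# team@example.org is a made-up address used only for illustration
text = "Contact abcd@gmail.com or team@example.org for details"

emails = re.findall(email_pattern, text)
print(emails)
```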

Next Step

This article gives you an overview of how to use PRegEx to match complicated patterns without spending hours on them.

I encourage you to check out PRegEx’s documentation for other useful methods.

Feel free to play with and fork the source code of this article here.
