Motivation
Imagine you are trying to find all URLs inside a text. Each of these URLs must:
- Start with either http://, https://, or the domain of the URL
- End with either .com or .org
You might end up writing a complicated regular expression (RegEx) like the one below:
This RegEx is difficult to read and create. Is there a way that you can write a more human-readable RegEx with ease?
That is when PRegEx comes in handy.
What is PRegEx?
PRegEx is a Python package that allows you to construct RegEx patterns in a more human-friendly way.
To install PRegEx, type:
pip install pregex
The version of PRegEx that will be used in this article is 2.0.1:
pip install pregex==2.0.1
To learn how to use PRegEx, let’s start with some examples.
Capture URLs
Get a Simple URL
First, we will try to get a URL in a text using PRegEx.
from pregex.core.classes import AnyButWhitespace
from pregex.core.quantifiers import OneOrMore
from pregex.core.operators import Either
text = "You can find me through GitHub https://github.com/khuyentran1401"
pre = (
    "https://"
    + OneOrMore(AnyButWhitespace())
    + Either(".com", ".org")
    + OneOrMore(AnyButWhitespace())
)
pre.get_matches(text)
Output:
['https://github.com/khuyentran1401']
In the code above, we use:
- AnyButWhitespace() to match any character except for whitespace characters.
- OneOrMore() to match the provided pattern one or more times.
- Either() to match either one of the provided patterns.
Specifically:
- OneOrMore(AnyButWhitespace()) matches one or more characters that are not whitespace characters.
- Either(".com", ".org") matches either .com or .org.
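As a cross-check, the same match can be reproduced with Python's built-in re module. The raw pattern below is a hand-written equivalent of the PRegEx expression above, not necessarily the exact string PRegEx generates, and the name raw_pattern is only illustrative.

```python
import re

text = "You can find me through GitHub https://github.com/khuyentran1401"

# Hand-written equivalent (an assumption, not PRegEx's own output):
#   "https://"                     -> https://
#   OneOrMore(AnyButWhitespace())  -> \S+
#   Either(".com", ".org")         -> (?:\.com|\.org)
raw_pattern = r"https://\S+(?:\.com|\.org)\S+"

matches = re.findall(raw_pattern, text)
print(matches)  # ['https://github.com/khuyentran1401']
```

Reading the two versions side by side shows what each PRegEx building block stands for in plain RegEx.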
HTTP or HTTPS
Sometimes, a URL might use the scheme http instead of https. Let’s make the character s optional by using Optional():
from pregex.core.quantifiers import Optional
text = "You can find me through GitHub http://github.com/khuyentran1401"
pre = (
    "http"
    + Optional("s")
    + "://"
    + OneOrMore(AnyButWhitespace())
    + Either(".com", ".org")
    + OneOrMore(AnyButWhitespace())
)
pre.get_matches(text)
Output:
['http://github.com/khuyentran1401']
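To see that the optional s accepts both schemes, here is a small sketch using the standard re module. The pattern is a hand-written stand-in for the PRegEx expression above (an assumption, not PRegEx's generated output), where Optional("s") corresponds to the s? part.

```python
import re

# Hand-written equivalent: "http" + Optional("s") + "://" becomes https?://
raw_pattern = r"https?://\S+(?:\.com|\.org)\S+"

http_text = "You can find me through GitHub http://github.com/khuyentran1401"
https_text = "You can find me through GitHub https://github.com/khuyentran1401"

http_matches = re.findall(raw_pattern, http_text)
https_matches = re.findall(raw_pattern, https_text)

print(http_matches)   # ['http://github.com/khuyentran1401']
print(https_matches)  # ['https://github.com/khuyentran1401']
```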
Match URL without a Scheme
Some URLs in a text might not include a scheme such as https or http. Let’s make the scheme optional with Optional().
To make our code more readable, we will assign each PRegEx pattern to a variable.
text = "You can find me through my website codecut.ai/ or GitHub https://github.com/khuyentran1401"
at_least_one_character_except_white_space = OneOrMore(AnyButWhitespace())
optional_scheme = Optional("http" + Optional("s") + "://")
domain_choice = Either(".com", ".org", ".ai")
pre = (
    optional_scheme
    + at_least_one_character_except_white_space
    + domain_choice
    + at_least_one_character_except_white_space
)
pre.get_matches(text)
Output:
['codecut.ai/', 'https://github.com/khuyentran1401']
Let’s take a look at the underlying RegEx pattern:
pre.get_pattern()
Output:
(?:https?\:\/\/)?[^\s]+(?:\.com|\.org|\.ai)[^\s]+
We have just avoided creating a complicated pattern with some human-readable lines of code!
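One consequence of this pattern is worth checking: because at least one non-whitespace character is required after the domain ending, a URL that stops right at .ai or .com is not matched. The sketch below illustrates this with the standard re module, using a hand-written pattern of the same shape (an assumption, not necessarily PRegEx's exact output).

```python
import re

# Optional scheme, non-whitespace run, domain ending, non-whitespace run.
raw_pattern = r"(?:https?://)?\S+(?:\.com|\.org|\.ai)\S+"

text = (
    "You can find me through my website codecut.ai/ "
    "or GitHub https://github.com/khuyentran1401"
)
matches = re.findall(raw_pattern, text)
print(matches)  # ['codecut.ai/', 'https://github.com/khuyentran1401']

# Nothing follows ".ai" here, so the trailing \S+ has nothing to consume.
bare_matches = re.findall(raw_pattern, "Visit codecut.ai today")
print(bare_matches)  # []
```

If bare domains should also match, making the trailing part optional is one possible adjustment.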
Capture Time
AnyDigit() matches any numeric character. Let’s use this to match a time in a text.
from pregex.core.classes import AnyDigit
text = "It is 6:00 pm now"
pre = AnyDigit() + ":" + AnyDigit()
pre.get_matches(text)
['6:0']
Right now, we only match one digit on either side of the colon. Let’s make this more general by wrapping OneOrMore() around AnyDigit():
pre = OneOrMore(AnyDigit()) + ":" + OneOrMore(AnyDigit())
pre.get_matches(text)
['6:00']
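The difference between one digit and one or more digits can be sketched with the standard re module. The patterns are hand-written approximations of the two PRegEx expressions above, with \d standing in for AnyDigit(); the variable names are illustrative.

```python
import re

text = "It is 6:00 pm now"

# One digit on each side of the colon, like AnyDigit() + ":" + AnyDigit()
single_digit = re.findall(r"\d:\d", text)

# One or more digits on each side, like wrapping AnyDigit() in OneOrMore()
multi_digit = re.findall(r"\d+:\d+", text)

print(single_digit)  # ['6:0']
print(multi_digit)   # ['6:00']
```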
Capture Phone Numbers
Common formats for a phone number are:
##########
###-###-####
### ### ####
###.###.####
These formats either have punctuation or nothing between numbers. We can use AnyFrom("-", " ", ".") to match either -, ., or a space.
We also use Optional() to make the punctuation optional.
from pregex.core.classes import AnyFrom
text = "My phone number is 3452352312 or 345-235-2312 or 345 235 2312 or 345.235.2312"
punctuation = AnyFrom("-", " ", ".")
optional_punctuation = Optional(punctuation)
at_least_one_digit = OneOrMore(AnyDigit())
pre = (
    at_least_one_digit
    + optional_punctuation
    + at_least_one_digit
    + optional_punctuation
    + at_least_one_digit
)
pre.get_matches(text)
['3452352312', '345-235-2312', '345 235 2312', '345.235.2312']
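Note that "one or more digits" is quite permissive: any run of three or more digits fits this pattern. The sketch below uses the standard re module to compare a hand-written equivalent of the pattern above with a stricter 3-3-4 variant. Both raw patterns and their names are assumptions made for illustration, not PRegEx output.

```python
import re

text = (
    "My phone number is 3452352312 or 345-235-2312 "
    "or 345 235 2312 or 345.235.2312"
)

# Loose: digit runs separated by an optional "-", " " or "."
loose = r"\d+[-. ]?\d+[-. ]?\d+"
# Stricter: exactly 3, 3 and 4 digits
strict = r"\d{3}[-. ]?\d{3}[-. ]?\d{4}"

loose_matches = re.findall(loose, text)
strict_matches = re.findall(strict, text)
print(loose_matches == strict_matches)  # True: both find all four numbers

# A plain year is not a phone number, but the loose pattern accepts it.
print(re.findall(loose, "See you in 2024"))   # ['2024']
print(re.findall(strict, "See you in 2024"))  # []
```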
Capture an Email Address
Now let’s utilize what we have learned so far to capture an email address from a text.
text = "My email is abcd@gmail.com"
pre = (
    OneOrMore(AnyButWhitespace())
    + "@"
    + OneOrMore(AnyButWhitespace())
    + Either(".com", ".org", ".io", ".net")
)
pre.get_matches(text)
Output:
['abcd@gmail.com']
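Because "anything but whitespace" also includes punctuation, this pattern can pick up surrounding characters. The sketch below shows this with the standard re module and a hand-written equivalent of the pattern above (an assumption, not PRegEx's generated string).

```python
import re

# Non-whitespace run, "@", non-whitespace run, then one of the endings.
raw_pattern = r"\S+@\S+(?:\.com|\.org|\.io|\.net)"

matches = re.findall(raw_pattern, "My email is abcd@gmail.com")
print(matches)  # ['abcd@gmail.com']

# The opening parenthesis is not whitespace, so it becomes part of the match.
wrapped = re.findall(raw_pattern, "Contact (abcd@gmail.com) today")
print(wrapped)  # ['(abcd@gmail.com']
```

For real-world text, a narrower character class for the part before @ may be a better fit.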
Next Step
This article gives you an overview of how to use PRegEx to match complicated patterns without spending hours on them.
I encourage you to check out PRegEx’s documentation for other useful methods.
Feel free to play with and fork the source code of this article here.