Python regular expressions for beginners: what they are and why you need them


Over the past few years, machine learning, data science, and related industries have made great strides forward. More and more companies and developers are using Python and JavaScript to work with data.

And this is where we need regular expressions. Whether parsing all or portions of text from web pages, analyzing Twitter data, or preparing data for text analysis – regular expressions come to the rescue.

By the way, Alexey Nekrasov, the leader of the Python department at MTS and the program director of the Python department at Skillbox, has added his advice on some functions. To make it clear which parts are the translation and which are his comments, the comments are set off as quotations.

Why are regular expressions needed?

They help to quickly solve a variety of tasks when working with data:

  • Check that data matches a required format, such as a phone number or e-mail address.
  • Split strings into substrings.
  • Search, extract and replace characters.
  • Perform non-trivial operations quickly.

The good news is that most regular expression syntax is shared across languages and tools, so once you understand it, you can use it almost anywhere: not only in Python, but in most other programming languages, with minor differences between dialects.

When are regular expressions unnecessary? When there is a similar built-in function in Python, and there are quite a few of them.
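As a rough illustration of that point, the sketch below handles a few common tasks with built-in string methods only. The sample string and variable names are chosen just for this example.

```python
text = 'AV Analytics Vidhya AV'

# Check a fixed prefix: str.startswith instead of a regex anchored at the start
starts_with_av = text.startswith('AV')

# Check for a fixed substring: the `in` operator instead of a regex search
has_analytics = 'Analytics' in text

# Count occurrences of a fixed substring
av_count = text.count('AV')

# Split on a fixed delimiter
words = text.split(' ')

# Replace only the first occurrence of a fixed substring
replaced = text.replace('AV', 'Analytics Vidhya', 1)

print(starts_with_av, has_analytics, av_count)
print(words)
print(replaced)
```

Regular expressions start to pay off once the thing you are looking for is a pattern rather than a fixed piece of text.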

What about regular expressions in Python?

Python has a dedicated re module for working with regular expressions. Import it, and you can start using regular expressions.

import re

As for the most popular methods provided by the module, here they are:

  • re.match()
  • re.search()
  • re.findall()
  • re.split()
  • re.sub()
  • re.compile()

Let’s take a look at each of them.

re.match(pattern, string)

This method searches for the given pattern at the beginning of a string. So, if you call match() on the string “AV Analytics Vidhya AV” with the pattern “AV”, the match succeeds.

import re
result = re.match(r'AV', 'AV Analytics Vidhya AV')
print(result) 
Result:
<re.Match object; span=(0, 2), match='AV'>

Here we found the required substring. The group() method returns its contents. The “r” prefix in front of the pattern marks it as a raw string in Python, so backslashes in the pattern are passed to the re module unchanged.

result = re.match(r'AV', 'AV Analytics Vidhya AV')
print(result.group(0))
 
Result:
AV
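To see why the raw-string prefix matters, here is a small sketch (the sample text is only an illustration). In an ordinary string literal, Python itself interprets sequences such as \n and \b before the re module ever sees them.

```python
import re

# In a normal literal, '\n' is one newline character;
# in a raw literal, r'\n' is two characters: a backslash and 'n'.
normal_len = len('\n')
raw_len = len(r'\n')

text = 'AV is largest'

# r'\b' reaches the re module as the word-boundary token
with_raw = re.findall(r'\bis\b', text)

# '\b' in a normal literal is a backspace character, so nothing matches
without_raw = re.findall('\bis\b', text)

print(normal_len, raw_len)
print(with_raw, without_raw)
```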

Okay, now let’s try to find “Analytics” in the same string. We won’t succeed: since the string begins with “AV”, the method returns None:

result = re.match(r'Analytics', 'AV Analytics Vidhya AV')
print(result)
 
Result:
None

The start() and end() methods return the start and end positions of the matched substring.

result = re.match(r'AV', 'AV Analytics Vidhya AV')
print(result.start())
print(result.end())
 
Result:
0
2

All of these methods are extremely useful when working with strings.
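Because re.match() returns None when nothing matches, calling group() directly on the result raises an AttributeError. A minimal sketch of the usual guard follows; the helper name first_match_or_none is hypothetical and exists only for this illustration.

```python
import re

def first_match_or_none(pattern, string):
    """Return the text matched at the start of `string`, or None if there is no match."""
    result = re.match(pattern, string)
    if result is None:
        return None
    return result.group(0)

found = first_match_or_none(r'AV', 'AV Analytics Vidhya AV')
missing = first_match_or_none(r'Analytics', 'AV Analytics Vidhya AV')

print(found)
print(missing)
```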

re.search(pattern, string)

This method is similar to match(), but it does not search only at the beginning of a string. So, search() returns a match object if we try to find “Analytics”.

result = re.search(r'Analytics', 'AV Analytics Vidhya AV')
print(result.group(0))
 
Result:
Analytics

The search() method scans the entire string, but returns only the first match it finds.
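The difference between the two methods can be seen side by side in a short sketch using the same sample string:

```python
import re

text = 'AV Analytics Vidhya AV'

# match() only looks at the start of the string, so this is None
at_start = re.match(r'Analytics', text)

# search() scans the whole string and finds the word further in
anywhere = re.search(r'Analytics', text)

# 'AV' occurs twice, but search() reports only the first occurrence
first_av = re.search(r'AV', text)

print(at_start)
print(anywhere.span())
print(first_av.span())
```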

re.findall(pattern, string)

This method returns a list of all matches found. Unlike match(), findall() is not restricted to the beginning of the string: if you search for “AV”, every occurrence of “AV” is returned. It can therefore cover the cases handled by both re.search() and re.match(), although it returns plain strings rather than match objects.

result = re.findall(r'AV', 'AV Analytics Vidhya AV')
print(result)
 
Result:
['AV', 'AV']
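Since findall() returns plain strings, the positions of the matches are lost. If you need them, one option is re.finditer(), which yields match objects instead. A short sketch:

```python
import re

text = 'AV Analytics Vidhya AV'

# findall() gives only the matched text
strings = re.findall(r'AV', text)

# finditer() gives match objects, so start() and end() are available
spans = [(m.group(0), m.start(), m.end()) for m in re.finditer(r'AV', text)]

print(strings)
print(spans)
```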

re.split(pattern, string, maxsplit=0)

This method splits a string according to a given pattern.

result = re.split(r'y', 'Analytics')
print(result)
 
Result:
['Anal', 'tics']

In this example, the word “Analytics” is split on the letter “y”. The split() method also accepts a maxsplit argument with a default value of 0, which means the string is split as many times as possible. If you specify a nonzero value, the string is split at most that many times. Here are some examples:

result = re.split(r'i', 'Analytics Vidhya')
print(result)
 
Result:
['Analyt', 'cs V', 'dhya'] # all possible pieces
 
result = re.split(r'i', 'Analytics Vidhya', maxsplit=1)
print(result)
 
Result:
['Analyt', 'cs Vidhya']

Here the maxsplit parameter is set to 1, so the string is split into two parts instead of three.

re.sub(pattern, repl, string)

Finds the pattern in a string and replaces every occurrence with the specified substring. If the pattern is not found, the string is returned unchanged.

result = re.sub(r'India', 'the World', 'AV is largest Analytics community of India')
print(result)
 
Result:
AV is largest Analytics community of the World
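re.sub() also takes an optional count argument that limits the number of replacements, and the related re.subn() reports how many replacements were made. A brief sketch with an illustrative sample string:

```python
import re

text = 'AV Analytics Vidhya AV'

# Replace every occurrence
all_replaced = re.sub(r'AV', 'X', text)

# Replace only the first occurrence
first_only = re.sub(r'AV', 'X', text, count=1)

# No match: the original string comes back unchanged
unchanged = re.sub(r'India', 'the World', text)

# subn() returns the new string together with the number of replacements
new_text, replacements = re.subn(r'AV', 'X', text)

print(all_replaced)
print(first_only)
print(unchanged)
print(new_text, replacements)
```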

re.compile(pattern)

Here we can compile the regular expression into a pattern object, which can then be used for searching. This avoids rewriting the same expression.

pattern = re.compile('AV')
result = pattern.findall('AV Analytics Vidhya AV')
print(result)
result2 = pattern.findall('AV is largest analytics community of India')
print(result2)
 
Result:
['AV', 'AV']
['AV']
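re.compile() also accepts flags as a second argument. As a small sketch, re.IGNORECASE makes the compiled pattern match regardless of letter case; the sample strings are only illustrations.

```python
import re

# Compile once, reuse many times; the flag applies to every call
pattern = re.compile(r'av', re.IGNORECASE)

upper_hits = pattern.findall('AV Analytics Vidhya AV')
lower_hit = pattern.match('av is lowercase here')
no_hit = pattern.search('nothing to find in this text')

print(upper_hits)
print(lower_hit.group(0))
print(no_hit)
```

The compiled object offers the same methods as the module-level functions (match, search, findall, split, sub), without the pattern argument.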

Up to this point, we have searched for a specific sequence of characters. But what if there is no fixed string, and we need to return characters that follow certain rules? This is a common task when extracting information from strings. It is easy to do: you just need to write an expression using special characters. The most common ones are:

  • . Any single character except newline \n
  • ? 0 or 1 occurrence of the pattern on the left
  • + 1 or more occurrences of the pattern on the left
  • * 0 or more occurrences of the pattern on the left
  • \w Any letter, digit, or underscore (\W – everything else)
  • \d Any digit [0-9] (\D – everything except a digit)
  • \s Any whitespace character (\S – any non-whitespace character)
  • \b Word boundary
  • [..] One of the characters in the brackets ([^..] – any character except those in the brackets)
  • \ Escapes special characters (\. stands for a period, \+ for a plus sign)
  • ^ and $ Beginning and end of the string, respectively
  • {n,m} n to m occurrences ({,m} – 0 to m)
  • a|b Matches a or b
  • () Groups the expression and returns the found text
  • \t, \n, \r Tab, newline, and carriage return, respectively

There are more special characters than these. You can find them in the documentation for regular expressions in Python 3.
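As a quick tour of the list above, the sketch below applies several of these special characters to one made-up sample string. The string and the variable names are assumptions for illustration only.

```python
import re

sample = 'Order 66: 3 apples, 12 pears\tdone.'

# \d+ : one or more digits in a row
digits = re.findall(r'\d+', sample)

# \w+ : one or more letters, digits, or underscores
words = re.findall(r'\w+', sample)

# \b and {2} : numbers made of exactly two digits
two_digit = re.findall(r'\b\d{2}\b', sample)

# | : either alternative
fruits = re.findall(r'apples|pears', sample)

# ? : the final "s" is optional
optional_s = re.findall(r'pears?', 'pear pears')

# \. : a literal period rather than "any character"
periods = re.findall(r'\.', sample)

# \t : split on the tab character
tab_parts = re.split(r'\t', sample)

print(digits, words, two_digit, fruits, optional_s, periods, tab_parts, sep='\n')
```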

Some examples of using regular expressions

Example 1. Returning the first word from a string

Let’s first try to get each character using (.)

result = re.findall(r'.', 'AV is largest Analytics community of India')
print(result)
 
Result:
['A', 'V', ' ', 'i', 's', ' ', 'l', 'a', 'r', 'g', 'e', 's', 't', ' ', 'A', 'n', 'a', 'l', 'y', 't', 'i', 'c', 's', ' ', 'c', 'o', 'm', 'm', 'u', 'n', 'i', 't', 'y', ' ', 'o', 'f', ' ', 'I', 'n', 'd', 'i', 'a']

Now we will do the same, but so that the result does not include spaces, we use \w instead of the period:

result = re.findall(r'\w', 'AV is largest Analytics community of India')
print(result)

Result:
['A', 'V', 'i', 's', 'l', 'a', 'r', 'g', 'e', 's', 't', 'A', 'n', 'a', 'l', 'y', 't', 'i', 'c', 's', 'c', 'o', 'm', 'm', 'u', 'n', 'i', 't', 'y', 'o', 'f', 'I', 'n', 'd', 'i', 'a']

Now let’s do a similar operation with each word. For this we use * or +.

result = re.findall(r'\w*', 'AV is largest Analytics community of India')
print(result)

Result:
['AV', '', 'is', '', 'largest', '', 'Analytics', '', 'community', '', 'of', '', 'India', '']

But the result still contains empty strings. The reason is that * means “zero or more characters”, so it also matches the empty string at each space. The “+” will help us remove them.

result = re.findall(r'\w+', 'AV is largest Analytics community of India')
print(result)

Result:
['AV', 'is', 'largest', 'Analytics', 'community', 'of', 'India']

Now let’s extract the first word using ^:

result = re.findall(r'^\w+', 'AV is largest Analytics community of India')
print(result)

Result:
['AV']

But if you use $ instead of ^, then we get the last word, not the first:

result = re.findall(r'\w+$', 'AV is largest Analytics community of India')
print(result)

Result:
['India']
 

Example 2. Returning two characters of each word

Here, as above, there are several options. In the first case, using \w twice, we extract every pair of consecutive word characters (spaces are skipped) from each word:

result = re.findall(r'\w\w', 'AV is largest Analytics community of India')
print(result)

Result:
['AV', 'is', 'la', 'rg', 'es', 'An', 'al', 'yt', 'ic', 'co', 'mm', 'un', 'it', 'of', 'In', 'di']

Now we extract only the first two characters of each word, using the word boundary character (\b):

result = re.findall(r'\b\w.', 'AV is largest Analytics community of India')
print(result)

Result:
['AV', 'is', 'la', 'An', 'co', 'of', 'In']

Example 3. Returning domains from a list of email addresses.

In the first step, we return all word characters that follow the @:

result = re.findall(r'@\w+', 'abc.test@gmail.com, xyz@test.in, test.first@analyticsvidhya.com, first.test@rest.biz')
print(result)

Result:
['@gmail', '@test', '@analyticsvidhya', '@rest']

As you can see, the parts “.com”, “.in”, etc. are missing from the result, because \w does not match a period. To fix this, change the pattern:

result = re.findall(r'@\w+\.\w+', 'abc.test@gmail.com, xyz@test.in, test.first@analyticsvidhya.com, first.test@rest.biz')
print(result)

Result:
['@gmail.com', '@test.in', '@analyticsvidhya.com', '@rest.biz']

A second option is to extract only the top-level domain, using “()”:

result = re.findall(r'@\w+\.(\w+)', 'abc.test@gmail.com, xyz@test.in, test.first@analyticsvidhya.com, first.test@rest.biz')
print(result)

Result:
['com', 'in', 'com', 'biz']
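The patterns above assume a domain with exactly one period. A hedged sketch of one way to also capture multi-level domains follows; the address user@mail.example.co.uk is made up for this illustration, and the pattern is not meant as a full e-mail validator.

```python
import re

emails = 'abc.test@gmail.com, xyz@test.in, user@mail.example.co.uk'

# After the @: one label of word characters or hyphens,
# followed by one or more ".label" parts.
# The outer group is captured; (?:...) groups without capturing.
domains = re.findall(r'@([\w-]+(?:\.[\w-]+)+)', emails)

print(domains)
```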

Example 4. Getting a date from a string

To do this, you need to use \d:

result = re.findall(r'\d{2}-\d{2}-\d{4}', 'Amit 34-3456 12-05-2007, XYZ 56-4532 11-11-2011, ABC 67-8945 12-01-2009')
print(result)

Result:
['12-05-2007', '11-11-2011', '12-01-2009']

To extract only the year, the parentheses help:

result = re.findall(r'\d{2}-\d{2}-(\d{4})', 'Amit 34-3456 12-05-2007, XYZ 56-4532 11-11-2011, ABC 67-8945 12-01-2009')
print(result)

Result:
['2007', '2011', '2009']
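Note that a pattern like this checks only the shape of a date, not whether the date exists. One possible follow-up step, sketched below, is to pass each candidate to datetime.strptime. The day-month-year order is an assumption, and the sample string has been altered to include the impossible date 31-02-2011.

```python
import re
from datetime import datetime

text = 'Amit 34-3456 12-05-2007, XYZ 56-4532 31-02-2011, ABC 67-8945 12-01-2009'

# Step 1: find everything shaped like dd-dd-dddd
candidates = re.findall(r'\b\d{2}-\d{2}-\d{4}\b', text)

# Step 2: keep only candidates that are real calendar dates
valid_dates = []
for candidate in candidates:
    try:
        # Assumption: dates are written as day-month-year
        valid_dates.append(datetime.strptime(candidate, '%d-%m-%Y').date())
    except ValueError:
        # Right shape, but not a real date (for example 31 February)
        pass

print(candidates)
print(valid_dates)
```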

Example 5. Extracting words starting with a vowel

At the first stage, you need to return all the words:

result = re.findall(r'\w+', 'AV is largest Analytics community of India')
print(result)

Result:
['AV', 'is', 'largest', 'Analytics', 'community', 'of', 'India']

After that, we keep only those that begin with certain letters, using “[]”:

result = re.findall(r'[aeiouAEIOU]\w+', 'AV is largest Analytics community of India')
print(result)

Result:
['AV', 'is', 'argest', 'Analytics', 'ommunity', 'of', 'India']

The result contains two truncated words, “argest” and “ommunity”. To remove them, use \b, which denotes a word boundary:

result = re.findall(r'\b[aeiouAEIOU]\w+', 'AV is largest Analytics community of India')
print(result)

Result:
['AV', 'is', 'Analytics', 'of', 'India']

To find words that do not start with a vowel instead, use ^ inside the square brackets, which negates the character set:

result = re.findall(r'\b[^aeiouAEIOU]\w+', 'AV is largest Analytics community of India')
print(result)

Result:
[' is', ' largest', ' Analytics', ' community', ' of', ' India']

Now the matches start with a space, because a space is also “not a vowel”. To exclude them, we add the space to the set in square brackets:

result = re.findall(r'\b[^aeiouAEIOU ]\w+', 'AV is largest Analytics community of India')
print(result)

Result:
['largest', 'community']
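Listing both cases of every vowel can be avoided with the re.IGNORECASE flag. The sketch below is one possible variation on the patterns above, not the only way to write them; putting \W inside the negated set excludes spaces and punctuation in one go.

```python
import re

text = 'AV is largest Analytics community of India'

# Words starting with a vowel, in either case
vowel_words = re.findall(r'\b[aeiou]\w+', text, flags=re.IGNORECASE)

# Words starting with a word character that is not a vowel.
# [^aeiou\W] reads as "not a vowel and not a non-word character".
consonant_words = re.findall(r'\b[^aeiou\W]\w+', text, flags=re.IGNORECASE)

print(vowel_words)
print(consonant_words)
```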

Example 6. Checking the format of a telephone number

In our example, a number is 10 digits long and starts with 8 or 9. To check a list of phone numbers, use:

li = ['9999999999', '999999-999', '99999x9999']

for val in li:
    # fullmatch() requires the whole string to match, so no separate length check is needed
    if re.fullmatch(r'[89][0-9]{9}', val):
        print('yes')
    else:
        print('no')

Result:
yes
no
no

Example 7. Splitting a string by multiple delimiters

Here we have several solutions. Here’s the first one:

line = "asdf fjdk;afed,fjek,asdf,foo"  # The string has multiple delimiters (";", ",", " ").
result = re.split(r'[;,\s]', line)
print(result)

Result:
['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']

Alternatively, the re.sub() method can be used to replace all delimiters with spaces:

line = "asdf fjdk;afed,fjek,asdf,foo"
result = re.sub(r'[;,\s]', ' ', line)
print(result)

Result:
asdf fjdk afed fjek asdf foo
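When delimiters can follow one another (for example “; ” or two commas in a row), splitting on a single delimiter character produces empty strings. A small sketch of the difference, using a deliberately messy variant of the sample line:

```python
import re

line = 'asdf  fjdk; afed,fjek,,asdf,foo'

# One delimiter character per split: adjacent delimiters leave empty strings
single = re.split(r'[;,\s]', line)

# Adding + treats any run of delimiters as one separator
collapsed = re.split(r'[;,\s]+', line)

print(single)
print(collapsed)
```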

Example 8. Extracting data from an HTML file

In this example, we extract the data enclosed between <td> and </td> in an HTML file, except for the first column with a number. We also assume that the HTML code is stored in a string.

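Sample file

<td>1</td> <td>Noah</td> <td>Emma</td>
<td>2</td> <td>Liam</td> <td>Olivia</td>
<td>3</td> <td>Mason</td> <td>Sophia</td>
<td>4</td> <td>Jacob</td> <td>Isabella</td>
<td>5</td> <td>William</td> <td>Ava</td>
<td>6</td> <td>Ethan</td> <td>Mia</td>
<td>7</td> <td>Michael</td> <td>Emily</td>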

To solve this problem, we run the following code:

# html_text is the string that holds the HTML code of the sample file
result = re.findall(r'<td>\w+</td>\s<td>(\w+)</td>\s<td>(\w+)</td>', html_text)
print(result)
Result:
[('Noah', 'Emma'), ('Liam', 'Olivia'), ('Mason', 'Sophia'), ('Jacob', 'Isabella'), ('William', 'Ava'), ('Ethan', 'Mia'), ('Michael', 'Emily')]
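For a version that runs on its own, here is a sketch with a shortened three-row sample. The variable name html_text and the exact markup of the rows are assumptions. The pattern uses \s* instead of \s, so that cells are matched whether or not there is whitespace between them.

```python
import re

# Assumed sample markup: three rows, the second one without spaces between cells
html_text = (
    '<td>1</td> <td>Noah</td> <td>Emma</td>\n'
    '<td>2</td><td>Liam</td><td>Olivia</td>\n'
    '<td>3</td> <td>Mason</td> <td>Sophia</td>'
)

# Skip the first cell (the number), capture the next two cells
pairs = re.findall(r'<td>\w+</td>\s*<td>(\w+)</td>\s*<td>(\w+)</td>', html_text)

print(pairs)
```

Regular expressions tend to be brittle on real-world HTML (attributes, nested tags, line breaks); for anything beyond a simple, regular fragment, a parser such as the standard library's html.parser is usually the safer choice.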

Alexey’s comment

When writing any regex in code, follow these rules:

  • Use re.compile for any reasonably long or complex regular expression. Also, avoid calling re.compile multiple times on the same regex.
  • Write long regular expressions in verbose form: pass the re.VERBOSE flag to re.compile and spread the regex over multiple lines, with comments explaining what each part does. See the re module documentation for details.

Example, compact form:

pattern = '^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$'
re.search(pattern, 'MDLV')

Verbose form:

pattern = """
    ^                   # beginning of string
    M{0,3}              # thousands - 0 to 3 Ms
    (CM|CD|D?C{0,3})    # hundreds - 900 (CM), 400 (CD), 0-300 (0 to 3 Cs),
                        #            or 500-800 (D, followed by 0 to 3 Cs)
    (XC|XL|L?X{0,3})    # tens - 90 (XC), 40 (XL), 0-30 (0 to 3 Xs),
                        #        or 50-80 (L, followed by 0 to 3 Xs)
    (IX|IV|V?I{0,3})    # ones - 9 (IX), 4 (IV), 0-3 (0 to 3 Is),
                        #        or 5-8 (V, followed by 0 to 3 Is)
    $                   # end of string
    """
re.search(pattern, 'M', re.VERBOSE)

  • Use named capture groups, (?P<name>...), for all capture groups if there is more than one. Even with a single capture group, a named group is preferable.
  • regex101.com is a great site for debugging and checking regular expressions.
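These recommendations can be combined in one short sketch: a pattern compiled once, written with re.VERBOSE, and using named groups. The group names and the day-month-year order are assumptions made for this illustration.

```python
import re

# Compiled once and reused; whitespace and # comments are ignored in verbose mode
date_pattern = re.compile(r"""
    (?P<day>\d{2})      # two-digit day (assumed day-month-year order)
    -
    (?P<month>\d{2})    # two-digit month
    -
    (?P<year>\d{4})     # four-digit year
    """, re.VERBOSE)

match = date_pattern.search('Amit 34-3456 12-05-2007')

# Named groups can be read by name instead of by position
year = match.group('year')
parts = match.groupdict()

print(year)
print(parts)
```

Reading groups by name keeps the calling code readable and means it does not break when a group is added or reordered.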

When developing a regular expression, keep its complexity in mind; otherwise you risk repeating the mistake Cloudflare made relatively recently, when a single regular expression with heavy backtracking caused an outage.
