Regular expressions in simple words. Part 1

Developers fall into two types: those who already understand regular expressions and sometimes solve complex problems in one line, and those who are still afraid of them and avoid them in every possible way. This article is specifically for the latter, to make it easier for them to become the former. It will either help overcome “regexpophobia” or make it worse. In any case, read on.

Use navigation if you don't want to read the entire text:
Introduction
Tools
Hello, world
Special characters
Looking into the abyss

Introduction


A regular expression describes a pattern that text strings can either match or not. Main applications: search, validation, parsing, and intimidation.

It is important to remember that although regular expressions are a powerful tool, they are not a silver bullet, and they are not Turing complete. Consequently, not every problem can be solved with them. For example, an answer on Stack Overflow explains why you should never parse HTML using regular expressions.

On the other hand, sometimes regular expressions can solve problems for which they were never intended. For example, this page describes an extremely inefficient but working way to check a number for primality.

Writing out “regular expression” every time is tiresome. The terms regex and regexp have taken root in English slang. However, both sound like the names of anime villains, so from time to time in this article I will simply say “regular”.

According to legend, the Russian name for regular expressions should really have meant “proper expressions,” and they ended up being called “regular” only because the translators were too lazy to translate the word.

By the way, there is an opinion that the term regex is not always synonymous with regular expression. But sometimes it is. The details are covered in a short article.

Tools


Of course, of all their tasks, regular expressions cope best with intimidation. However, thanks to the work of leading Egyptologists, services have emerged that help decipher these mysterious writings. I most often use two: one is beautiful, the other is useful. In fact, both are beautiful and useful.

Regexper turns a regular expression of almost any complexity into a beautifully drawn diagram. For example, here is a regular expression for real numbers, invented by the ancient Sumerians:

[+-]?(\d*\.)?\d+

And suddenly it makes sense in the form of an intuitive infographic:

If you come across a strange regular expression somewhere and want to quickly understand what it does, feel free to throw it into Regexper.

True, for particularly complicated cases the diagram will not be simple. For example, try sending a full regular expression for email addresses to the service (you will first need to write it on one line). I didn't attach the image here because it is 24,621 pixels wide.

The second extremely useful resource is regex101. It helps you see how a regular expression works. And if it doesn’t work, or doesn’t work right, you can figure out what’s wrong. The service even has a step-by-step debugger!

Hello, world

Although the regular expression language is not a programming language, we will still follow the tradition. So, open regex101 and enter the regular expression:

hello, world

Yes, yes, a regular expression can look like human text, and not just like a Klingon obituary. The regular expression “hello, world” matches exactly one string: “hello, world”.

In (some|many) programming languages, you can pass options to a regular expression. If the Case Insensitive option is set (usually denoted as i), then any letter case will do: “Hello, world”, “Hello, World”, “HELLO, WORLD”, “hElLo, wOrLd”, etc.
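For the curious, here is roughly how such an option looks in code. This is only a sketch in Python (my choice for illustration, the article does not depend on any particular language), where the case-insensitive option is spelled re.IGNORECASE. The names PATTERN, samples, strict and relaxed are made up for the example.

```python
import re

PATTERN = "hello, world"
samples = ["hello, world", "Hello, World", "HELLO, WORLD", "hElLo, wOrLd"]

# Without options, only the exact lowercase string matches.
strict = [s for s in samples if re.fullmatch(PATTERN, s)]

# With the case-insensitive option, every casing matches.
relaxed = [s for s in samples if re.fullmatch(PATTERN, s, flags=re.IGNORECASE)]

print(strict)   # ['hello, world']
print(relaxed)  # all four samples
```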

In regex101, to do this, just click on the letters with options to the right of the input line.

It's no secret that the word “hello” contains two letters “l”. Let's write this interesting fact using regular expression language:

hel{2}o, world

This is the same expression as the previous one, but now it slightly awakens Cthulhu. Note that {2} applies only to the letter “l” right before it, not to all the letters that come before it.

Why do we need this? Wouldn't it be easier to just leave “ll”? Of course, in this case it is simpler. However, in real practice there are more complex situations. Firstly, a character may be repeated not twice but, say, a thousand times. Secondly, the operation can be applied not only to a single character but also to a group, but more on that another time.

In addition, you can specify not only a specific number of repetitions, but also limits: minimum and maximum. For example:

hel{2,5}o, world

This notation means that the letter “l” can be repeated two to five times.

And if the second number is omitted but the comma is kept, the upper limit becomes “infinity.”

For example:

hel{2,}o, world

This notation means that the word can contain any number of letters “l”, but no fewer than two.
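The three forms of curly braces can be checked side by side. Below is a small sketch using Python's built-in re module; the helper allowed_counts is a made-up name that simply tries strings with zero to seven letters “l” and reports which counts the pattern accepts.

```python
import re

def allowed_counts(pattern, max_count=7):
    """Return how many letters "l" the pattern accepts, trying 0..max_count."""
    return [
        n for n in range(max_count + 1)
        if re.fullmatch(pattern, "he" + "l" * n + "o, world")
    ]

print(allowed_counts("hel{2}o, world"))    # [2]
print(allowed_counts("hel{2,5}o, world"))  # [2, 3, 4, 5]
print(allowed_counts("hel{2,}o, world"))   # [2, 3, 4, 5, 6, 7]
```

The last list stops at seven only because the helper stops trying there; the pattern itself has no upper limit.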

Let's leave the letter “l” alone for now. In our example, this flexibility will really come in handy when we work with spaces. Let's say we need to handle the cases when there is not one space between the words, but several:

hello, {1,}world

Urgent edit from the customer: there may be exclamation marks at the end of the phrase, maybe even three at once. Or maybe infinitely many, or maybe none. So let's write it down:

hello, {1,}world!{0,}

Pay attention to the last two lines. If a string contains characters that do not fit the pattern, the string as a whole is not rejected: the part that fits is still found, and only the rest is left out. Sometimes that is great, and sometimes it is not what we need at all. We'll discuss what to do about this later.

Since we're talking about punctuation marks: what if someone forgets to put a comma, but we still want to find such strings? This means that the comma must occur either zero times or once:
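The difference between “the pattern occurs somewhere in the string” and “the whole string fits the pattern” can be seen in a short sketch. It assumes Python's re module, where re.search looks for the pattern anywhere and re.fullmatch demands that the entire string fit; other tools spell this differently, and the sample text is invented.

```python
import re

PATTERN = "hello, {1,}world!{0,}"
text = "hello,   world!!! and goodbye"

found = re.search(PATTERN, text)     # looks for the pattern anywhere in the text
whole = re.fullmatch(PATTERN, text)  # requires the entire text to fit the pattern

print(found.group(0))  # hello,   world!!!
print(whole)           # None
```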

hello,{0,1} {1,}world!{0,}

Now it looks like Druid accounting, as befits a good regular expression!{3,}

At this point it is worth looking at it in Regexper:

  1. We take the word “hello”.
  2. We either take the comma or bypass it.
  3. We take exactly one space, then loop around it any number of times.
  4. We take the word “world”.
  5. We either bypass the exclamation mark or loop around it any number of times.

Note that a string with no space at all does not match the pattern. The point is that we specified that there must be at least one space. We'll look at this case a little later.

Special characters


To avoid cluttering a regular expression with numbers, you can clutter it with special characters instead. The most common quantifiers have a shorthand notation:

  {0,1} is written as ?
  {1,} is written as +
  {0,} is written as *

Let's rewrite the expression in a shorter form:

hello,? +world!*
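To make sure the short form really behaves like the long one, the two can be compared on a handful of strings. This is a sketch in Python's re module; the sample strings and variable names are invented for the check.

```python
import re

LONG_FORM = "hello,{0,1} {1,}world!{0,}"
SHORT_FORM = "hello,? +world!*"

samples = [
    "hello, world",
    "hello world",
    "hello,    world!!!",
    "helloworld",      # no space at all: rejected by both
    "hello,, world",   # two commas: rejected by both
]

long_results = [bool(re.fullmatch(LONG_FORM, s)) for s in samples]
short_results = [bool(re.fullmatch(SHORT_FORM, s)) for s in samples]

print(long_results == short_results)  # True
print(short_results)                  # [True, True, True, False, False]
```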

Now the expression looks almost like the original “hello, world”, but how much more functional it has become!

By the way, the regular expression that matches both villain names, “regex” and “regexp”, is regexp? (the question mark is part of the expression).

But what if you need to treat special characters as ordinary ones: a plus as a plus, a question mark as a question mark? We can escape them with an escape character. A backslash before a special character means that it loses its superpowers and becomes ordinary. A kind of kryptonite in the world of regular expressions.

Example:

hello\?

This matches the string “hello?”, and not “hello\” or “hello”.

A backslash can also escape itself. For this to work, it needs to be written twice: \\

In some programming languages, the backslash is already an escape character inside string literals, so to get a regular expression escape character you need to write \\. And to get a literal backslash, you need to write \\\\. Yes, for a backslash to remain itself, it must be repeated four times. Tell this to people who don’t understand why IT people get paid so much.
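Python is one of those languages, so it makes a convenient illustration. The sketch below assumes Python's re module and its raw string literals (r"..."), which pass backslashes through untouched; the variable names are invented.

```python
import re

# A raw string (r"...") passes backslashes to the regex engine untouched.
question_raw = r"hello\?"
# In an ordinary string the backslash itself has to be escaped first.
question_plain = "hello\\?"

# Matching one literal backslash: two in a raw string, four in an ordinary one.
backslash_raw = r"\\"
backslash_plain = "\\\\"

print(question_raw == question_plain)               # True
print(backslash_raw == backslash_plain)             # True
print(bool(re.fullmatch(question_raw, "hello?")))   # True
print(bool(re.fullmatch(question_raw, "hello")))    # False
print(bool(re.fullmatch(backslash_plain, "\\")))    # True
```

Raw strings are the usual way to keep the backslash count down to what the regular expression itself needs.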

Let's go back to “hello, world”. Someone might accidentally write “hallo” instead of “hello.” Let's process this case too.

To select one character from several, list them inside square brackets. The expression h[ea]llo matches both “hello” and “hallo”.

Since we're talking about other languages, let's think about Russian. How can we make both “hello” and “привет” fit?

If you write [helloпривет], this expression will simply match one character from the set h, e, l, o, п, р, и, в, е, т. That is, a single letter matches, but the whole word does not. What should we do?

There is a special symbol for this: |. Let's write hello|привет.
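The difference between square brackets and the vertical bar is easy to see on a few strings. A small sketch, assuming Python's re module, with invented sample data:

```python
import re

samples = ["hello", "hallo", "привет", "h", "п"]

# Square brackets pick ONE character out of the listed ones.
one_of_two_letters = [s for s in samples if re.fullmatch("h[ea]llo", s)]
single_character = [s for s in samples if re.fullmatch("[helloпривет]", s)]

# The vertical bar picks one whole alternative.
either_word = [s for s in samples if re.fullmatch("hello|привет", s)]

print(one_of_two_letters)  # ['hello', 'hallo']
print(single_character)    # ['h', 'п']
print(either_word)         # ['hello', 'привет']
```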

Wait, can you use the Cyrillic alphabet in regular expressions? Of course! Even emoji. It does depend on the application or programming language in which you use them, but there is no outright prohibition.

If you write hello|привет, world, then the word “world” will stay with the right-hand side only. That is, we will get either “hello” or “привет, world”.

Is it possible to somehow keep “world” common to both options (even though it will look strange in combination with “привет”)? You can use parentheses to limit the scope of the vertical bar. In general, this is far from the only, and not even the main, job of parentheses, but more on that later. In the meantime:

(hello|привет), world
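Here is how the two variants, with and without parentheses, behave on the same strings. Again, this is just a sketch with Python's re module and made-up variable names.

```python
import re

WITHOUT_GROUP = "hello|привет, world"
WITH_GROUP = "(hello|привет), world"

samples = ["hello", "hello, world", "привет, world"]

without_group = [s for s in samples if re.fullmatch(WITHOUT_GROUP, s)]
with_group = [s for s in samples if re.fullmatch(WITH_GROUP, s)]

print(without_group)  # ['hello', 'привет, world']
print(with_group)     # ['hello, world', 'привет, world']
```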

You can write as many options as you like:

(hello|привет|bonjour), world

Note that you can put spaces around the vertical bar to make it look nice:

(hello | привет | bonjour), world

But in this case, the expression will also change, since spaces are not just for beauty, but are part of the expression. Therefore, do not insert spaces unless you are sure they are needed.

By the way, about the spaces. There are people who do not put a space after the comma, but there are those who put it in front of it. As Sheldon Cooper would say, there isn't enough chamomile tea in the world to calm the rage in my chest. But what can you do? Such cases may also need to be handled.

Let's try this option:

hello *,? *world

Works.

But there are two problems. The first is the string “helloworld”. Since the spaces carry an asterisk and the comma carries a question mark, even strings with no separator at all fit the pattern. In a training example there is nothing wrong with that, but in real practice this is a common problem: how do you make sure that out of many optional parts at least one is still present? This can be done in different ways: square and round.

Square method

hello *[, ] *world

After “hello” there may be spaces (possibly none), then there must definitely be some kind of separator (comma or space), then another pack of spaces, possibly empty.

Round method

hello( *, *| +)world

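Here we have two options to choose from:

  1. A comma, possibly surrounded by any number of spaces on either side.
  2. One or more spaces, with no comma.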

Which method is preferable depends on the specific case.
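For this particular example, both methods accept and reject the same strings, which can be checked with a quick sketch (Python's re module, invented sample strings):

```python
import re

SQUARE = "hello *[, ] *world"
ROUND = "hello( *, *| +)world"

samples = [
    "hello, world",
    "hello world",
    "hello ,world",
    "hello,world",
    "hello   ,   world",
    "helloworld",      # no separator at all
    "hello,,world",    # two commas
]

square_results = [bool(re.fullmatch(SQUARE, s)) for s in samples]
round_results = [bool(re.fullmatch(ROUND, s)) for s in samples]

print(square_results)                   # [True, True, True, True, True, False, False]
print(square_results == round_results)  # True
```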

The first problem has been solved, but there is a second one: in the expression itself, the space is not visible enough. It is not immediately obvious that the asterisk refers specifically to it.

Remember when we said that the backslash turns special characters into mere mortals? Well, it works the other way around too. If you put a backslash in front of a simple Muggle character, it will gain superpowers. Of course, this doesn't work for everyone, only for those who have enough midi-chlorians.

For example, the great thing about the letter s is that the word “space” starts with it, so \s is a special character that denotes a space. And not just a space, but everything space-like in general, including tabs and line breaks.

Let's rewrite our expression:

hello(\s*,\s*|\s+)world
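A quick check of what \s accepts, sketched with Python's re module (the sample strings are invented, and other engines may treat some exotic whitespace characters differently):

```python
import re

PATTERN = r"hello(\s*,\s*|\s+)world"

samples = [
    "hello, world",
    "hello\tworld",      # a tab instead of a space
    "hello ,\n world",   # a line break after the comma
    "helloworld",        # still no separator
]

results = [bool(re.fullmatch(PATTERN, s)) for s in samples]
print(results)  # [True, True, True, False]
```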

It is already beginning to resemble the ancient Egyptian Glagolitic alphabet, as befits a worthy regular expression. But at the same time, we still roughly understand what it does! This is the main secret of regular expressions: whoever writes them understands them.

Looking into the abyss


Now let's put it all together:

(h[ea]llo|привет|bonjour)(\s*,\s*|\s+)world!*

The level of occultism is now acceptable.
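To see the final expression at work, here is a small sketch that wraps it in a function. It assumes Python's re module; is_greeting and the sample strings are made up for the demonstration.

```python
import re

GREETING = re.compile(r"(h[ea]llo|привет|bonjour)(\s*,\s*|\s+)world!*")

def is_greeting(text):
    """Return True if the whole text fits the final pattern."""
    return GREETING.fullmatch(text) is not None

accepted = ["hello, world", "hallo world!!!", "привет ,world", "bonjour,world!"]
rejected = ["helloworld", "hullo, world", "hello, world?"]

print(all(is_greeting(s) for s in accepted))      # True
print(any(is_greeting(s) for s in rejected))      # False
```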

The article contains anywhere from zero to infinity inaccuracies and simplifications. Some of them were included intentionally, to preserve the entertaining spirit of the article and not scare you away with excessive pedantry and academicism.

In the next publication we will look at the various features of regular expressions in more detail.
