Using regular expression functions in PostgreSQL

Let’s start with a joke:

“Let’s say you have a problem and you want to solve it with regular expressions. Now you have two problems.”

Many articles and tutorials have already been written about how to use regular expressions. Another question is that regular expressions themselves are a parameter that is fed to the input of some function. It is the function that searches for the specified regular expression. We will talk about functions using regular expressions today. Perhaps I’ll bang my guide on the regular seasons themselves, but, as it was said in that very film: “We’ll definitely bang more than once, the whole world is in dust! But after…”

Problem case

So I ran into a filtering problem in Greenplum. It’s not that this is a big problem, it just wasn’t solved by the functions familiar to me from other DBMSs. More specifically, the TIN column in one of our data marts must contain either numeric values ​​(10/12 digits) or an XX placeholder and digits. However, over time, some incomprehensible text values ​​began to appear there due to the fact that this data comes to us from CRM, and they are entered into CRM itself without checking for the format. And when there is no TIN, then the credit manager begins to write there whatever his heart desires. Believe it or not, I even came across attributes in the form of poop emoji in such attributes. And such a serious office in appearance … And for a long time the business unit insisted that it should be so, but after the transition of DWH to a new architecture, these requirements became irrelevant. And then it was like in that old cartoon.

There is only one line in the telegram:

“Keep out strangers! Dot”

The essence of the problem

So, in Greenplum there was a need for filtering by template, the functions I was used to from other DBMSs in Greenplum did not work, and, plunging into the documentation, I realized that there was a lot, but really useful information, along with examples, could fit on a notebook sheet. Different DBMS have different data management functions. Moreover, they can differ even for one DBMS of different versions, so I will immediately make a reservation that these functions are relevant for PostgreSQL 9.4.26 (and higher), since it is on this version that the Greenplum DBMS kernel (6.22) is built, which we use.

Let me remind you that regular expressions are patterns, a set of rules by which we search for a substring in a string. All. Nothing complicated.

The main tasks that I had to deal with when working with regular expressions in databases:

  • Check if a string matches a pattern

  • Find a substring matching some pattern in a string

  • Split a string into an array by some pattern delimiter

  • Split a string into an array according to some pattern separator and pull this array into a column

  • Replace a substring in a string with a new substring

I understand that many of you here are looking for the answer to a specific question, so I won’t bore you. In short, in Greenplum/PostgreSQL we will use the following expressions.

like

With expression like it’s simple: it’s checking a string against a simplified pattern like ‘Ivan%’, where the string starts with the word “Ivan” and continues with any number of characters. However, there are cases when such simple patterns are not enough: we may be interested in a specific set of characters built according to special rules. Yes, we are looking for the simplest patterns using the Like expression, but it does not work with regular expressions. In MySql, there is, for example, an rlike expression for searching by a regular expression, but for Greenplum it is irrelevant.

similar to

If we want to use a regular expression for filtering, then we use the expression similar to.

select id, inn, date_from
from schema_dds.clients
where inn similar to '(XX)?[0-9]{10,12}'

id

inn

date_from

1337

1234567890

2022-05-01

1338

XX0987654321

2022-05-01

1339

111222333444

2022-05-01

substring

Function substring can be used not only as substring (attribute, start character index, how many characters to take)but also in the form substring(attribute, template). Then, as a result of the execution of this function, we will get a substring corresponding to this pattern.

select 
  client,
  info,
  substring(info, '[0-9\-\+ ]{10,}') as number
from schema_dds.clients

client

info

number

541

Loves cats.

603

High. 83912672222. House

83912672222

678

Number +79991112222

+79991112222

If a function with a regular expression does not find a substring and returns NULL, then one of two things: either there is no substring according to the desired pattern, or (most often) you wrote the regular expression incorrectly.

regexp_replace

Regexp_replace is used to take some string, find a substring in it according to the specified pattern and replace it with a new substring. All. And although no, not all. Suddenly, we have a substring included in the string several times?

An important nuance that solves this problem is the literal flags that can be passed as an optional argument. For what? Yes, they just change the behavior of the function. The two most common flags are ‘g’ and ‘i’. The ‘i’ flag means “ignore case”, and if the ‘g’ flag is not set, then the regular expression will only react to the first occurrence of the pattern in the string. If we want to work with all occurrences of a substring in a string, then we use the ‘g’ flag. As an example, a depersonalization task may arise when we need to hide some important information:

select regexp_replace('Комната 402. Самая жесть творилась в 402 комнате', '[0-9]{3}', 'XXX') as foobar
-- Результат: Комната XXX. Самая жесть творилась в 402 комнате 

select regexp_replace('Комната 402. Самая жесть творилась в 402 комнате', '[0-9]{3}', 'XXX', 'g') as foobar
-- Результат: Комната XXX. Самая жесть творилась в XXX комнате

regexp_split_to_array

It often happens that the data in the storefront does not even satisfy the first normal form.

Now all the analysts who learned SQL from courses on the Internet are like: “What is the first normal form? Hmmmm.”

That is, in one attribute, there may also be a string, which in essence is a set of some things. To put it wisely, the requirement of atomicity is violated. For example, often a full name is written in one attribute, although it would be more reasonable to separate them.

Regexp_split_to_array is used to split a string into an array of strings according to a pattern (usually a sequence of whitespace characters). As a result, we get an array, the elements of which can be accessed by index from 1 to n (in PostgreSQL, the numbering of array elements starts from 1, not 0).

Whenever possible, for the sake of demonstration, I will use subqueries so that the function call is on an attribute, and not just a string. So it’s much easier to understand.

select fio, 
  regexp_split_to_array(fio, ‘\s+’) as fio_arr,
  (regexp_split_to_array(fio, ‘\s+’))[1] as f,
  (regexp_split_to_array(fio, ‘\s+’))[2] as i,
  (regexp_split_to_array(fio, ‘\s+’))[3] as o
from 
  (select ‘Сидоров Иван  Петрович’ as fio) as foo

Note that in the full name between the name and patronymic, I deliberately indicated several spaces. This happens when the data is entered manually, and someone inadvertently added two spaces instead of one. Therefore, here I inserted a fairly simple regular expression \s+, which denotes any sequence of whitespace characters (such as spaces, tabs, or line breaks). Thus, the splitting of our string will be at any number of spaces. The query result will look like this:

fio

fio_arr

f

i

o

Sidorov Ivan Petrovich

{Sidorov, Ivan, Petrovich}

Sidorov

Ivan

Petrovich

regexp_split_to_table

The function works much more interesting. regexp_split_to_table. It works similar to the regexp_split_to_array function, with the only difference being that it splits a string into an array and pulls that array into a column. So, if we have an attribute divided into three parts by a regular expression, then the number of rows in the resulting table will increase by 3 times. Consider the following example. We have a showcase of fruit sellers, in which we again see a violation of the atomicity property.

seller_id

seller

fruits

1

Almazbek

pears apples watermelons

2

Vitaly

apples bananas guava

But we want to work with each fruit separately. Then we use the following query:

select 
  seller_id,
  seller,
  regexp_split_to_table(fruits, '\s+') as fruit
from the_market
order by 1,2

As a result, we have a table in which, in this particular case, the number of rows has increased by 3 times:

seller_id

seller

fruits

1

Almazbek

pears

1

Almazbek

apples

1

Almazbek

watermelons

2

Vitaly

apples

2

Vitaly

bananas

2

Vitaly

guava

Summing up

  • similar to most commonly used in filtering to find out if a string matches a pattern

  • substring used to extract from a string a substring matching a pattern

  • regexp_split_to_array we use when we want to split a string into an array at a certain delimiter

  • regexp_split_to_table we use when we need not just an array, but we want to pull it into a column

  • regexp_replace is needed in order to replace a substring in a string that matches a certain pattern with a new substring.

And there is nothing complicated about it. In 99% of cases, working with regular expressions relies on these functions. And I sincerely wish you that regular expressions in Greenplum do not turn into problem number 2 for you. Take action!

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *