Regular Expressions in SQL

REGEXP_LIKE — checks whether a string matches a given pattern.
REGEXP_REPLACE — replaces the part of the string that matches the pattern.
REGEXP_SUBSTR — extracts a substring based on a pattern.
REGEXP_INSTR — finds the position of the pattern occurrence.
REGEXP_COUNT — counts the number of occurrences of the pattern.

Let's start right away with the practical application of these wonderful functions.

Practical scenarios

Validation of email addresses

Let's say there is a table users with field emailand there is a suspicion that not all addresses were entered correctly. You need to select all records where email does not follow the standard format.

We use REGEXP_LIKE to check the email format:

SELECT user_id, email
FROM users
WHERE NOT REGEXP_LIKE(email, '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$');

Parsing:

^[A-Za-z0-9._%+-]+ — the beginning of the line, valid characters before @.
@[A-Za-z0-9.-]+ – symbol @ and domain part.
\.[A-Za-z]{2,}$ — a dot and a top-level domain of at least 2 letters.
NOT REGEXP_LIKE — select records that do not match the template.

As a result, we will receive a list of users with incorrect emails, to whom we can send a notification about the need to update their contact information.

Standardization of telephone numbers

In the table contacts Phone numbers are stored in different formats: with spaces, brackets, dashes and even without a country code. It is necessary to convert all numbers to a single format +7XXXXXXXXXX.

We use REGEXP_REPLACE to remove extra characters and add the country code:

UPDATE contacts
SET phone_number="+7" || REGEXP_REPLACE(phone_number, '\D', '')
WHERE REGEXP_LIKE(phone_number, '^\+?7?\d{10}$');

Parsing:

REGEXP_REPLACE(phone_number, '\D', '') — remove all non-numeric characters.
'+7' || — add the country code at the beginning.
^\+?7?\d{10}$ — select numbers that can already begin with +7 or 7to avoid duplicate country codes.

The end result will be uniform phone numbers that are easier to work with and can be used for automatic dialing or sending SMS.

Search for specific patterns in logs

There is a table system_logs with field log_entryand need to find all records where an access error occurred for users with a certain name pattern, for example, starting with admin_ and ending with numbers.

We use REGEXP_LIKE to search for matching entries:

SELECT log_id, log_entry
FROM system_logs
WHERE REGEXP_LIKE(log_entry, 'Access denied for user \'admin_\w*\d+\'', 'i');

Parsing:

Access denied for user \'admin_\w*\d+\' — we look for lines with a message about access denial for a user whose name matches the pattern.
\' — escape single quotes.
\w* – any number of alphabetic characters.
\d+ – one or more digits.
'i' — case insensitive flag.

The output is a list of logs containing access attempts by suspicious users.

Masking of personal data

For security reasons, you need to mask credit card numbers in the table paymentsleaving only the last 4 digits visible.

Using the function REGEXP_REPLACE to replace part of a string with characters *:

SELECT
    payment_id,
    REGEXP_REPLACE(card_number, '\d{12}(\d{4})', '************\1') AS masked_card_number
FROM payments;

Parsing:

\d{12}(\d{4}) — we look for the first 12 digits and grab the last 4 digits.
'************\1' — replace the first 12 digits with *and leave the last 4 digits (link to the first exciting group \1).

Let's get the fieldmasked_card_numberwhere the card numbers look like ************1234. This is important to meet security requirements and maintain customer privacy.

Extracting hashtags from messages

In the table social_posts there is a field contentcontaining the text of messages with hashtags. We need to extract all the unique hashtags from these posts.

We use a combination of functions REGEXP_SUBSTR and a recursive query:

WITH hashtags AS (
  SELECT
    post_id,
    REGEXP_SUBSTR(content, '#\w+', 1, LEVEL) AS hashtag
  FROM social_posts
  CONNECT BY REGEXP_SUBSTR(content, '#\w+', 1, LEVEL) IS NOT NULL
  AND PRIOR post_id = post_id
  AND PRIOR SYS_GUID() IS NOT NULL
)
SELECT DISTINCT hashtag
FROM hashtags
WHERE hashtag IS NOT NULL;

Parsing:

#\w+ – looking for words starting with #.
LEVEL — used to extract each subsequent occurrence.
CONNECT BY — we recursively go through each message and extract all the hashtags.
DISTINCT — we leave only unique hashtags.

After execution, we will get a list of all unique hashtags used in messages. This can be useful for analytics, trending topics, or recommendations.

Splitting a CSV string into elements

In the table orders there is a field product_idscontaining product IDs in CSV format (for example, 1,2,3,4). You need to create a record for each product in the order.

We use REGEXP_SUBSTR along with recursion:

SELECT
    order_id,
    TO_NUMBER(REGEXP_SUBSTR(product_ids, '[^,]+', 1, LEVEL)) AS product_id
FROM orders
CONNECT BY REGEXP_SUBSTR(product_ids, '[^,]+', 1, LEVEL) IS NOT NULL
AND PRIOR order_id = order_id
AND PRIOR SYS_GUID() IS NOT NULL;

Parsing:

[^,]+ — capture sequences of characters that are not commas.
LEVEL — we use it to go through each CSV element.
TO_NUMBER — convert the string to a number if product_id numerical.

The output is a list of orders, where each line corresponds to one product from the order.

We continue our fascinating journey into the world of regular expressions in SQL! If you're still with me, then neither the mysterious symbols nor the tricky functions scared you. Well, it's time to dive deeper and sort out the nuances that textbooks are often silent about.

The nuances of using regular expressions in different DBMSs

Oracle, MySQL, PostgreSQL

Different database management systems may implement support for regular expressions differently. Let's go over the main differences.

Oracle

Oracle has a full set of functions for working with regular expressions:

REGEXP_LIKE
REGEXP_REPLACE
REGEXP_SUBSTR
REGEXP_INSTR
REGEXP_COUNT

MySQL

In MySQL, support for regular expressions depends on the version:

Before version 8.0 the operator was used REGEXP or RLIKE to check compliance.
Features added in version 8.0 and higher REGEXP_LIKE, REGEXP_REPLACE, REGEXP_INSTR, REGEXP_SUBSTR.

PostgreSQL

PostgreSQL is famous for its support for regular expressions:

Operator ~ And ~* for matching (case sensitive and insensitive).
Functions regexp_replace, regexp_matches, regexp_split_to_table, regexp_split_to_array.

The syntax is based on POSIX extensions (like mysql), but with additional. possibilities.

Syntax nuances

Please note the differences in character escaping and special usage. sequences:

In Oracle you need to double backslashes \\ in some cases.
In MySQL and PostgreSQL, one backslash is enough \.

Example:

In Oracle:

SELECT REGEXP_REPLACE('ABC123', '[A-Z]+', '') FROM dual;
-- Результат: 123

In PostgreSQL:

SELECT regexp_replace('ABC123', '[A-Z]+', '');
-- Результат: 123

Various techniques

Backlinks allow you to reference captured groups within a regular expression. Let's say you need to find lines where there are repeating sequences of characters:

SELECT column_name
FROM table_name
WHERE REGEXP_LIKE(column_name, '(.*)\1');

(.*)\1 — we capture any sequence of characters and look for its repetition immediately after.

Some DBMSs allow you to use conditionals within regular expressions.

Example (Oracle):

SELECT REGEXP_SUBSTR('abc123xyz', '(abc|123|xyz)', 1, 2) FROM dual;
-- Результат: 123

(abc|123|xyz) – use the operator | to indicate alternatives. Parameter 2 at the end of the function indicates that we want to get the second match.

By default, quantifiers in regular expressions are “greedy”, i.e. they capture the maximum possible number of characters.

Example:

SELECT REGEXP_SUBSTR('abcccccd', 'abc*') FROM dual;
-- Результат: abccccc

If we want to make the quantifier “lazy”, then we add ? after it:

SELECT REGEXP_SUBSTR('abcccccd', 'abc*?') FROM dual;
-- Результат: abc

Parsing:

c* – greedy quantifier, captures everything c.
c*? – lazy quantifier, captures the minimum amount c.

Performance

Regular expressions are a powerful tool, but resource-intensive.

Adviсe:

Indexes do not work with regular expressions. If possible, use additional conditions in WHEREwhich can use indexes.
Limit your sample. Use additional filters to reduce the number of rows processed.
Cache results. If the regular expression is used frequently and the results rarely change, consider caching.

Conclusion

Regular expressions allow you to perform complex operations that would otherwise require additional steps or sometimes scripts.

In conclusion, I would like to recommend a webinar where you will learn how specialists with different roles use databases, how their approaches to SQL queries differ and overlap. In practice, you will try to write queries from different positions and figure out how to effectively work together to achieve common goals. This lesson will help you understand how to create effective collaboration between analysts and developers. Register.

Regular Expressions in SQL

Practical scenarios

Validation of email addresses

Standardization of telephone numbers

Search for specific patterns in logs

Masking of personal data

Extracting hashtags from messages

Splitting a CSV string into elements

The nuances of using regular expressions in different DBMSs

Oracle, MySQL, PostgreSQL

Syntax nuances

Various techniques

Performance

Conclusion

Roadmap for Data Engineer in 2023

Python logger – log output on QTextWidget (PyQt6)

one day in the life of engineers in Dubrovka

my experience in repairing a Powerline adapter and laying out a network

Useful things for a pentester

cutting logs at ALROSA

Leave a Reply Cancel reply

Practical scenarios

Validation of email addresses

Standardization of telephone numbers

Search for specific patterns in logs

Masking of personal data

Extracting hashtags from messages

Splitting a CSV string into elements

The nuances of using regular expressions in different DBMSs

Oracle, MySQL, PostgreSQL

Syntax nuances

Various techniques

Performance

Conclusion

Similar Posts

Leave a Reply Cancel reply