Regular Expressions in SQL
REGEXP_LIKE
— checks whether a string matches a given pattern.REGEXP_REPLACE
— replaces the part of the string that matches the pattern.REGEXP_SUBSTR
— extracts a substring based on a pattern.REGEXP_INSTR
— finds the position of the pattern occurrence.REGEXP_COUNT
— counts the number of occurrences of the pattern.
Let's start right away with the practical application of these wonderful functions.
Practical scenarios
Validation of email addresses
Let's say there is a table users
with field email
and there is a suspicion that not all addresses were entered correctly. You need to select all records where email
does not follow the standard format.
We use REGEXP_LIKE
to check the email format:
SELECT user_id, email
FROM users
WHERE NOT REGEXP_LIKE(email, '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$');
Parsing:
^[A-Za-z0-9._%+-]+
— the beginning of the line, valid characters before@
.@[A-Za-z0-9.-]+
– symbol@
and domain part.\.[A-Za-z]{2,}$
— a dot and a top-level domain of at least 2 letters.NOT REGEXP_LIKE
— select records that do not match the template.
As a result, we will receive a list of users with incorrect emails, to whom we can send a notification about the need to update their contact information.
Standardization of telephone numbers
In the table contacts
Phone numbers are stored in different formats: with spaces, brackets, dashes and even without a country code. It is necessary to convert all numbers to a single format +7XXXXXXXXXX
.
We use REGEXP_REPLACE
to remove extra characters and add the country code:
UPDATE contacts
SET phone_number="+7" || REGEXP_REPLACE(phone_number, '\D', '')
WHERE REGEXP_LIKE(phone_number, '^\+?7?\d{10}$');
Parsing:
REGEXP_REPLACE(phone_number, '\D', '')
— remove all non-numeric characters.'+7' ||
— add the country code at the beginning.^\+?7?\d{10}$
— select numbers that can already begin with+7
or7
to avoid duplicate country codes.
The end result will be uniform phone numbers that are easier to work with and can be used for automatic dialing or sending SMS.
Search for specific patterns in logs
There is a table system_logs
with field log_entry
and need to find all records where an access error occurred for users with a certain name pattern, for example, starting with admin_
and ending with numbers.
We use REGEXP_LIKE
to search for matching entries:
SELECT log_id, log_entry
FROM system_logs
WHERE REGEXP_LIKE(log_entry, 'Access denied for user \'admin_\w*\d+\'', 'i');
Parsing:
Access denied for user \'admin_\w*\d+\'
— we look for lines with a message about access denial for a user whose name matches the pattern.\'
— escape single quotes.\w*
– any number of alphabetic characters.\d+
– one or more digits.'i'
— case insensitive flag.
The output is a list of logs containing access attempts by suspicious users.
Masking of personal data
For security reasons, you need to mask credit card numbers in the table payments
leaving only the last 4 digits visible.
Using the function REGEXP_REPLACE
to replace part of a string with characters *
:
SELECT
payment_id,
REGEXP_REPLACE(card_number, '\d{12}(\d{4})', '************\1') AS masked_card_number
FROM payments;
Parsing:
\d{12}(\d{4})
— we look for the first 12 digits and grab the last 4 digits.'************\1'
— replace the first 12 digits with*
and leave the last 4 digits (link to the first exciting group\1
).
Let's get the fieldmasked_card_number
where the card numbers look like ************1234
. This is important to meet security requirements and maintain customer privacy.
Extracting hashtags from messages
In the table social_posts
there is a field content
containing the text of messages with hashtags. We need to extract all the unique hashtags from these posts.
We use a combination of functions REGEXP_SUBSTR
and a recursive query:
WITH hashtags AS (
SELECT
post_id,
REGEXP_SUBSTR(content, '#\w+', 1, LEVEL) AS hashtag
FROM social_posts
CONNECT BY REGEXP_SUBSTR(content, '#\w+', 1, LEVEL) IS NOT NULL
AND PRIOR post_id = post_id
AND PRIOR SYS_GUID() IS NOT NULL
)
SELECT DISTINCT hashtag
FROM hashtags
WHERE hashtag IS NOT NULL;
Parsing:
#\w+
– looking for words starting with#
.LEVEL
— used to extract each subsequent occurrence.CONNECT BY
— we recursively go through each message and extract all the hashtags.DISTINCT
— we leave only unique hashtags.
After execution, we will get a list of all unique hashtags used in messages. This can be useful for analytics, trending topics, or recommendations.
Splitting a CSV string into elements
In the table orders
there is a field product_ids
containing product IDs in CSV format (for example, 1,2,3,4
). You need to create a record for each product in the order.
We use REGEXP_SUBSTR
along with recursion:
SELECT
order_id,
TO_NUMBER(REGEXP_SUBSTR(product_ids, '[^,]+', 1, LEVEL)) AS product_id
FROM orders
CONNECT BY REGEXP_SUBSTR(product_ids, '[^,]+', 1, LEVEL) IS NOT NULL
AND PRIOR order_id = order_id
AND PRIOR SYS_GUID() IS NOT NULL;
Parsing:
[^,]+
— capture sequences of characters that are not commas.LEVEL
— we use it to go through each CSV element.TO_NUMBER
— convert the string to a number ifproduct_id
numerical.
The output is a list of orders, where each line corresponds to one product from the order.
We continue our fascinating journey into the world of regular expressions in SQL! If you're still with me, then neither the mysterious symbols nor the tricky functions scared you. Well, it's time to dive deeper and sort out the nuances that textbooks are often silent about.
The nuances of using regular expressions in different DBMSs
Oracle, MySQL, PostgreSQL
Different database management systems may implement support for regular expressions differently. Let's go over the main differences.
Oracle
Oracle has a full set of functions for working with regular expressions:
REGEXP_LIKE
REGEXP_REPLACE
REGEXP_SUBSTR
REGEXP_INSTR
REGEXP_COUNT
MySQL
In MySQL, support for regular expressions depends on the version:
Before version 8.0 the operator was used
REGEXP
orRLIKE
to check compliance.Features added in version 8.0 and higher
REGEXP_LIKE
,REGEXP_REPLACE
,REGEXP_INSTR
,REGEXP_SUBSTR
.
PostgreSQL
PostgreSQL is famous for its support for regular expressions:
Operator
~
And~*
for matching (case sensitive and insensitive).Functions
regexp_replace
,regexp_matches
,regexp_split_to_table
,regexp_split_to_array
.
The syntax is based on POSIX extensions (like mysql), but with additional. possibilities.
Syntax nuances
Please note the differences in character escaping and special usage. sequences:
In Oracle you need to double backslashes
\\
in some cases.In MySQL and PostgreSQL, one backslash is enough
\
.
Example:
In Oracle:
SELECT REGEXP_REPLACE('ABC123', '[A-Z]+', '') FROM dual;
-- Результат: 123
In PostgreSQL:
SELECT regexp_replace('ABC123', '[A-Z]+', '');
-- Результат: 123
Various techniques
Backlinks allow you to reference captured groups within a regular expression. Let's say you need to find lines where there are repeating sequences of characters:
SELECT column_name
FROM table_name
WHERE REGEXP_LIKE(column_name, '(.*)\1');
(.*)\1
— we capture any sequence of characters and look for its repetition immediately after.
Some DBMSs allow you to use conditionals within regular expressions.
Example (Oracle):
SELECT REGEXP_SUBSTR('abc123xyz', '(abc|123|xyz)', 1, 2) FROM dual;
-- Результат: 123
(abc|123|xyz)
– use the operator |
to indicate alternatives. Parameter 2
at the end of the function indicates that we want to get the second match.
By default, quantifiers in regular expressions are “greedy”, i.e. they capture the maximum possible number of characters.
Example:
SELECT REGEXP_SUBSTR('abcccccd', 'abc*') FROM dual;
-- Результат: abccccc
If we want to make the quantifier “lazy”, then we add ?
after it:
SELECT REGEXP_SUBSTR('abcccccd', 'abc*?') FROM dual;
-- Результат: abc
Parsing:
c*
– greedy quantifier, captures everythingc
.c*?
– lazy quantifier, captures the minimum amountc
.
Performance
Regular expressions are a powerful tool, but resource-intensive.
Adviсe:
Indexes do not work with regular expressions. If possible, use additional conditions in
WHERE
which can use indexes.Limit your sample. Use additional filters to reduce the number of rows processed.
Cache results. If the regular expression is used frequently and the results rarely change, consider caching.
Conclusion
Regular expressions allow you to perform complex operations that would otherwise require additional steps or sometimes scripts.
In conclusion, I would like to recommend a webinar where you will learn how specialists with different roles use databases, how their approaches to SQL queries differ and overlap. In practice, you will try to write queries from different positions and figure out how to effectively work together to achieve common goals. This lesson will help you understand how to create effective collaboration between analysts and developers. Register.