strtree - regular expression based string classifier

Do you want to find short regular expressions that completely and accurately separate one class of string from another? This article is for you. We will talk about the task of classifying strings using automatically detected patterns, and at the end I will provide an example of such a procedure with Python code. We will use a small open-source library strtree – the name contains the word tree, since the regular expressions in it are arranged in the form of a binary tree, each vertex of which has only one outgoing edge.

Why should we use regular expressions to classify strings at all? This problem may occur under the following conditions:

We are dealing with short strings consisting of one or several words – tokenizing such strings to apply standard algorithms often does not make sense;
The order of the characters in a string is often critical;
We assume that the strings of one of the classes have something in common that also distinguishes them from the other class;
Rows are mostly unique.

Such data can be, for example, IP addresses, models of smartphones, computers, parts from them, user names, email addresses, etc. In these cases, the usual bag-of-words with further classification of tabular data is poorly applicable: bag-of-words erases the difference between strings with the same characters, but in a different order. For us, “ABC-1” and “1-CBA” are two fundamentally different strings that describe two different objects (for example, two different devices). Bag-of-words works well, for example, for sentiment analysis, but when classifying strings that are sensitive to shuffling, it performs worse. Well-chosen regular expressions, on the other hand, do not have this problem.

Another advantage of regular expressions under these conditions is good performance for short strings and a small number of examples. At the same time, competing methods (for example, neural networks) might not have time to catch the pattern under such conditions.

Of course, manually finding short and precise regular expressions for real data is difficult. Especially when the pattern is not visible, there is a lot of data, or when the task needs to be repeated daily. Obviously, in this case, you want to search for regular expressions automatically. And that's exactly what we'll do!

So, this article proposes a method for automatically constructing regular expressions that separate one class of string from another with the required precision. This method is implemented in the library streetreewhich I mentioned above. And now I propose to start studying the algorithm itself and consider an example with code.

Algorithm

First, let's describe the problem a little more formally.

We have two sets of rows of corresponding classes: And . The same string can appear multiple times, either in just one class or in both classes at the same time.

Our goal is to find such a set of regular expressions $\{r_i\}$ which would make an assumption about the membership of an arbitrary string in the class . The assumption will be made using the operation matchapplied to each regular expression . If for at least one regular expression the result of the operation is positive, then we consider that the string belongs to class .

You might think that the classifier's answer can only be absolute: the string either belongs to the class , or does not belong. However, as I will show below, we can also obtain probability belonging of a string to a class.

Making up regular expressions

How to get the necessary regular expressions? Let's start with the fact that the set we need always exists. As it we can simply use all the lines from the class . If the set weakly intersects with the class , the accuracy of such a classifier will be high – up to 100% with a complete absence of intersections. However, in this case, the set may be too long, and the classifier will not work correctly when seeing new rows that were not present in the set . In addition, if you use an algorithm to find general patterns in a class (for example, for EDA), then you will not get any useful information from such “patterns” – they do not generalize anything. To avoid such problems, we need to look for shorter but more precise regular expressions.

To do this, we will create each regular expression incrementally, starting with an empty pattern and extending it until we either reach the minimum acceptable precision (i.e. the proportion of lines from the class among all strings matching the expression; let's denote it as minPrecision), or until the regular expression becomes too long and no longer covers any lines. Each time we matched (or failed to match) a regular expression corresponding to the last step of a line from the class we will consider them “covered” and will not consider them for the next iterations. So we will collect regular expressions until for each line from the class we may (or may not) be able to match the pattern.

How will we lengthen regular expression? We will add to it one by one token (one building block for a regular expression; more on this below), and among the combinations of expression and tokens, find the best extension option in terms of f1_score on currently uncovered lines. If, when adding tokens, the new expression does not cover a single line from the uncovered part of the class (those. f1_score = 0 for all tokens), we will assume that there is no extension and start looking for the pattern again.

A little bit about tokens. Tokens – these are transformed n-grams length 1, obtained from class strings . It plays the role of a minimal pattern, a kind of building block when constructing a regular expression. Let's look at the process of composing tokens using an example.

Let's say consists of one line “string 1”. N-grams of length 1 for a set are “s”, “t”, “r”, “i”, “n”, “g”, ” “, “1”. After conversion they will become tokens “s.+”, “.+t.+”, “.+r.+”, …, “.+1”as well as tokens such as “.+\d” (instead of “1” at the end of the line – any number at the end), “[a-z].+” (instead of “s” at the beginning – any lowercase letter at the beginning), and others. That is, tokens are short patterns that “in general terms” describe a set that will be added to the current regular expression. There can be many ways to convert an n-gram into a token. The wider the token after conversion, the wider the regular expressions in the final set can be.

Probability of class membership

I mentioned that the classifier is capable of producing not only binary values, but also the probability of the string belonging to the class . It happens very simply. When adding a regular expression to a list $\{r_i\}$ they checked its precision score, that is, the proportion of lines from the class among all strings that match the regular expression. Thus, if an arbitrary string matches some regular expression, we can use the precision score of this regular expression as an approximation of the probability that the string belongs to the class .

“Pseudocode”

Repeat until all lines are out will not be marked as “covered”:

Create an empty regular expression;
Repeat until expression extension no longer exists or its precision reaches minPrecision:
1. Lengthen a regular expression by looping through the tokens and selecting the combination of token and expression with the highest f1-score on uncovered lines;
If the last regular expression exists and its precision is greater than or equal to minPrecisionadd expression to list $\{r_i\}$ ;

Mark lines from S1 that match the last regular expression found as “covered”.

Example

In this simple example using the library strtree looking for regular expressions to classify fictitious names of smartphones. Namely, in the example:

Classifying regular expressions are defined
The precision score and recall score are calculated to classify training strings
Predictions with triggered regular expressions and their accuracies for training strings are obtained

You can install the library using pip: pip install strtree

The library also has a small documentation.

import strtree


strings = ['Samsung X-500', 'Samsung SM-10', 'Samsung X-1100', 'Samsung F-10', 
           'Samsung X-2200', 'AB Nokia 1', 'DG Nokia 2', 'THGF Nokia 3', 
           'SFSD Nokia 4', 'Nokia XG', 'Nokia YO']
labels = [1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0]

tree = StringTree()
tree.build(strings, labels, min_precision=0.75)

for leaf in tree.leaves:
    print(leaf)
# PatternNode(".+ .+a.+", right=None, left=PatternNode(.+0.+), n_strings=11, 
#              precision=1.0, recall=0.57)
# PatternNode(".+0.+", right=None, left=None, n_strings=7, 
#             precision=1.0, recall=1.0)

print('Precision: {}'.format(tree.precision_score(strings, labels)))
# Precision: 1.0

print('Recall: {}'.format(tree.precision_score(strings, labels)))
# Recall: 1.0

matches, matched_nodes = tree.match(strings, return_nodes=True)
# нou will receive a vector of the same size as strings, 
# containing 0's (no match) or 1's (match), 
# and also the nodes containing regular expression and its precisions for each sample

If you have any questions regarding the operation or application of the algorithm, library, or you just want to chat, I invite you to comment!