Search for toxic SQL queries

We are Daniil and Alexander, students from MEPhI who joined Sberbank as interns in the SberData department, which develops the internal corporate analytical platform (CAP). It is a modern platform with convenient tools that covers the full range of Sber's data needs: storage, integration, various kinds of analytics, reporting, modeling and data quality control. All of these areas would be hard to develop without a dedicated R&D division, which is where we work. Today we want to share our research into designing an algorithm for identifying toxic SQL queries with machine learning. Why are such queries called "toxic"? Because they consume too many resources, above all time, during execution. Strictly speaking it is not only time, but for simplicity we will count only time, since it is the key parameter.

The article is devoted to a study of existing approaches and their evaluation on open data; the publicly available TPC‑H and BIRD benchmarks were chosen for this. It also covers some of the challenges we ran into along the way, such as generating data and SQL queries and migrating between SQL dialects. At the end of the article we describe the approach we ultimately arrived at. In the next article we will talk about applying this experience to a real industrial system.

Our work on the problem can be divided into several stages, described in the sections below.

Review of existing methods

Let's look at three different approaches:

1.1 The first is graph neural networks: this architecture works with data represented as an adjacency matrix via graph convolutions. In the paper "Query Performance Prediction for Concurrent Queries using Graph Embedding", the query execution plan is represented as a graph whose vertices are the query operators and their parameters, and whose edges reflect the connections between those operators. The method can consider multiple queries at once, which makes it possible to account for resource contention when several queries run simultaneously. It gives high prediction accuracy but requires large amounts of data for effective training. An example of constructing such a network for a set of operators is shown in the figure below.

1.2 We then turned to the idea from "Plan‑Structured Deep Neural Network Models for Query Performance Prediction" of treating each query plan operator as a separate element. The paper proposes a modular deep neural network architecture in which every plan operator is mapped to its own module. These modules are combined into a single network that mirrors the structure of the query plan and predicts latency at each stage of query execution. This approach makes it possible to model complex SQL queries flexibly and accurately, taking their structure and resource intensity into account. The full model pipeline is shown in the figure below.

1.3 Both of these methods predict SQL query execution time from an analysis of execution plans, but in a real system the query plan is not always available. For this reason, an approach based on vectorizing the query text with methods such as TF‑IDF and Bag of Words (BoW) stands apart. For example, in "Forecasting SQL Query Cost at Twitter", text vectorization methods are combined with various models to predict query execution time and resource consumption. The architecture of the model from that paper is shown in the figure below.

The pros and cons of each method are summarized below:

| Approach | Features | Cons |
|---|---|---|
| Query Performance Prediction for Concurrent Queries using Graph Embedding | Takes into account the execution of several queries simultaneously. Represents the query plan as a graph. | Requires a query plan. Requires a large amount of data. |
| Plan‑Structured Deep Neural Network Models for Query Performance Prediction | Modular view of the query plan. | Requires a query plan. |
| Forecasting SQL Query Cost at Twitter | No query plan is used; the query text is vectorized. | Text vectorization does not take the order of operators in a query into account. |

Description of selected models and metrics

Below we will use the concept of a query execution plan: a detailed description of the sequence of operations that the DBMS (database management system) performs to produce the result of an SQL query. In order not to interfere with production systems, we work with the query text; on the plus side, two parameters from the query plan are still available to us – cost and rows. Cost is the planner's estimate of how expensive each operation in the plan is, and rows is the number of rows the DBMS expects to read.
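
As an illustration, the sketch below shows one way to pull these two numbers out of PostgreSQL without executing the query, using EXPLAIN in JSON format (the psycopg2 driver and the connection string are our assumptions here, not part of the platform setup):

```python
import json
import psycopg2

def plan_features(query: str, dsn: str = "dbname=bird user=postgres") -> tuple[float, float]:
    """Return the planner's estimated total cost and row count for a query."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            # EXPLAIN does not execute the query, it only asks the planner for estimates
            cur.execute(f"EXPLAIN (FORMAT JSON) {query}")
            raw = cur.fetchone()[0]
            # psycopg2 may return the JSON already parsed or as a string
            plan = (raw if isinstance(raw, list) else json.loads(raw))[0]["Plan"]
    return plan["Total Cost"], plan["Plan Rows"]

# cost, rows = plan_features("SELECT count(*) FROM reviews WHERE review_stars > 3")
```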

For the features we have chosen, we describe below the vectorization algorithms and regression models used, followed by a short end-to-end sketch.

Vectorization algorithms:

  • BoW (Bag of Words): converts text into numeric features, counting the frequency of each word in the document;

  • TF‑IDF (Term Frequency‑Inverse Document Frequency): converts text into numeric features, taking into account the frequency of the word in the document and the inverse frequency of words in the text corpus;

  • BERT (Bidirectional Encoder Representations from Transformers): used to create vector representations of texts. The model is pre-trained (bert-base-uncased) on a large corpus and is able to capture contextual dependencies between words in a sentence. Query texts were converted into tokens with the BERT tokenizer, and the tokens were then passed to the BERT model for feature extraction (a short sketch follows this list).
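
For reference, a minimal sketch of such feature extraction with the Hugging Face transformers library might look as follows (mean pooling over token embeddings is one possible choice and an assumption on our part, not necessarily how the pooling is done in the cited works):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
bert.eval()

def embed_queries(queries: list[str]) -> torch.Tensor:
    """Turn SQL query texts into fixed-size vectors with a pre-trained BERT."""
    inputs = tokenizer(queries, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = bert(**inputs)
    # Average token embeddings into one vector per query (mean pooling)
    return outputs.last_hidden_state.mean(dim=1)

# features = embed_queries(["select count(*) from users", "select * from reviews"])
```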

Regression Models:

  • Lasso — regularized linear regression algorithm using L1 penalty;

  • Ridge — an algorithm similar to Lasso, but using an L2 penalty;

  • Random forest — random forest used for the regression task;

  • Neural network — a classic fully connected multilayer neural network implemented in Pytorch;

  • SVR — support vector machine for regression problem.
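
A minimal end-to-end sketch of how a vectorizer and a regressor fit together (the scikit-learn pipeline and the hyperparameters are illustrative assumptions, not the exact configuration from our experiments):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

def train_runtime_model(queries, times):
    """queries: list of SQL texts, times: measured execution times in seconds."""
    X_train, X_test, y_train, y_test = train_test_split(
        queries, times, test_size=0.2, random_state=42
    )
    model = make_pipeline(
        TfidfVectorizer(max_features=1024),       # query text -> sparse feature vector
        RandomForestRegressor(n_estimators=200),  # regressor predicting execution time
    )
    model.fit(X_train, y_train)
    return model, X_test, y_test
```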

Metrics:

  • Accuracy – for a regression problem, a prediction is counted as correct if it deviates from the true value by no more than ε%; below we use an error window of ε = 25%:

    \text{Accuracy}=\frac{1}{n}\sum_{i=1}^{n}\mathbb{I}\left(\left|\hat{y}_{i}-y_{i}\right|\leq\varepsilon\cdot\left|y_{i}\right|\right)

  • MAPE (Mean Absolute Percentage Error) – the mean absolute deviation of the prediction from the true value, expressed as a percentage of the true value.

Here \hat{y}_{i} is the model's prediction and y_{i} is the true value.
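
Both metrics are easy to compute directly; a small numpy sketch (with ε = 0.25) might look like this:

```python
import numpy as np

def accuracy_within(y_true, y_pred, eps: float = 0.25) -> float:
    """Share of predictions that fall within eps * |y_true| of the true value."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(np.abs(y_pred - y_true) <= eps * np.abs(y_true)))

def mape(y_true, y_pred) -> float:
    """Mean absolute percentage error, in percent."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(np.abs((y_pred - y_true) / y_true)) * 100)
```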

Testing models on TPC-H

We could not gain access to real data right away, so as a first step we decided to practice on "cats" (toy data). For initial model training and preliminary assessment we chose the TPC‑H benchmark, which is mainly used to compare the performance of analytical systems and data warehouses. It suited us for a first test because it already has more than five tables linked by keys and there is no need to prepare data for the test by hand. While testing the models we found that the benchmark did not contain enough SQL queries, so we developed a module that generates them with the help of language models; it is described below. For testing, data was generated at a scale where the average SQL query execution time reaches tens of seconds. The results obtained during testing are presented in the table below.

| Regressor | MAPE (%), TF‑IDF | MAPE (%), BERT | Accuracy (%), TF‑IDF | Accuracy (%), BERT |
|---|---|---|---|---|
| Lasso | 120.76 | 120.76 | 13.33 | 13.33 |
| SVR | 148.57 | 125.61 | 10.00 | 13.33 |
| Random Forest | 90.55 | 101.49 | 20.00 | 3.33 |
| Neural Network | 86.12 | 171.51 | 0.00 | 16.67 |

The prediction accuracy is quite low, which suggests there is simply too little data for training, so it is hard to call the experiment either a success or a failure. Because of the large variety of words in the queries, the dimensionality of the feature space in this example exceeds the amount of training data, and the models cannot achieve good generalization. The next step was therefore to pick a new benchmark with more tables and to increase the number of SQL queries.

SQL query generation module

Because of the shortage of SQL queries, we decided to write our own small generation module, which can be found at the link. It is built around a large language model – GigaChat – which did an excellent job of generating additional SQL queries for our purposes. The algorithm is as follows (a minimal sketch follows the list):

  1. GigaChat is called with a prompt asking it to generate a certain number of SQL queries;

  2. SQL queries are extracted from the received text;

  3. They are validated on a small database to check the correctness of the syntax.
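
A minimal sketch of this loop is shown below; the `ask_llm` function is a hypothetical stand-in for the GigaChat call, and the regex-based extraction is deliberately simplified:

```python
import re
import sqlite3

SQL_PATTERN = re.compile(r"(?is)\bselect\b.*?;")  # crude extraction of SELECT ... ; statements

def collect_queries(ask_llm, db_path: str, n_calls: int = 10) -> set[str]:
    """Ask an LLM for SQL queries, extract them from the reply, keep only valid, unique ones."""
    conn = sqlite3.connect(db_path)
    valid: set[str] = set()
    for _ in range(n_calls):
        reply = ask_llm("Generate 5 SELECT queries for the schema ...")  # hypothetical GigaChat wrapper
        for query in SQL_PATTERN.findall(reply):
            query = " ".join(query.split()).lower()
            try:
                # a syntax check against the small test database, without full execution
                conn.execute(f"EXPLAIN QUERY PLAN {query.rstrip(';')}")
                valid.add(query)
            except sqlite3.Error:
                pass  # discard queries that do not parse
    conn.close()
    return valid
```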

During development, we noticed the following:

  • it is better to specify restrictions on how many queries are generated per call – this increases their diversity (the limit was found empirically);

  • it is worth checking the results for uniqueness, since the language model's responses can repeat.

We managed to collect about 300 unique queries in addition to the existing ones. But practice has shown that this is still not enough, and it was decided to find a database with a large number of tables to increase the number of training samples.

Testing models on BIRD

To obtain a dataset with a large number of queries and tables from the start, at the next stage we selected the BIRD benchmark, which is designed for the text2sql problem. It includes many databases, each with a certain number of queries. The top 7 databases by number of queries are shown in the table below.

| DB name | Number of queries | Number of tables |
|---|---|---|
| works_cycles | 473 | 65 |
| public_review_platform | 380 | 15 |
| retail_world | 360 | 8 |
| movie_3 | 314 | 16 |
| mondial_geo | 292 | 34 |
| soccer_2016 | 257 | 21 |
| retails | 242 | 8 |

Our final choice fell on the public_review_platform database because of its large number of queries. At the same time it has relatively few tables, which makes it easy to migrate from the original SQLite (on which the entire BIRD benchmark is built) to PostgreSQL. Migration is discussed in more detail below.

At this step we also decided to add extra query preprocessing to improve accuracy (a simplified sketch is shown after the list), namely:

  • Formatting (converting to a single case and collapsing repeated whitespace);

  • Removing comments (both single-line and multi-line);

  • Replacing aliases with full table names;

    • Before:

      SELECT u.user_id, u.user_average_stars, r.review_stars FROM users u JOIN reviews r ON u.user_id = r.user_id WHERE r.review_length='Short' AND r.review_votes_useful='Medium';

    • After:

      select users.user_id, users.user_average_stars, reviews.review_stars from users join reviews on users.user_id = reviews.user_id where reviews.review_length='short' and reviews.review_votes_useful='medium';

  • Adding extra features from the query plan (the estimated cost of the query and the number of rows processed).
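
A simplified sketch of the first three steps is shown below (the alias handling here is naive and only covers `FROM table alias` / `JOIN table alias` patterns; the real module is more thorough):

```python
import re

def preprocess(query: str) -> str:
    """Normalize case and whitespace, strip comments, expand table aliases."""
    query = re.sub(r"--[^\n]*", " ", query)                # single-line comments
    query = re.sub(r"/\*.*?\*/", " ", query, flags=re.S)   # multi-line comments
    query = " ".join(query.lower().split())                # single case, collapse whitespace
    # naive alias expansion: "from users u" / "join reviews r" -> replace "u." with "users."
    stop_words = ("on", "where", "join", "inner", "left", "right", "group", "order", "limit")
    for table, alias in re.findall(r"(?:from|join)\s+(\w+)\s+(?:as\s+)?(\w+)", query):
        if alias not in stop_words:
            query = query.replace(f"{alias}.", f"{table}.")
    return query
```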

In addition, extra queries were generated against this database using the query generation module described above. As a result, the number of queries grew to 600, and the database itself was expanded from 100 MB to 15 GB in total. Testing was then carried out on the resulting data; the results are presented below.

TF‑IDF

| Regressor | MAPE (%) (with cost and rows) | MAPE (%) (without cost and rows) | Accuracy (%) (with cost and rows) | Accuracy (%) (without cost and rows) |
|---|---|---|---|---|
| XGBoost | 33.30 | 136.86 | 80.49 | 67.07 |
| SVR | 98.81 | 435.06 | 62.20 | 64.63 |
| Random Forest | 32.22 | 66.99 | 76.83 | 43.90 |
| Ridge | 277.41 | 118.75 | 46.34 | 34.15 |
| Neural Network | 109.40 | 142.47 | 47.56 | 46.34 |

BoW

| Regressor | MAPE (%) (with cost and rows) | MAPE (%) (without cost and rows) | Accuracy (%) (with cost and rows) | Accuracy (%) (without cost and rows) |
|---|---|---|---|---|
| Lasso | 89.94 | 89.94 | 24.39 | 14.39 |
| SVR | 1509.46 | 139.86 | 54.39 | 49.51 |
| Random Forest | 45.29 | 70.62 | 64.15 | 50.73 |
| XGBoost | 77.87 | 130.64 | 58.05 | 53.17 |
| Neural Network | 117.44 | 167.93 | 43.41 | 37.32 |

BERT

| Regressor | MAPE (%) (with cost and rows) | MAPE (%) (without cost and rows) | Accuracy (%) (with cost and rows) | Accuracy (%) (without cost and rows) |
|---|---|---|---|---|
| Lasso | 89.94 | 201.7 | 43.90 | 24.39 |
| SVR | 258.3 | 594.61 | 64.63 | 62.20 |
| Random Forest | 56.20 | 68.40 | 46.34 | 37.80 |
| XGBoost | 35.48 | 223.24 | 76.83 | 47.56 |
| Neural Network | 191.81 | 93.61 | 53.66 | 39.02 |

Analysis of the tables identifies the best regression models: random forest and gradient boosting showed the strongest results. Both are ensemble methods (bagging and boosting, respectively), which gives better generalization and more stable results.

Comparing the text vectorization methods, bag of words turned out worse on average, since it simply counts word occurrences without taking their relationships and context into account. TF‑IDF, by contrast, performed well: it weights each word by its frequency across documents, which prevents overestimating frequently occurring terms (such as SQL keywords) and highlights rare but significant words, such as table or column names. BERT also showed good results, but further testing was carried out with the TF‑IDF vectorizer because of its speed. The final model could still be run on BERT, since it captures word semantics, which improves model accuracy.

The table below shows how model quality depends on the maximum number of features in the TF‑IDF vectorizer.

| Max features | Regressor | MAPE (%) | Accuracy (%) |
|---|---|---|---|
| 128 | Random Forest | 38.06 | 76.83 |
| 128 | XGBoost | 33.3 | 80.49 |
| 128 | SVR | 98.81 | 62.20 |
| 256 | Random Forest | 29.16 | 75.61 |
| 256 | Catboost | 82.96 | 62.20 |
| 256 | XGBoost | 33.31 | 84.15 |
| 512 | XGBoost | 33.31 | 84.15 |
| 512 | Random Forest | 37.10 | 71.95 |
| 512 | SVR | 126.91 | 63.41 |
| 1024 | Random Forest | 27.67 | 76.83 |
| 1024 | XGBoost | 33.31 | 84.15 |
| 1024 | Linear | 234.78 | 65.85 |
| 2048 | Random Forest | 28.01 | 76.83 |
| 2048 | XGBoost | 33.31 | 84.15 |
| 2048 | SVR | 126.91 | 63.41 |
| 4096 | Random Forest | 31.28 | 78.05 |
| 4096 | XGBoost | 33.31 | 84.15 |
| 4096 | Catboost | 82.96 | 62.20 |
| 8192 | Random Forest | 37.13 | 73.17 |
| 8192 | XGBoost | 33.31 | 84.15 |
| 8192 | SVR | 126.91 | 63.41 |

When analyzing how model accuracy depends on the maximum number of features in the TF‑IDF vectorizer, we noticed that accuracy grows as the number of features increases, because a larger vector can capture more table columns and query keywords. Beyond a certain point, however, extra vector size brings no additional benefit and accuracy plateaus. The vector dimension should therefore be chosen based on the average and maximum number of tables and columns in the system. In our case the optimal maximum vector size turned out to be 4096, though 1024 can be used when faster operation is needed. The choice of dimension is driven by the size of the "dictionary" used in queries, in particular by the number of unique table and column names. Looking ahead: on real data a larger vector is needed, since the vocabulary is larger.
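
Such a dependence can be checked with a simple sweep like the sketch below (the grid and the random forest regressor are illustrative choices, not the exact setup of the experiments above):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import TfidfVectorizer

def sweep_max_features(train_q, train_t, test_q, test_t,
                       grid=(128, 256, 512, 1024, 2048, 4096, 8192)):
    """Print the within-25% accuracy for different TF-IDF vocabulary sizes."""
    for n in grid:
        vec = TfidfVectorizer(max_features=n)
        reg = RandomForestRegressor(n_estimators=200)
        reg.fit(vec.fit_transform(train_q), train_t)
        pred = reg.predict(vec.transform(test_q))
        acc = np.mean(np.abs(pred - np.asarray(test_t)) <= 0.25 * np.abs(np.asarray(test_t)))
        print(f"max_features={n}: accuracy={acc:.2%}")
```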

While working with this benchmark we ran into some difficulties, namely data generation and migration from SQLite to PostgreSQL, for which dedicated modules were developed. They are described in more detail below.

Data generation module in the database

Not all benchmarks include a data generation module, so we had to write a tool of our own, which can be studied at the link. To make it fully usable, we generalized generation to an arbitrary database schema. Its main features are listed below, followed by a condensed sketch of the idea:

  1. The database schema for data generation is supplied as input;

  2. Foreign keys are supported for correct generation (values are naively checked against already generated data);

  3. For each column, a set of values to generate can be specified (for example, a list of cities);

  4. The result can be written to a file or inserted row by row into PostgreSQL.
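
To give an idea of the approach, here is a condensed sketch: the schema is described as a dictionary, and foreign-key columns draw their values from rows already generated for the parent table (the schema format and column types here are assumptions, not the module's actual interface):

```python
import random
import string

def random_value(col_spec, generated):
    """Produce a value for one column according to its spec."""
    if "fk" in col_spec:                  # foreign key: reuse a value from the parent table
        table, column = col_spec["fk"]
        return random.choice(generated[table])[column]
    if "choices" in col_spec:             # user-supplied list of allowed values (e.g. cities)
        return random.choice(col_spec["choices"])
    if col_spec.get("type") == "int":
        return random.randint(0, 10_000)
    return "".join(random.choices(string.ascii_lowercase, k=8))

def generate(schema, rows_per_table=1000):
    """schema: {table: {column: spec}}, ordered so that parent tables come first."""
    generated = {}
    for table, columns in schema.items():
        generated[table] = [
            {col: random_value(spec, generated) for col, spec in columns.items()}
            for _ in range(rows_per_table)
        ]
    return generated
```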

Migration from SQLite to PostgreSQL

The BIRD benchmark uses the SQLite dialect, whose data types differ from those of PostgreSQL. Staying on SQLite was not practical, so we migrated to PostgreSQL, which required the following steps (done with manual analysis and regular-expression replacements; a small sketch follows the list):

  1. Translate table descriptions into the correct format;

  2. Change all data types in accordance with the content (for example, convert some "text" columns to "time");

  3. Convert table description files into .sql files;

  4. Make data changes to the queries (for example, translate "11AM" to "11:00:00").
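
Most of these replacements boil down to regular expressions; for example, the time-literal conversion from step 4 could be sketched as:

```python
import re

def fix_time_literals(query: str) -> str:
    """Rewrite literals such as '11AM' or '3PM' into PostgreSQL-friendly 'HH:MM:SS'."""
    def repl(match: re.Match) -> str:
        hour, meridiem = int(match.group(1)), match.group(2).upper()
        if meridiem == "PM" and hour != 12:
            hour += 12
        if meridiem == "AM" and hour == 12:
            hour = 0
        return f"{hour:02d}:00:00"
    return re.sub(r"\b(\d{1,2})\s*(AM|PM)\b", repl, query, flags=re.I)

# fix_time_literals("... WHERE opening_time = '11AM'")  ->  "... WHERE opening_time = '11:00:00'"
```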

Conclusions

On PostgreSQL we achieved results that let us continue the work on real DBMSs (on the BIRD benchmark: roughly 30% deviation and 75–80% accuracy). The best vectorizers turned out to be TF‑IDF and BERT, although the latter is more resource-intensive, and the best accuracy was shown by the gradient boosting (XGBoost) and random forest regressors thanks to their ensemble structure.

Main advantages of the model:

  • Based on the query text, which is easy to obtain in the system;

  • Good interpretability.

Disadvantages and possible nuances during further work:

  • If there are a large number of tables, the vector will be too large (the feature space will probably need to be reduced).

Authors: Uskov Daniil(@Dusik207) and Dikov Alexander (@DikovAlexandr) under the supervision of Maxim Radionov (@tr1ggers), Igor Dmitriev (@indmitriev) and Vladimir Zvorygin (@Wocher), members of the Sber professional community DWH/BigData.
