This article is based on the paper “Forecasting SQL Query Cost at Twitter”, presented at the 9th IEEE International Conference on Cloud Engineering (IC2E), 2021.
Responding to the growing need for petabyte (PB)-scale data analysis, the Interactive Query Team at Twitter aims for high scalability and availability. To address the performance issues of developing and maintaining SQL systems over ever-growing data volumes, we built a large-scale SQL federation system spanning on-premises Hadoop clusters and cloud clusters backed by Google Cloud Storage (GCS).
At its center is Presto as the cluster SQL engine. Together with other projects under the umbrella name Partly Cloudy, the SQL federation system is moving toward democratizing data analytics and improving productivity at Twitter.
While operating this SQL system, we found that without predicting the resource usage of SQL queries, several problems arise:
Query scheduling requires an estimate of the current workload in the SQL system. Without proper scheduling, the cluster can become overloaded: in as little as 10 seconds, resource-intensive queries can take up most of the cluster’s resources.
Clients of data processing systems want an estimate of the resource consumption of their queries. If resource usage, such as CPU time, is predicted before a query executes, clients know approximately how many resources their queries will consume.
Elastic scaling requires predicting resource usage before queries execute. Because resource-intensive queries take effect quickly, the SQL system must scale up before processing them.
To predict resource usage for queries, existing database management system (DBMS) approaches typically rely on query plans generated by SQL engines. Such approaches could not help us predict resource usage for query scheduling and preemptive scaling in scenarios where no SQL engine is involved.
Instead, we use machine learning techniques to train two models on historical SQL query logs, predicting CPU usage and peak memory usage. In this article, we describe in detail how we built an ML-based system for forecasting the resource usage of SQL queries at Twitter.
At Twitter, every SQL query processed by the SQL federation system produces a record in the query log. The query cost forecasting system uses these query logs as the initial dataset for training.
Each log record contains query-related information, including a unique identifier, the username, the environment, and the query statement. According to our experiments, logs from the last three months (90 days) are a good basis for predicting the cost of online queries. A typical dataset contains approximately 1.2 million records and over 20 features.
The training cluster performs the machine learning computations. To predict CPU usage and peak memory usage, we train two ML models, a CPU model and a memory model, on past query logs. The training cluster performs:
Data cleaning and sampling on the raw dataset, converting continuous CPU time and peak memory into discrete buckets.
Vectorization to extract features from raw SQL statements.
Model training using classification algorithms.
The trained models are stored in a central repository such as GCS. In the serving cluster, models are retrieved from storage and loaded into the forecasting web service.
This service exposes RESTful API endpoints for external requests, used to predict CPU usage and peak memory usage for online SQL queries coming from notebook/BI tools (this is how clients get a resource-usage estimate for their queries) and from a router (for query scheduling and preemptive scaling).
Previous statistical approaches have applied regression techniques such as time series analysis to DBMS problems. But the distribution of resource consumption across SQL queries is exponential, so traditional regression approaches are hard to apply because of large fluctuations in the tail of the distribution.
We also noticed that for rough resource estimates, scheduling, and scaling, clients and the router do not need exact predictions. It is enough to know a query’s resource intensity: low, medium, or high. To this end, we discretize the original dataset, converting continuous values into discrete classes.
Data cleaning and sampling
How do we group queries? First, we chose thresholds for intensive CPU and memory usage: 5 hours and 1 TB, respectively. These values have historically been used in the SQL federation system, based on our (and DevOps) experience running analytic queries.
Based on DevOps experience, queries with CPU time under 30 seconds are considered lightweight. This covers most queries, those in the range [0, 30 s), i.e. more than 70% of the total; only 1% of queries fall into the range [30, 60 s]…
The distribution of peak memory usage is more even. We treat queries with peak memory of 1 TB or more as memory-intensive and choose 1 MB as the boundary between low and medium memory consumption.
The resulting distribution categories:
CPU time, three ranges: [0, 30 s), [30 s, 5 h), [5 h, ∞);
peak memory, three ranges: [0, 1 MB), [1 MB, 1 TB), [1 TB, ∞).
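A minimal sketch of this bucketing, using the thresholds above (pure Python; class names are illustrative, and the real pipeline applies this over the full query-log dataset):

```python
# Thresholds from the article: CPU 30 s / 5 h, memory 1 MB / 1 TB.
CPU_LOW_S, CPU_HIGH_S = 30, 5 * 3600
MEM_LOW_B, MEM_HIGH_B = 1024**2, 1024**4

def cpu_class(cpu_seconds):
    """Map continuous CPU time to one of three discrete classes."""
    if cpu_seconds < CPU_LOW_S:
        return "low"      # [0, 30 s)
    if cpu_seconds < CPU_HIGH_S:
        return "medium"   # [30 s, 5 h)
    return "high"         # [5 h, ∞)

def mem_class(peak_bytes):
    """Map peak memory usage to one of three discrete classes."""
    if peak_bytes < MEM_LOW_B:
        return "low"      # [0, 1 MB)
    if peak_bytes < MEM_HIGH_B:
        return "medium"   # [1 MB, 1 TB)
    return "high"         # [1 TB, ∞)
```

These two class labels become the targets for the CPU model and the memory model, respectively.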
After transforming the dataset, we divide it into training data (80% of queries) and test data (20% of queries).
We use natural language processing (NLP) vectorization techniques to generate features from the transformed dataset with its peak-memory and CPU-time categories. Vectorization maps each SQL statement to a numeric vector for subsequent processing, which makes it easy to run text classification algorithms.
We use a bag-of-words model in which each word is represented by a number, so that a sequence of numbers can represent an SQL statement. A typical representation is word frequency. We also use the popular term frequency-inverse document frequency (TF-IDF) representation to create features.
Bag-of-words models are very flexible, generalize to different domains of text data, and provide features without SQL-engine computation and without interacting with the metadata store. We do not use table statistics for features: they would require extra overhead for parsing SQL statements and extracting table metadata.
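As a toy illustration of the bag-of-words/TF-IDF idea applied to SQL text (a pure-Python sketch; a production pipeline would use a library vectorizer such as scikit-learn’s `TfidfVectorizer`, and the tokenizer here is deliberately naive):

```python
import math
from collections import Counter

def tokenize(sql):
    """Naive SQL tokenizer: lowercase, strip common punctuation, split."""
    for ch in "(),;=":
        sql = sql.replace(ch, " ")
    return sql.lower().split()

def tfidf_vectors(statements):
    """Map each SQL statement to a {token: tf-idf weight} dict."""
    docs = [Counter(tokenize(s)) for s in statements]
    n = len(docs)
    # Document frequency: in how many statements each token appears.
    df = Counter()
    for doc in docs:
        df.update(doc.keys())
    vectors = []
    for doc in docs:
        total = sum(doc.values())
        vectors.append({
            tok: (cnt / total) * math.log(n / df[tok])
            for tok, cnt in doc.items()
        })
    return vectors
```

Tokens that appear in every statement (SELECT, FROM) get a weight of log(n/n) = 0, while tokens specific to a few queries, such as table names, get high weights, which is exactly the signal useful for classifying query cost.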
We also noticed that tree-based machine learning algorithms such as XGBoost, which readily expose feature importance, can highlight SQL-related features such as access to specific tables or the use of time ranges. These features are commonly used in traditional plan-based query cost models. In other words, machine learning techniques can help developers understand large-scale SQL systems.
Train and evaluate models
With the TF-IDF features prepared, we use the XGBoost gradient boosting tree algorithm to train classifiers on the training dataset. To find the optimal hyperparameters, we use 3-fold cross-validation. We then test the trained classifiers on the test dataset. The CPU model achieves an accuracy of 97.9%, and the memory model reaches 97.0%.
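The hyperparameter search can be sketched as follows: generic k-fold cross-validation mechanics in pure Python. This is an illustration only; in the actual system the classifiers are XGBoost models, and library tooling such as scikit-learn’s `GridSearchCV` performs the same job:

```python
from itertools import product

def k_fold_indices(n_samples, k=3):
    """Split sample indices into k near-equal contiguous folds."""
    fold_size, rem = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        end = start + fold_size + (1 if i < rem else 0)
        folds.append(list(range(start, end)))
        start = end
    return folds

def grid_search(train_eval, param_grid, n_samples, k=3):
    """Return (best_params, best_mean_score) over all grid combinations.

    train_eval(params, train_idx, val_idx) trains a model on train_idx
    and returns its validation score on val_idx.
    """
    folds = k_fold_indices(n_samples, k)
    best_params, best_score = None, float("-inf")
    keys = sorted(param_grid)
    for values in product(*(param_grid[key] for key in keys)):
        params = dict(zip(keys, values))
        scores = []
        for i in range(k):
            val_idx = folds[i]
            train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
            scores.append(train_eval(params, train_idx, val_idx))
        mean = sum(scores) / k
        if mean > best_score:
            best_params, best_score = params, mean
    return best_params, best_score
```

Each parameter combination is scored by its mean validation accuracy across the 3 folds, and the best combination is used to retrain on the full training set.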
Accuracy is a popular metric for evaluating model performance, but not the only one. Given the imbalanced classes in the training dataset, high overall accuracy does not always mean high predictive power for every class: high accuracy on classes with many samples can mask poor prediction of classes with fewer samples.
To address this potential issue, we examine the precision and recall of each class, especially the classes representing CPU- or memory-intensive queries. In the tables below, our models achieve high precision and recall for all classes, as well as high overall accuracy. In particular, they reach at least 0.95 precision and recall for resource-intensive queries: [5 h, ∞) and [1 TB, ∞).
Having trained and tested the models, we deploy them in a web application to serve production traffic in real time. The service in the serving cluster runs in Aurora containers on Mesos, widely used at Twitter.
Each deployment unit is stateless, so the application can be scaled out by increasing the number of replicas. The service exposes two RESTful API endpoints for predicting the CPU usage and peak memory usage of an SQL query. Inference takes about 200 milliseconds.
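The endpoint logic reduces to roughly the following shape (a simplified, framework-free sketch; the real service is a RESTful web application, and the `vectorize` function and model objects here are hypothetical stand-ins):

```python
import json

def make_predict_handler(vectorize, cpu_model, mem_model):
    """Build a handler mapping a JSON request body to a JSON response.

    vectorize: maps a raw SQL string to a feature vector.
    cpu_model / mem_model: classifiers with a predict(features) method
    returning one of "low" / "medium" / "high".
    """
    def handle(request_body):
        query = json.loads(request_body)["query"]
        features = vectorize(query)
        return json.dumps({
            "cpu_class": cpu_model.predict(features),
            "peak_memory_class": mem_model.predict(features),
        })
    return handle
```

Because the handler holds no per-request state beyond its arguments, replicas of it can be deployed behind a load balancer and scaled horizontally, matching the stateless design described above.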
Using machine learning methods, we built a system for forecasting the cost of SQL queries that predicts CPU consumption and peak memory usage with over 97% accuracy.
Unlike other approaches, the system learns from raw SQL statements, building machine learning models from logs of past SQL queries, and does not depend on any SQL engine or query plans. We believe the approach described in this article can offer an innovative solution for performance optimization in traditional infrastructure.