Load testing without stress, part 2 – automation
To conduct load testing without stress, you need to take care of your resources – time and nerves. It also seems right to me to free myself from routine and tedious chores so that I can focus on interesting and complex tasks.
To achieve these goals, I developed a five-point anti-stress checklist for myself:
Automate the launch of measurements
Delegate routine to a bot
Save time on reports
Trust, but check the results
Take care of test data
Today we will go through them one by one:
Automate the launch of measurements
If you are new to load testing (NT), you will most likely be satisfied with writing simple scripts to get results as quickly as possible. If measurements are carried out once a year, then running the script locally and keeping rough notes on how it works is quite acceptable.
Let me remind you that my NT project uses the k6 tool, which requires either pure JavaScript code or plugin libraries written for k6.
Example of a local k6 launch for load generation and measurements
But if NT is a regular activity for your team, and the team is involved in analyzing the measurements, then it is worth thinking about automation.
Let me remind you that simply moving your script to any CI/CD service allows you to:
stop wasting mental energy on remembering how to run the script and which input parameters to pass;
skip writing a ton of documentation to pass your knowledge of how the script works on to colleagues.
Our NT measurement launch is described in CI/CD:
k6 is built in a Docker container and then used in the subsequent steps:
build-k6:
  stage: build
  rules:
    - when: always
  script:
    - |+
      docker run --rm -u "$(id -u):$(id -g)" -v "${PWD}:/xk6" grafana/xk6 build v0.54.0 \
        --with github.com/GhMartingit/xk6-mongo@v1.0.3 \
        --with github.com/avitalique/xk6-file@v1.4.0 \
        --with github.com/grafana/xk6-faker@v0.4.0 \
        > docker.log 2>&1
  cache:
    policy: push
    key: k6_binary
    paths:
      - ./k6
  artifacts:
    paths:
      - "*.log"
Before starting measurements, we check that the test environment is ready:
check:
  stage: check
  rules:
    - when: always
  script:
    - ./k6 run checkStand.js -e configFile=config/preparation.json -e host="$STAGE.$PROJECT.$DOMEN" -e dbName="${STAGE}_${PROJECT}"
  cache:
    key: k6_binary
    policy: pull
    paths:
      - ./k6
The check script itself is implemented with k6's specifics in mind:
import {
  expectedCountDocument,
  expectedCountUser,
  mongoClass,
} from "../../methods/base.method.js";
import { Counter } from "k6/metrics";

export const CounterErrors = new Counter("Errors");

export const options = {
  iterations: 1,
  thresholds: {
    "Errors{case:user}": [
      { threshold: "count<1", abortOnFail: true },
    ],
    "Errors{case:load_object_read}": [
      { threshold: "count<1", abortOnFail: true },
    ],
    "Errors{case:load_object_write}": [
      { threshold: "count<1", abortOnFail: true },
    ],
  },
};

export default function () {
  // Are there enough accounts for authorization diversity?
  const userCount = mongoClass.count("user");
  if (userCount < expectedCountUser || userCount === undefined) {
    CounterErrors.add(1, { case: "user" });
  }
  ...
  // Are there enough documents to measure the reading of large registries?
  const loadObject1ReadCount = mongoClass.count("load_object_read");
  if (loadObject1ReadCount < expectedCountDocument) {
    CounterErrors.add(1, { case: "load_object_read" });
  }
  ...
  // Writing scenarios require a clean collection before the measurement
  mongoClass.deleteMany("load_object_write", {});
  const loadObject1WriteCount = mongoClass.count("load_object_write");
  if (loadObject1WriteCount > 0) {
    CounterErrors.add(1, { case: "load_object_write" });
  }
}
The script checks that the data is sufficient – for example, that there are enough accounts for authorization diversity, or that the MongoDB collection holds enough documents to measure the reading of large registries.
By the way, sufficiency is not only about presence but also about absence: for scenarios that create records in the database, a clean collection before the measurement is a mandatory condition in our project.
You can also include verification of configuration, environment parameters, and so on in this check step.
k6 lets you process the results of the executed script steps, so if any of the checks fails, the pipeline step fails as well and the measurement is not started.
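For example, an environment-parameter check can be written in the same style as the cases above (a self-contained minimal sketch; the /api/health endpoint and the case name are illustrative assumptions, not our actual code):

import http from "k6/http";
import { Counter } from "k6/metrics";

export const CounterErrors = new Counter("Errors");

export const options = {
  iterations: 1,
  thresholds: {
    // Abort before the measurement if the environment configuration check fails
    "Errors{case:config}": [{ threshold: "count<1", abortOnFail: true }],
  },
};

export default function () {
  // Hypothetical health endpoint; replace with a real configuration check
  const res = http.get(`https://${__ENV.host}/api/health`);
  if (res.status !== 200) {
    CounterErrors.add(1, { case: "config" });
  }
}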
Our NT launch is configurable:
When starting the pipeline, you can choose:
variables:
  STAGE:
    value: "15-0-x-autotest"
    description: "Environment name"
  SCRIPT:
    description: "Load scenario"
    value: "scripts/all"
    options:
      - "scripts/all"
      - "scripts/auth-api"
      - "scripts/featureN"
      - "scripts/websocket"
  CONFIG:
    description: "Load profile"
    value: "config/smoke"
    options:
      - "config/smoke"
      - "config/rampRate"
      - "config/stability"
The load profile determines for how long and with what flow the load is applied to the environment.
smoke — this profile is used to verify the environment and that the scripts work. Each script is executed 5 times (a rough sketch of such a config follows this list).
rampRate — the target profile for measurements. The load is applied in several stages with a gradually increasing flow.
stability — with this profile, the load is applied for several hours (usually 6–8) at a constant RPS.
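For reference, a smoke profile in the same format as the rampRate config shown later might look roughly like this (a sketch only; the exact contents of our config/smoke file are an assumption):

{
  "summaryTrendStats": ["avg", "p(90)", "p(95)", "p(99)", "count"],
  "scenarios": {
    "smoke": {
      "executor": "shared-iterations",
      "vus": 1,
      "iterations": 5
    }
  }
}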
The load scenario selects the set of scripts used to apply the load.
all — this scenario launches regression NT; every script in the repository is executed.
auth-api — with this scenario, only the scripts that load the authorization service are launched.
featureN — this is just an example; you can run NT even for a single piece of functionality!
Grafana Labs has a good post that explains the differences between the profile types (in the post they are referred to as types of load testing).
I am sure that you have a logical question: how can you determine which load is stressful and which is normal?
And that's a damn good question that I don't have an answer to.)
These indicators are individual for each service, functionality, or system. They are determined empirically by observing how the object under test behaves with a given amount of allocated resources as the load gradually increases.
But let's get back to the code)
Our measurement launch is wrapped in a bash script. Perhaps this is redundant and a leftover: earlier, launching required preparing variables that we did not want to clutter gitlab-ci.yml with.
run:
  stage: measure
  rules:
    - if: $CI_PIPELINE_SOURCE == "pipeline"
    - if: $CI_PIPELINE_SOURCE == "schedule"
    - if: $CI_PIPELINE_SOURCE == "push"
      when: never
    - when: on_success
  script:
    - npm i
    - start=$(date +%s)
    - echo run.sh -h $STAGE -p $PROJECT -s $SCRIPT -d $DOMEN -c $CONFIG
    - ./run.sh -h $STAGE -p $PROJECT -s $SCRIPT -d $DOMEN -c $CONFIG
    - end=$(date +%s)
    - echo $start > start
    - echo $end > end
  after_script:
    - dashboardLT="${URL_GRAFANA_K6_FOR_QA}&from=$(cat start)000&to=$(cat end)000"
    - dashboardServices="${URL_GRAFANA_FACTORY_SERVICES}&from=$(cat start)000&to=$(cat end)000"
    - node tools/scripts/notification.js $STAGE $CI_COMMIT_BRANCH $CI_TELEGRAM_CHAT $CI_TELEGRAM_TOKEN $CI_JOB_ID $SCRIPT "$dashboardLT" "$dashboardServices" $CONFIG
  cache:
    key: k6_binary
    policy: pull
    paths:
      - ./k6
  artifacts:
    paths:
      - ./report
    expire_in: 2 week
    name: ${STAGE}
And the bash script itself…
#!/bin/bash
while getopts h:p:s:d:c: flag
do
  case "${flag}" in
    h)
      STAGE=${OPTARG}
      ;;
    p)
      PROJECT=${OPTARG}
      ;;
    s)
      SCRIPT=${OPTARG}
      ;;
    d)
      DOMEN=${OPTARG}
      ;;
    c)
      CONFIG=${OPTARG}
      ;;
    *)
      echo "Invalid option. Check the provided options: $OPTARG";;
  esac
done

[ -z "$STAGE" ] && STAGE="15-0-x-autotest"
[ -z "$PROJECT" ] && PROJECT="factory"
[ -z "$SCRIPT" ] && SCRIPT="scripts/all"
[ -z "$DOMEN" ] && DOMEN="lowcode"
[ -z "$CONFIG" ] && CONFIG="config/smoke"

if [[ $SCRIPT == "scripts/all" ]]; then
  folderPath=$(find scripts/* -type d)
else
  folderPath="$SCRIPT"
fi

for path in $folderPath; do
  for script in "$path"/*.js; do
    ./k6 run "$script" -e configFile="$CONFIG" -e host="$STAGE.$PROJECT.$DOMEN" -e dbName="${STAGE}_${PROJECT}" --insecure-skip-tls-verify -o experimental-prometheus-rw
    echo "$script", "$?" >> exitCode.txt
  done
done
Depending on the value passed for the load scenario, the scripts from the corresponding folder are launched sequentially.
After each script finishes, its exit code is appended to the exitCode.txt file.
We also record the start and end time of the overall measurement – these values are inserted into the Grafana link that the bot sends once the measurements are completed.
I will describe the bot and the use of exitCode.txt in detail below.
What could be the next steps in the development of autorun and configuration?
Integrating the launch of measurements under average load into the regular launch of release checks.
Carrying out NT with dynamic scaling.
Looking into endurance, stress, and recovery testing of a system or product.
Delegate routine to a bot
How do you stop twitching while checking whether the NT measurements have finished?
Notifications can help us! There are plenty of docs and examples online on implementing bots for various messengers.
Let's look at implementing notifications in Telegram. A prerequisite is preparing the bot in advance.
Since we use k6, we will need either pure JavaScript code or a plugin library written for k6.
Therefore, we have several options:
implement notifications via the Telegram API;
use the k6 plugin xk6-telegram.
The second option seems simpler, let's start with it.
There is a section with extensions on the k6 documentation page.
Extensions are created and maintained by both the k6 developers and the community around the project.
xk6-telegram is one such extension. It is developed by the community and listed among the extensions presented by k6, but its support is not guaranteed. Still, it is fine for getting started)
As a basis, we take the example code from the documentation:
import http from "k6/http";
import telegram from "k6/x/telegram";

const conn = telegram.connect(`${__ENV.TOKEN}`, false);
const chatID = 123456789;
const environment = "load_stand";
const link = "https://grafana.com/";

export default function () {
  http.get("http://test.k6.io");
}

export function teardown() {
  const body = `<b>Load testing</b> ${environment} \r\n` +
    `<b>Dashboard</b>: <a href="${link}">K6 Result</a>`;
  telegram.send(conn, chatID, body);
}
What do we get?
k6 executes the code in a script sequentially: setup (preparation), the default function (main code), teardown (cleanup).
As soon as the measurements are completed, with or without errors, a message is sent to the specified chatID at the final stage (teardown).
The message-sending code can be unified for all scripts and moved into a separate shared method.
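For instance, such a shared helper might look like this (a minimal sketch assuming the xk6-telegram extension is compiled into the binary; the module path and names are illustrative, not our actual code):

// tools/notify.js — hypothetical shared notification module
import telegram from "k6/x/telegram";

const conn = telegram.connect(`${__ENV.TOKEN}`, false);
const chatID = Number(__ENV.CHAT_ID);

// Send a uniform completion message; call this from teardown() in every script
export function notifyFinished(scriptName, dashboardLink) {
  const body =
    `<b>Load testing finished</b>: ${scriptName} \r\n` +
    `<b>Dashboard</b>: <a href="${dashboardLink}">K6 Result</a>`;
  telegram.send(conn, chatID, body);
}

A load script then only needs to import notifyFinished and call it from its teardown().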
For my project, a simple notification that the run has finished is not enough.
I would like to receive at least the minimum necessary information about the measurement that was performed.
To make this possible, I had to implement logic for processing the results and collecting data, while the notification itself is sent via a Telegram API call.
In our project, notification happens at the level of the entire measurement launch.
In .gitlab-ci.yml, the after_script block calls
node tools/scripts/notification.js args
tools/scripts/notification.js contains all the logic for preparing and sending the message to the bot.
You can view the full code at the link.
The data declared in the job pipeline is passed to the script as input:
the name of the environment on which the measurements were run (we prepare a separate environment for each release – for the case when measurements need to be taken on both the current and the previous release);
the branch with the scripts that were executed;
the load profile name;
the jobID (to include a link to the GitLab artifacts in the message).
Next, the data for the message is prepared.
Note that the exit code of each k6 run is written to a separate file: k6 run script.js is called for every script, and its exit code is appended to exitCode.txt.
While preparing the data for the message, this file is processed and a color gradation of execution success is built. This is optional, but it lets you grasp the result at a glance.
Only then is the message sent.
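In broad strokes, that processing might look like this (a minimal Node.js sketch, not our actual notification.js; the emoji gradation and exact argument handling are illustrative, while sendMessage is the standard Telegram Bot API method):

// Hypothetical sketch of tools/scripts/notification.js (Node 18+, global fetch)
const fs = require("fs");

const [stage, branch, chat, token, jobId, scenario, dashboardLT] = process.argv.slice(2);

// exitCode.txt contains lines like: "scripts/auth-api/login.js, 0"
const lines = fs.readFileSync("exitCode.txt", "utf8").trim().split("\n");

// Color gradation: 0 = passed, 99 = k6 thresholds crossed, anything else = error
const statusIcon = (code) => (code === "0" ? "🟢" : code === "99" ? "🟡" : "🔴");

const results = lines
  .map((line) => {
    const [script, code] = line.split(",").map((part) => part.trim());
    return `${statusIcon(code)} ${script}`;
  })
  .join("\n");

const text = `Load testing on ${stage} (${scenario}, branch ${branch}) finished\n${results}\nDashboard: ${dashboardLT}\nArtifacts: job ${jobId}`;

// Send the prepared message via the Telegram Bot API
fetch(`https://api.telegram.org/bot${token}/sendMessage`, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ chat_id: chat, text }),
}).then((res) => console.log("Notification sent:", res.status));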
Save time on reports
So, we have an automated launch in 2 clicks, and the bot notifies us in a timely manner when measurements are completed. This means that the next stage is reporting automation.
Whatever tool you choose to generate the load, it almost always has a standard report with the minimum necessary metrics.
In k6 the output looks like this:
Such a summary is enough to draw conclusions about the measurements and plan the next steps.
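By the way, k6 can also save this standard summary automatically through the handleSummary() hook, which already removes some manual copying (a minimal sketch; the output file name is arbitrary):

import http from "k6/http";
import { textSummary } from "https://jslib.k6.io/k6-summary/0.0.2/index.js";

export default function () {
  http.get("https://test.k6.io");
}

// Print the usual console summary and additionally save it as a JSON artifact
export function handleSummary(data) {
  return {
    stdout: textSummary(data, { indent: " ", enableColors: true }),
    "summary.json": JSON.stringify(data, null, 2),
  };
}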
Now let’s imagine that it is necessary to make several diverse measurements and summarize their results into an easy-to-read report for management.
Failure story: I used to roll up my sleeves, dive into the ocean of numbers, and aggregate them manually. And I got thoroughly burned out by it!
Anti-stress recommendation: take the time to set up a simple Prometheus-Grafana integration.
In our project, k6 collects load metrics, transmits them to Prometheus, and Grafana builds graphs and table summaries based on this data.
As a result, we get a single storage of results and dynamic visualization:
If you are just starting out in NT or are only evaluating load testing services, paid Enterprise solutions are most likely not available to you. But what interesting capabilities they have!
For example, Grafana Cloud offers to process measurement results and show them in comparison between several iterations.
When you really want it, but it’s expensive, the only way out is to create your own “bicycle.”
Our DevOps engineers built a Grafana dashboard for analyzing load metrics, including the ability to compare several measurements:
Let's figure out how it works.
Imagine that NT ran overnight in several iterations.
The load grew in stages, according to the following profile, for each of the scripts:
{
  "summaryTrendStats": ["avg", "p(90)", "p(95)", "p(99)", "count"],
  "scenarios": {
    "rampRate": {
      "executor": "ramping-arrival-rate",
      "maxVUs": 400,
      "preAllocatedVUs": 1,
      "timeUnit": "1s",
      "stages": [
        { "target": 5, "duration": "1m" },
        { "target": 5, "duration": "5m" },
        { "target": 10, "duration": "1m" },
        { "target": 10, "duration": "5m" },
        { "target": 20, "duration": "1m" },
        { "target": 20, "duration": "5m" },
        { "target": 40, "duration": "1m" },
        { "target": 40, "duration": "5m" }
      ]
    }
  }
}
What does such a profile do?
stage0: a one-minute ramp-up to 5 RPS
stage1: five minutes of stable load at 5 RPS
stage2: a one-minute ramp-up to 10 RPS
stage3: five minutes of stable load at 10 RPS
stage4: a one-minute ramp-up to 20 RPS
stage5: five minutes of stable load at 20 RPS
stage6: a one-minute ramp-up to 40 RPS
stage7: five minutes of stable load at 40 RPS
In the morning you sit down to analyze the results.
The overall summary allows preliminary conclusions, but judging quality requires more precise comparisons.
Ideally, we would compare the readings point by point for each scenario.
For example, with this pivot table:
But this solution is not ideal either, because the load was supplied in stages with increasing flow!
This is how we visualize the execution of the “Login” script with gradation by load stages:
Tile code: https://disk.yandex.ru/d/j80WalY5BAqZ8g
As the author of the script, I can see that degradation in authorization begins when the load grows to 20 RPS (stage4).
And if you look even deeper, you can find out that the problem is specifically in the logout request.
But there is one “but”…
I’m not the only one who uses the report, and the graphs need to be easy to read and not raise any questions.
Therefore, we are thinking about how to compare graphs based on the applied load in RPS.
We ended up with something like this:
The conclusions are the same: up to a certain load, the service API copes with the flow, and after that the “sausage” begins.
You can then set up the same graph for each API within the script or leave the output in a table.
The code for the tiles in both variants can be found at the link.
Trust, but check the results
Now we are no longer distracted by the mechanics of the measurements, and the visual report lets us analyze results quickly. That means we can turn to the quality of the code in the load scripts themselves.
Failure story
I only have to recall this epic failure for you to picture the situation.
Let's say there is a task – to test the response speed of an API for an authorization request in comparison between two releases.
Below is the code. There is only one request in the script, executed with different accounts:
import http from "k6/http";
import { scenario } from "k6/execution";
…
export default function (loginArray) {
  const body = {
    login: loginArray[scenario.iterationInTest % loginArray.length],
    password: commonPassword,
  };
  http.post(`${host}/api/auth/login`, body);
}
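For context: k6 passes whatever setup() returns into the default function, so loginArray above comes from a setup step that is omitted here. A minimal sketch of how such data might be prepared (the users.json file and variable names are illustrative, not our actual project code):

import http from "k6/http";
import { scenario } from "k6/execution";

// Test accounts are read once in the init context (open() is only available there)
const logins = JSON.parse(open("./data/users.json"));

const commonPassword = `${__ENV.PASSWORD}`;
const host = `https://${__ENV.host}`;

// k6 passes the return value of setup() into the default function
export function setup() {
  return logins;
}

export default function (loginArray) {
  const body = {
    login: loginArray[scenario.iterationInTest % loginArray.length],
    password: commonPassword,
  };
  http.post(`${host}/api/auth/login`, body);
}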
We already had the results for the previous release, and the measurements on the new version produced astonishing numbers: a 2x speedup!
The team and I celebrate the victory, and I start preparing a report on the work done… and then I notice in the logs that 100% of the requests returned an error.
Painful, frustrating, educational.
NT theory does not recommend complicating load scripts with data generation and functional checks. But the minimum necessary checks of the result are needed!
Load generation tools usually provide functions for checking data and results that barely consume resources.
And then you can think about setting thresholds (criteria for success or failure).
For example, if a request takes longer than 200 ms, we will receive a color-coded notification that the set boundary has been crossed.
You can also configure the script to stop running if the indicators go beyond a specified threshold.
Agree, it is more convenient to receive a message from a bot warning about unexpected behavior of the service than a “red” report after 2 hours of waiting.
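In k6 specifically, this is exactly what check() and thresholds are for. A minimal sketch for the login example from the failure story above (the 200 ms boundary and the credentials are illustrative):

import http from "k6/http";
import { check } from "k6";

export const options = {
  thresholds: {
    // Fail the run if more than 1% of requests end with an error
    http_req_failed: ["rate<0.01"],
    // Stop the test early if the 95th percentile crosses 200 ms
    http_req_duration: [{ threshold: "p(95)<200", abortOnFail: true }],
  },
};

export default function () {
  const res = http.post(`https://${__ENV.host}/api/auth/login`, {
    login: "user1", // illustrative credentials
    password: "password",
  });
  // Minimal functional check: the measurement is meaningless if logins fail
  check(res, {
    "login succeeded": (r) => r.status === 200,
  });
}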
How else can checks be developed?
Combine checks and thresholds (with the ability to stop the script).
Set thresholds for each load stage.
Set different levels of response (acceptable / alarming / incorrect).
Take care of test data
Failure story
Let's imagine a situation where you decide, on a whim, to run the same script several times in a row. (A true story that still makes my cheeks burn!)
You summarize the results in one table and get a rainbow, like in my picture:
The first thought my team and I had was: “There was a leak of resources!”
But, as it turned out, the problem lay elsewhere)
When starting regression NT, scripts were launched not only for reading data, but also for writing! Because of this, the collections in the database “swelled” with each iteration.
The NT theory insists on performing several similar measurements one after another.
What conclusions did we draw from this stupid mistake?
Before starting measurements, it is recommended to:
check the availability, sufficiency, and compliance of the data with the expected volume (including their absence);
prepare generation of similar data.
What else can be improved?
Automate checks for settings and other data configurations.
Automatically clean up individual collections or even trigger a database reset to its original state (see the sketch after this list).
Assess the possibility of importing a database before taking measurements if you need a large slice of data.
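As an illustration of the cleanup idea, here is a minimal sketch in the spirit of the check script above, assuming the same mongoClass helper built on xk6-mongo:

import { mongoClass } from "../../methods/base.method.js";

// Clean the write collection before the measurement so that previous runs
// do not inflate the data volume and skew the results
export function setup() {
  mongoClass.deleteMany("load_object_write", {});

  const leftover = mongoClass.count("load_object_write");
  if (leftover > 0) {
    // Throwing in setup() aborts the run before any load is applied
    throw new Error(`Collection is not empty: ${leftover} documents left`);
  }
}

export default function () {
  // ...the load scenario that writes to load_object_write goes here...
}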
Afterword
I will repeat my thought from the first article: it's okay to make mistakes. Mistakes are experience that teaches us to broaden our horizons and become better.
The path to NT is thorny but terribly interesting! You may run into far more problems than are mentioned in this article. But that is a reason to reflect on your experience and think about automation.
I also want to recommend the Telegram channel “QA – Load & Performance”. I came across it while preparing to speak at the Heisenbug conference and was pleasantly surprised to find a responsive community of load testing specialists there.