Automate! You can't do everything manually

To share our experience and learn something new, my team and I took part in this spring's Heisenbug conference for testers, with a stand and a talk on automated testing of embedded software.

To do that, we first had to prepare a test stand. Below I describe what my team and I prepared and how. The article should interest anyone who works with hardware and does test automation.

Test object

Any equipment is first of all hardware, however high-tech. To be fully functional, though, it needs its own embedded software, or firmware. The simplest example is the BIOS, but there are others.

The main character of my talk was a piece of embedded software called the BMC, the Baseboard Management Controller.

Strictly speaking, the BMC is the controller itself, but in everyday use "BMC" means the controller together with its firmware. The controller sits on the motherboard and, thanks to the firmware, can talk to all the peripherals on the board over special protocols, polling them and sending control signals. It can also communicate with external devices, which lets you manage the host remotely.

What is all this for

You probably use GitLab or something similar to manage your code, an artifact repository to store the results of your work, and some tool to deploy your applications; you also use banking or data processing services.

All of this runs on servers, and the BMC is what keeps those servers monitored and running smoothly.

How to use

The BMC has a simple WebUI: a web page shows the server status and firmware version and lets you view the inventory, change parameters, and so on.

But when you have many servers and want to automate working with them, the WebUI is not very convenient. For such tasks there are two machine-oriented interfaces, both easy to drive from the command line: IPMI, created by Intel, old but still in use, and the newer Redfish, based on REST requests.

Here is an example of using Redfish via curl:

curl https://my-bmc-01/redfish/v1/Chassis/Self/Sensors
curl https://my-bmc-01/redfish/v1/Managers/Self/NetworkProtocol

Requests like these make it easy to automate work with the server.
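The same requests can be issued from a test script. Below is a minimal Python sketch using only the standard library. The host name and resource paths come from the curl example; the use of HTTP Basic authentication, the helper names, and the option to skip certificate verification are assumptions for illustration, not details from our setup.

```python
import base64
import json
import ssl
import urllib.request

BMC_HOST = "my-bmc-01"  # host name taken from the curl example


def redfish_url(host: str, resource: str) -> str:
    """Build a URL such as https://my-bmc-01/redfish/v1/Chassis/Self/Sensors."""
    return f"https://{host}/redfish/v1/{resource.strip('/')}"


def build_request(url: str, user: str, password: str) -> urllib.request.Request:
    """Prepare a GET request with HTTP Basic authentication (an assumption)."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Basic {token}", "Accept": "application/json"},
    )


def redfish_get(host, resource, user, password, verify_tls=True):
    """Fetch one Redfish resource and return the decoded JSON body."""
    context = ssl.create_default_context()
    if not verify_tls:
        # Assumption: a lab BMC may present a self-signed certificate.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    request = build_request(redfish_url(host, resource), user, password)
    with urllib.request.urlopen(request, context=context, timeout=10) as response:
        return json.load(response)
```

A call might look like redfish_get(BMC_HOST, "Chassis/Self/Sensors", user, password, verify_tls=False), where the credentials are whatever the BMC is configured with.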

How to test BMC

There are different ways to do this, but we made one key decision: automate everything, because that will let us scale the process quickly and effectively in the future.

Our team had extensive experience at large IT companies such as Borland, Oracle, Huawei, and Deutsche Bank, and the team leaders had more than 20 years in IT. However, we had only ever automated testing of application software, where at first glance everything looks completely different. So we had to try many things for the first time and invent a lot almost from scratch.

When you try something for the first time, the easiest place to start is the user's side, so it is logical to begin testing there, that is, with the black-box method.

From this point of view, we have a system with three protocols, and to us they now look like regular software that needs testing. So we need to choose a programming language. The first thought is to use the firmware's native language, but why, if you are testing from the user's side? Besides, we already had existing work built on Python + Robot Framework. Python is a popular, in-demand language, and Robot Framework makes tests easy to write and read and produces simple, clear reports.

Since the test objects are physical servers and not the software itself, we expected the following issues to arise:

  • How do we scale? You can't just get hold of 1000 servers;

  • You can't just kill a frozen host and bring it back up again;

  • How do we cover different configurations if that means physically changing the server;

  • Physical access is required if something goes wrong.

Now we have reached the most interesting part – preparing the stand for testing.

Preparing for tests

First, you need to learn how to flash the BMC. This can be done through the WebUI, but since we want automation, it is better to do it with a script over Redfish. We write the script and start trying. The first run goes fine; the second brings a surprise: the test object has turned into a brick, that is, the BMC simply does not respond.

It is not clear what happened or what to do about it, but there is a way in. The board has special pins for BMC debugging. We take a Raspberry Pi and connect its USB to that connector through adapters. Then we connect to the RPi over ssh and from there open the BMC console. This lets us read the logs and try to bring the stand back to a normal state, for example with a simple shutdown -r now.
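The flashing script itself is not shown here. As a rough sketch of what such a script could look like, the Redfish standard defines an UpdateService.SimpleUpdate action that takes an ImageURI. The action path below is the conventional one (a real client should read the target from the UpdateService resource), and the helper names, timeouts, and probe function are hypothetical. The second helper polls until the BMC answers again, which is how a script can notice that the stand has become a brick.

```python
import json
import time
import urllib.request

# Conventional target of the standard action; assumed, not read from a real BMC.
SIMPLE_UPDATE_PATH = "/redfish/v1/UpdateService/Actions/UpdateService.SimpleUpdate"


def simple_update_request(host, image_uri, auth_header):
    """Build the POST request that asks the BMC to fetch and flash an image."""
    body = json.dumps({"ImageURI": image_uri}).encode()
    return urllib.request.Request(
        f"https://{host}{SIMPLE_UPDATE_PATH}",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", "Authorization": auth_header},
    )


def wait_until_alive(probe, timeout_s=600.0, interval_s=10.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call probe() until it returns True or the timeout expires.

    Returns False when the BMC never came back, i.e. the stand is a brick.
    """
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

The probe can be any cheap check, for example a request to the Redfish service root that returns True on a successful response.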

The first architectural pieces are in place: an RPi wired to the BMC debug connector and reachable over ssh.

We overcame the first problem and started running tests. And again, after several runs, the stand turns into a brick and the tests hang. We connect to the BMC through the RPi: zeros pour into the console and we can't enter a command...
Let's sort it out step by step:

1. Hanging tests.
Half of the tests ran and half hung, and we would like to keep the results of the ones that did run. So we divide the tests into "polite" ones and those that can make the stand unavailable, and run the "polite" ones first and the rest second.
2. A brick instead of a test object.
We have a hardware programmer: we pull the server out of the rack, open it, attach the programmer to the connectors, reflash the BMC, and the stand is alive again.
3. A critical bug.
We wrote tests that emulated user actions and drove the stand into a brick state. So we have a critical bug, but how do we investigate it? The cause is still unclear and, as it turns out, hard to reproduce. Our solution: save the logs from the BMC console and send them to an Elastic database for later investigation.
This lets us track all the logs and analyze the BMC's behavior.
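The "polite first" ordering from item 1 can be sketched in a few lines of Python. The StandTest class and its destructive flag are hypothetical names for illustration; they are not part of our framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StandTest:
    name: str
    # Hypothetical flag: True if the test may leave the stand unavailable.
    destructive: bool = False


def order_for_run(tests):
    """Return polite tests first and risky ones second.

    sorted() is stable and False sorts before True, so the original
    order is preserved inside each group.
    """
    return sorted(tests, key=lambda test: test.destructive)


if __name__ == "__main__":
    plan = [
        StandTest("flash_firmware", destructive=True),
        StandTest("read_sensors"),
        StandTest("get_inventory"),
    ]
    for test in order_for_run(plan):
        print(test.name)
```

With Robot Framework, the same split could be expressed with tags and two consecutive runs selected through the --include and --exclude options.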

Let's continue our adventures

We launch more and more test runs, and it turns out that the connection between the Raspberry Pi and the BMC is unstable. For example, it breaks if someone else logs in to the BMC on the same channel. You could write a script that restores the connection, or forbid manual logins, but neither option is reliable. Besides, the Raspberry Pi has four ports, and at some point there will be five servers, then more, because we want to add test stands. The scaling problem is already visible. Once again there is a solution: a 24-port converter that exposes each port over telnet. Now the infrastructure can scale, and the connection no longer breaks when someone connects to the BMC console.
In parallel we solve a second problem: to monitor the stands, we wrote Test Object Monitor. Its main job is to report that a stand is not ready before tests are sent to it, which saves us time and nerves. It can also show various technical information.
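To illustrate the idea of a readiness check before a run, here is a minimal sketch. It is not the actual Test Object Monitor: the Stand class, the choice of a TCP check on port 443, and the function names are assumptions.

```python
import socket
from dataclasses import dataclass


def tcp_port_open(host, port, timeout_s=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


@dataclass(frozen=True)
class Stand:
    name: str
    bmc_host: str


def unready_stands(stands, probe=lambda host: tcp_port_open(host, 443)):
    """Return the names of stands whose BMC does not answer the probe.

    The default probe assumes the BMC serves HTTPS on port 443.
    """
    return [stand.name for stand in stands if not probe(stand.bmc_host)]
```

Running such a check before each test run lets the pipeline skip or flag a stand instead of discovering halfway through that it is down.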

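Our setup keeps growing: the port converter now gives access to the BMC consoles, the console logs go to Elastic, and Test Object Monitor watches whether the stands are ready.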

We start the tests again, pick test groups, and it turns out that we have run out of stands. Finding a new server is not easy; you can't pull one out of your pocket. We ask colleagues for help and learn that the BMC can be virtualized on QEMU. The solution is incomplete, but it exists and can be automated: we can split the tests into those that can run on a virtual machine and those that cannot, and configure runs on QEMU.

As a result, we get a real CI, and Test Object Monitor turns into Test Object Manager (TOM): it not only tracks which stands are ready but also serves the test run queue, bringing up QEMU for runs that a virtual machine suits and handing out a physical stand otherwise. This also solves the problem of different environments: TOM automatically launches tests on stands with the equipment requested by the author of the test plan.

To analyze the results, we use ReportPortal (https://reportportal.io/), which lets us compare runs and automatically separate new test failures from known ones. It is a good tool, though not the only one on the market; we still have to choose the best one, and I will write about that later.
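The allocation logic described above can be sketched roughly as follows. This is an illustration of the idea, not TOM's real code: the class names, the equipment sets, and the rule that QEMU offers no special equipment are all assumptions.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass
class PhysicalStand:
    name: str
    equipment: FrozenSet[str]
    busy: bool = False


@dataclass(frozen=True)
class Job:
    name: str
    qemu_ok: bool  # True if the tests can run against a virtual BMC
    equipment: FrozenSet[str] = frozenset()  # equipment requested by the test plan


def allocate(job: Job, stands: List[PhysicalStand]) -> Optional[str]:
    """Return "qemu", the name of a physical stand, or None if the job must wait."""
    # Assumption: a QEMU instance provides no special equipment.
    if job.qemu_ok and not job.equipment:
        return "qemu"
    for stand in stands:
        if not stand.busy and job.equipment <= stand.equipment:
            stand.busy = True  # reserve the stand for this job
            return stand.name
    return None
```

A job that gets None stays in the queue until a suitable stand is released.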

What about testing diversity?

So far we only do system testing, but there are other approaches, so we plan to make our testing more diverse.
Here is what we will definitely try:

  • Automation of environments:

    • Anything that doesn't require physical access can, in theory, be automated;

    • Whatever does require it, we should try to change so that it no longer does.

  • Emulation of everything possible:

    • Backend for WebUI – definitely possible;

    • BIOS? We don't know yet, but it seems that it is partially possible;

    • Hardware? After all, hardware components talk to each other and to the BMC over some protocols;

    • Erroneous states? We know for sure that a signal about an erroneous hardware state can be emulated, so it should be.

What can be said in conclusion?

Analogies are a powerful tool. It turned out that many approaches from the world of regular software, with some adaptation, work great in the world of hardware. We knew this before, and now we are convinced once again, so we will keep searching...
And we will keep being creative: first, it is enjoyable; second, it gets results.

Share your thoughts on this story and your ideas for test automation in the comments!
