Hello, Habr! We constantly test various software solutions on our hardware, and sometimes a seemingly simple task drags on for weeks. Today's story is exactly such a case. Its main character is Pavel, a technical consultant at Atos in Russia.
Testing by the knights of Postgres
So: an Atos BullSequana S1600 server (16 Intel Xeon Platinum 8260 processors), logically split into two halves of 8 sockets each, with 4 Emulex LPe31002-M6 HBAs (2-port, 16 Gbps) installed, 2 per half. The FC links run through two Cisco MDS switches and, via multipath, present the system with a single 6 TB disk. At the very beginning of the tests each card had just one link connected, but later, in the course of diagnostics, for greater reliability (and, generally, for a better yield) all ports were connected, giving a total of 4 FC links per server half. The disk itself was not used during the tests.
At the start of our story, the OS on both halves was CentOS Linux release 7.7.1908 with kernel 3.10.0-1062.12.1.el7.
Card firmware version – 18.104.22.168 (recommended by Atos; it was updated along the way).
lpfc driver version (apparently the stock one shipped with the OS) – 22.214.171.124.
There is a mere 4096 GB of memory on each half; after reserving part of it for the hardware's needs, 3968 GB remains for the OS.
It all started when the Postgres DBMS specialists decided to stress-test the hardware with the stress-ng package, trying to prove that our equipment could not withstand the load (they had been having incidents, and the investigation set all of this in motion).
The stress-test parameters chosen were "wonderful"; here is the run command:
stress-ng --vm-rw 1000 --vm-rw-bytes 2500G --verify --metrics-brief -t 60m
According to the documentation, these parameters start 1000 workers ("start N workers that transfer memory to/from a parent/child"), give each of them 2500 GB to map ("mmap N bytes per vm-rw worker"), and make them exchange data via the Linux syscalls process_vm_writev and process_vm_readv, with the results verified for errors; and all of that for an hour. The data transfers themselves produced no errors, but the OS and the FC links ran into trouble.
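To get a feel for how oversubscribed that run was, here is a quick back-of-the-envelope calculation (a sketch; it assumes, as the stress-ng man page states, that --vm-rw-bytes applies per worker):

```shell
# Rough memory-oversubscription arithmetic for the first stress-ng run.
# Assumption: --vm-rw-bytes is per worker, per the stress-ng man page.
workers=1000
per_worker_gib=2500          # --vm-rw-bytes 2500G
ram_gib=3968                 # memory left for the OS on one half
total_gib=$((workers * per_worker_gib))
echo "requested: ${total_gib} GiB of mappings vs ${ram_gib} GiB of RAM"
echo "oversubscription: $((total_gib / ram_gib))x"
```

In other words, the run asked for roughly 630 times more address space than the half actually has in RAM, which explains why the kernel spent its time shuffling pages rather than doing useful work.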
Later, it must be said, they ran it with even merrier parameters: "stress-ng --vm-rw 2000 --vm-rw-bytes 3500G --verify --metrics-brief -t 10m".
Linux is cool. With such parameters it spent almost all of its time on context switches, memory accesses and shuttling data between NUMA nodes; it did not crash by itself, but it slowed down fiercely. After a little tuning our hardware also began to cope with this load without falling over (Lag? Who said lag?), but the FC links really did get hurt: they dropped one after another, and no amount of tweaking settings could fix it.
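The article does not say exactly what that "little tuning" was; purely as an illustration, a commonly used set of knobs for NUMA-heavy memory workloads on a kernel of this generation might look like this (an assumption, not the actual settings used in this story):

```shell
# Hypothetical tuning sketch for a NUMA-heavy memory workload.
# These are standard Linux sysctls, not the settings actually applied here.
sysctl -w kernel.numa_balancing=0    # stop automatic NUMA page migration
sysctl -w vm.zone_reclaim_mode=0     # allow remote-node allocation instead of local reclaim
sysctl -w vm.swappiness=1            # keep anonymous pages in RAM as long as possible
# Pinning a memory hog to one NUMA node instead of letting it span all 16 sockets:
# numactl --cpunodebind=0 --membind=0 <command>
```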
From the switches' side, it looked something like this:
2021 Feb 15 23:43:57 dn-MDS-C9148S-1 %PORT-5-IF_DOWN_LINK_FAILURE: %$VSAN 2221%$ Interface fc1/15 is down (Link failure loss of signal)
2021 Feb 15 23:45:24 dn-MDS-C9148S-1 %PORT-5-IF_DOWN_LINK_FAILURE: %$VSAN 2221%$ Interface fc1/27 is down (Link failure loss of signal)
2021 Feb 15 23:21:54 dn-MDS-C9148S-2 %PORT-5-IF_DOWN_LINK_FAILURE: %$VSAN 2222%$ Interface fc1/27 is down (Link failure loss of signal)
2021 Feb 16 00:00:02 dn-MDS-C9148S-2 %PORT-5-IF_DOWN_LINK_FAILURE: %$VSAN 2222%$ Interface fc1/15 is down (Link failure loss of signal)
Tech support lies (well, or is honestly mistaken)
Of course, the latter could not be tolerated, and a knowledgeable party got down to business: Emulex technical support.
First, at Pasha's request, the customer installed the Emulex OneCommand Manager Command Line Interface and tried a number of commands: listing the HBAs, checking port status, forcibly enabling a port, resetting the HBA card.
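In OneCommand Manager CLI terms, that sequence looks roughly like this (a sketch: the masked WWPN is a placeholder, and only commands known from the OneCommand CLI are shown; the exact port-enable subcommand is omitted):

```shell
# Hypothetical hbacmd session; 10:00:00:10:**:**:**:** is a placeholder WWPN.
hbacmd listhbas                                  # enumerate the installed Emulex HBAs
hbacmd portattributes 10:00:00:10:**:**:**:**    # check the port state (here: User Offline)
hbacmd reset 10:00:00:10:**:**:**:**             # reset the HBA port
```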
None of this helped, but it did become clear that the exact port status was User Offline. Later, after analyzing the command output, Emulex technical support gave the following answer about the User Offline status:
“The Port state goes to User-offline, when the port bring-up fails even after reset. This is done by FC Driver. The reason for port bring-up failure could be due to various reasons (May be link issue (or) switch F-Port issue (or) HBA N-Port issue (or) authentication issue etc.). ”.
It was not yet possible to reboot the server (tests were running); the customer tried various solutions, but the downed ports would not come back up, while freshly connected ports worked. All checks showed a problem with one port on each HBA card; everything else (switches, SFPs, cables) was perfectly fine.
First of all, a hefty chunk of information was sent to tech support in the form of logs collected with the dedicated OneCapture tool. Since the cards were more or less healthy (ports aside), a set of logs was collected (impressive in volume: two log bundles, 9 and 36 GB), and the smaller one went off to the valiant technical support specialists.
There weren’t enough logs.
Let us quote:
The issue here is that the link state went to LPFC_HBA_ERROR state because of which board_mode is showing port status output as error.
Driver will not be able to post mailbox if link state is in error state and it will start throwing errors.
To debug further, our Development team needs more driver logs with log-verbosity set to 0x1ffff on the errored port.
* Steps to follow to collect logs:
1. Set the verbosity log level using HBACMD:
# hbacmd setDriverParam 10:00:00:10:**:**:**:** LP log-verbose 0x1ffff
2. Reset the port so that the port initialization events start:
# hbacmd reset 10:00:00:10:**:**:**:**
(In case boot mode is enabled, disable it using the command below, then retry step 2:
# hbacmd EnableBootCode 10:00:00:10:**:**:**:** D)
3. After a few seconds, collect OneCapture again using the options below to skip Linux crash dump collection. This completes faster and gives a smaller file, since the crash dump is skipped:
# ./OneCapture_Linux.sh --FullCapture --Adapters=all --NoCrashDump
4. After this, collect the HBA dump as well, since OneCapture failed to collect it in the previous attempt:
# hbacmd dump 10:00:00:10:**:**:**:**
Then came a reboot, and the links rose from the dead (without even a whiff of decay). The card firmware was updated to the version given in the description above, and Emulex technical support was delighted.
But then testing continued, and the links began to drop again. It became clear that the problem was not resolved, and the logs had to be collected anew.
The customer, to whom the instructions above had been forwarded, grinned broadly (wouldn't you?) and collected new logs… Not without problems either: setting the verbose-logging flag via hbacmd refused to work. We eventually defeated this, by the way, with the command: echo 0x1ffff > /sys/class/scsi_host/host16/lpfc_log_verbose
"If you don't want the pill, here's a suppository for you. Same substance, different routes…"
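For reference, the same verbosity can also be made persistent through the lpfc module parameter (a sketch: host16 is the one specific port from the command above, while the module option applies to all lpfc ports):

```shell
# Runtime, per-port (as in the story): host16 is one specific FC host.
echo 0x1ffff > /sys/class/scsi_host/host16/lpfc_log_verbose
# Persistent, for all lpfc ports: set the module parameter and rebuild the initramfs.
echo "options lpfc lpfc_log_verbose=0x1ffff" > /etc/modprobe.d/lpfc.conf
dracut -f   # CentOS 7: regenerate the initramfs so the option applies at boot
```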
The logs were collected, and Emulex tech support withdrew to study them. To their credit, they spent only a day on the analysis.
The answer was perfect:
Our Development team has analyzed the logs and gave the below analysis:
Below sequence of events have forced the port to offline state:
1. IOCB time out error occurred and the IO abort was issued.
2. SCSI layer issued Device Reset of the LUN.
3. Bus reset performed by driver.
4. After the reset, the driver failed to post sgl with -EIO error and brought the port offline.
There were also some kernel traces regarding a tainted kernel (Oracle Linux):
wn2pynn00db0022 kernel: Tainted: G OE 4.14.35-1818.3.3.el7uek.x86_64 #2
Our development team believes that these logs indicate a possible SCSI midlayer issue and not an LPFC-driver or HBA-firmware issue. A proper kernel upgrade may be required to resolve this issue.
Of course, there were attempts to understand what they were writing about, but these were not crowned with success. With spells like these you could only summon Cthulhu…
To the question of whether there was a workaround, the answer was no.
There was nothing for it: we had to go where we were sent, to the customer, to ask for a kernel update.
The customer, the kernel and, this time, our own technical support
The customer was responsive: the kernel was updated to 3.10.0-1160.15.2.el7, and we ran the test again. The links dropped. The valiant knights of Postgres gleefully rubbed their paws (this was evident from the letters, though it might have been a hallucination from too much communication with technical support at various levels).
So, the links were still dropping for some reason; there is no vendor support for the OS (CentOS), and figuring out the driver settings on your own means sinking a lot of time with no chance of success (have you seen that talmud of a manual?! Here it is).
Our global support (yes, we escalated to our global team too) also said nothing new: the hardware is fine, the cards are fine, go to the OS vendor's tech support; but you could try updating everything to the latest versions. Or install an enterprise OS with support, and then go to that vendor's tech support…
What they forgot to touch
Pasha has worked in technical support for a long time, and one thing he has learned is that the source of a glitch is often whatever you forgot to touch.
Over the lifetime of this problem we had shaken and touched everything: the OS itself, the HBA firmware, the server firmware, the server hardware settings, GRUB parameters, factory defaults, switches and links…
Everything except the lpfc driver.
There was nothing to lose, so alongside the "switch to another OS" recommendation we also asked to update the driver to the latest version from the Emulex website: 12.8.340.9.
And it helped! After the driver update, the FC links stopped dropping. The sigh of relief nearly blew the monitor over, and Pasha himself (recoil effect, yeah) nearly fell off his chair.
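A quick way to check which lpfc driver version is actually loaded after such an update (standard modinfo/sysfs paths; the expected version is the one from the Emulex site):

```shell
# Driver version of the currently loaded lpfc module:
cat /sys/module/lpfc/version          # expect 12.8.340.9 after the update
# Version of the module file on disk (what would load on next boot):
modinfo -F version lpfc
# Firmware revision as seen by each SCSI host:
cat /sys/class/scsi_host/host*/fwrev 2>/dev/null
```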
The customer confirmed that since then there have been no problems with link drops, even under stress.
Perhaps some overlooked factor still lurks beneath the surface like a monster, but so far (more than three months now) all is quiet.
We managed to overcome the FC-link drops by updating the firmware, the driver and the kernel to the latest (or recommended) versions.
Tech support lies (well, or is honestly mistaken), so you have to diligently check everything yourself.
When troubleshooting, you need to touch and shake everything!