Brutal suicide of Veeam Backup & Replication or a little bit about finding and solving problems (troubleshooting)

Recently our small firm was approached by colleagues from an adjacent organization (in fact, a neighboring department) with a question: "We are hearing a knock from underground. Please explain how Veeam does that." I had to help.

To understand what happens next, it is worth recalling (or, for many, learning for the first time) what Veeam Backup & Replication (hereinafter Veeam) consists of. It is a backup system built on the principle "There can be only one": there is a management server, exactly one of it, all alone, plus the add-ons attached to it:

(note for moderators: this text is of low technical quality and contains no stories about health or Elbrus; DO NOT MOVE IT OUT OF THE CLOSET)

Other servers can be connected to the management server: storage, hotadd, replication, tape library management and so on. It sounds grand, but in practice you deploy an instance or a separate (physical or virtual) server and install somewhere between one and five services on it. In this case I was not particularly interested in the details of how the agents work, how the storage system is connected, or the portal.

Tape libraries are a niche product to operate. Few people are ready to pay $5k+ up front for a drive plus $2-3k (or more) for a robot, and the cartridges (tapes) still have to be bought separately. Operation is further complicated by the fact that tape libraries and the modern 99.95% virtualized and dockerized environment do not get along very well. VMware's position in particular, back in 2011, was essentially "the cat has abandoned her kittens, from here on they fend for themselves" (see "Understanding support for tape devices with VMware ESXi and VMware ESX (2007904)").

The KVM documentation dates from 2016; the Proxmox documentation refers to version 4, which is 2015.

Microsoft does not really approve either (1, 2).

So the same picture as above ends up looking a little different:

Picture taken from Kateriot, here

In case someone finds this text: the problems with passing a tape drive through into a virtual environment boil down to the fact that the device is composite. You have to make available not only the drive but also the robot that moves the cartridges, and the robot also presents itself as a SCSI device. The options here depend on what you have: SCSI/SAS or FC.
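The "composite device" point can be illustrated with a small sketch. It relies only on the standard SCSI peripheral device type codes (0x01 for a sequential-access device, i.e. the tape drive, and 0x08 for a medium changer, i.e. the robot). The function name and the sample inventory are hypothetical and exist purely for illustration; how you obtain the type codes on a real host is outside the sketch.

```python
# SCSI peripheral device type codes, as defined by the SCSI standard.
SCSI_TYPE_TAPE = 0x01      # sequential-access device: the tape drive
SCSI_TYPE_CHANGER = 0x08   # medium changer: the robot that moves cartridges


def missing_library_parts(device_types):
    """Return the names of tape library components absent from the given type codes."""
    required = {
        "tape drive": SCSI_TYPE_TAPE,
        "medium changer (robot)": SCSI_TYPE_CHANGER,
    }
    seen = set(device_types)
    return [name for name, code in required.items() if code not in seen]


# Hypothetical guest inventory: a disk (0x00) and a tape drive, but no changer.
guest_view = [0x00, 0x01]
print(missing_library_parts(guest_view))
```

If the list is not empty, the backup software will see a drive it cannot feed, or a robot with nothing to load.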

The problems were with a tape library. Luckily for me, my colleagues had, you could say, dug an old but still living HPE Gen7 out of the garbage somewhere, washed it, cleaned it up and connected it to the library (to the drive, of course), and it us..

A mandatory note. An HPE Gen7 is not very usable any more either. It is not that it may break tomorrow with no spare parts available; that can be worked around somehow. The problem is that the latest iLO 3 firmware, 1.94, came out two years ago, in 2020, and you will get ERR_SSL_VERSION_OR_CIPHER_MISMATCH when you try to log in. There are ways around that (1, 1.5, 2, 3, 9 ¾).

Connecting a tape library.

A tape library can be connected to a server in different ways. There is good old SAS (for example, the BC023A). There is FC. With FC you can connect either through a fabric (a switch) or directly, point-to-point (direct connect).

But it is not that simple. The forums say things like: "For Direct attached FC is required to use Private loop", "Supported only when running 8Gb or 4GB speed using a private loop topology", "Direct Attach topology Not supported with LTO-8 FC drives", "Direct attached Fiber Channel Point-to-Point is not supported with LTO-8."

By default the mode is set to auto-detect, but the reason many people (me included) dearly love HPE is that the official documentation says one thing:

"When using an LTO-7, LTO-8, or LTO-9 FC drive with a 32Gb or 16Gb HBA in direct attach mode, Port Type is typically set to Fabric Mode. Early (Gen5) 16Gb and 8Gb/4Gb host adapters may require the topology to be set to Loop Mode." Or this other thing:

Depending on the phase of the moon, the software version and other factors, Full Auto may not work, so:

“For LTO-7 and LTO-8 drives, Port Type should be set to Fabric Mode in most cases. Some older FC host adapters may require this set to Loop Mode.” (1)

All that is left here is the guide "Application of Topological Closures", if you can find it, of course. Author: Dee, translated by Semin.

Problem.

The problem is simple: Veeam refused to write more than 5-8 terabytes to tape. 2-3 terabytes were written perfectly. Sometimes. When it failed, the job ended with an uninformative "Something went wrong" error, sometimes after an hour, sometimes after three, sometimes after five. No other contributing factors could be identified right away. Topology: a virtual Veeam server, a physical HPE Gen7 server, and an IBM tape library attached via direct connect.

Investigation.

Veeam error: "Failed to continue tape backup session on the next tape medium. Tape fatal error. The I/O bus was reset. Cannot append file block to the end of file. Unable to asynchronously write data block."

First of all, it was decided that the problem might be somewhere in the physical layer. However, the tape library reported nothing of the kind, even with logging turned up to the maximum.

HPE has HPE Library and Tape Tools, which lets you run a number of tests against the tape library. Just remember to stop all services that work with the tape first.

The optical patch cord was replaced just in case. It did not get any better. The problem here is that with FC, to see the signal level on the host and the errors at the FC level, you have to dig around in OneCommand Manager, aka Emulex HBA Manager (1), or in QLogic SANsurfer / QConvergeConsole (Marvell QLogic QConvergeConsole) (1, 2), or in Brocade Adapter Software (1).

The next step was to collect statistics. This was done rather naively, because it was not clear whether the failure was a one-off or at what volumes and job types it occurred.

It would have been better, of course, to follow the article "How to Collect Logs for Veeam Backup & Replication", but without vendor support a full log export is merely interesting to look at. Nothing more.

Collecting job statistics did give a result, however, albeit a strange one: the failures occurred at the same times of day (several times a day, but at the same clock times on different days). That is when it got interesting.
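The "same time on different days" pattern is easy to miss by eye, so here is a minimal sketch of how such statistics could be checked. It is not what was actually run. The function name is made up, and the timestamps are invented for illustration and are not the real failure times.

```python
from collections import defaultdict
from datetime import datetime


def recurring_times(failures, min_days=2):
    """Map each HH:MM seen on at least `min_days` distinct dates to those dates."""
    days_by_time = defaultdict(set)
    for ts in failures:
        days_by_time[ts.strftime("%H:%M")].add(ts.date())
    return {t: sorted(days) for t, days in days_by_time.items() if len(days) >= min_days}


# Hypothetical failure timestamps, invented for illustration only.
sample = [
    datetime(2022, 3, 1, 2, 15),
    datetime(2022, 3, 1, 14, 15),
    datetime(2022, 3, 2, 2, 15),
    datetime(2022, 3, 2, 9, 40),
    datetime(2022, 3, 3, 14, 15),
]

for clock_time, days in sorted(recurring_times(sample).items()):
    print(clock_time, [d.isoformat() for d in days])
```

Matching on the exact minute is deliberately strict. With real data you would probably round timestamps to a 5 or 10 minute window before grouping, since job failures are reported with some delay after the underlying event.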

I (finally) opened the Windows event logs and saw the message "Base port lost fabric connectivity". Just like in this thread: "Veeam Tape backup error – Qlogic FC adapter: Base port (WWN = 10:00:8c:7c:ff:65:f7:38) lost fabric connectivity".
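If the event log has been exported to text, messages like this one can be picked out mechanically. The sketch below assumes only the message wording quoted above. The function name is hypothetical, and how the messages are exported from the event log is left out.

```python
import re

# A WWN as it appears in the message: eight colon-separated hex bytes.
WWN_RE = re.compile(r"WWN\s*=\s*((?:[0-9a-fA-F]{2}:){7}[0-9a-fA-F]{2})")


def link_loss_wwns(messages):
    """Return the WWNs named in 'lost fabric connectivity' messages."""
    found = []
    for msg in messages:
        if "lost fabric connectivity" in msg:
            match = WWN_RE.search(msg)
            if match:
                found.append(match.group(1).lower())
    return found


messages = [
    "Base port (WWN = 10:00:8c:7c:ff:65:f7:38) lost fabric connectivity.",
    "Some unrelated event text.",
]
print(link_loss_wwns(messages))
```

Put the timestamps of these events next to the job failure times, and the link between the two becomes hard to argue with.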

We need to go deeper.

When you need to go deeper, the first thing to remember is strace, and the second is Process Monitor.

In my case it was Windows, so we took Process Monitor and went off to watch, record and filter.

It turned out that at the moment the problem starts, VeeamDeploymentSvc calls wmiprvse (the Windows Management Instrumentation (WMI) provider process) to poll devices, after which bfadi (bfadi.sys) falls ill. And bfadi.sys is the FC/FCoE HBA Stor Miniport driver from Brocade Communications Systems.

At this point I was a little surprised, because as far as I could tell no jobs were running, and all the automatic polling ran once a day.
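The "watch, record and filter" part can also be done offline on a Process Monitor CSV export. The sketch below assumes the default export columns ("Time of Day", "Process Name", "PID", "Operation", "Path", "Result", "Detail"). The sample rows, PIDs and paths are invented placeholders, not the real trace.

```python
import csv
import io


def events_for(csv_file, process_names):
    """Return rows of a Process Monitor CSV export for the given process names."""
    wanted = {name.lower() for name in process_names}
    reader = csv.DictReader(csv_file)
    return [row for row in reader if row["Process Name"].lower() in wanted]


# Hypothetical export contents, for illustration only.
rows = [
    r'"Time of Day","Process Name","PID","Operation","Path","Result","Detail"',
    r'"2:15:01.0000001 AM","VeeamDeploymentSvc.exe","1234","RegOpenKey","HKLM\SOFTWARE\Example","SUCCESS",""',
    r'"2:15:01.0000100 AM","explorer.exe","4321","ReadFile","C:\example\placeholder.txt","SUCCESS",""',
    r'"2:15:02.0000005 AM","WmiPrvSE.exe","5678","CreateFile","C:\example\placeholder.bin","SUCCESS",""',
]

hits = events_for(io.StringIO("\n".join(rows)), ["VeeamDeploymentSvc.exe", "WmiPrvSE.exe"])
for row in hits:
    print(row["Time of Day"], row["Process Name"], row["Operation"])
```

With a real export you would pass an opened file instead of the `io.StringIO` object and narrow the time window to the minutes around a known failure.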

What a brutal suicide.

After a smoke break, coffee and some Internet searching, it turned out that no, there is a process after all: Veeam Host discovery, and it is even visible in the Veeam GUI among the jobs. The process exists, but the GUI has no settings for it. There is no public documentation either: there is a process, and that is that. I had to go searching again.

Normal people keep settings somewhere in a settings file. For Veritas Backup Exec / NetBackup, a setting is either a number in a file or even the mere presence of a file somewhere in NetBackup\db\config\ or /usr/openv/netbackup/. Commvault combines best practices: the registry and more besides, especially given that years of development have left Commvault with no fewer than two or three separate configuration stores. Veeam follows best practices by storing settings either in HKLM\SOFTWARE\Veeam\ (which even has documentation) or in /etc/VeeamAgentConfig. As for which settings to write to or read from the registry, that is something the forums occasionally mention (1, 2, 3).

Sometimes they also write about various keys, such as FilesNotToSnapshot, and sometimes they do not:

This plug-in version does not support Freeze-only snapshots. If you want to enable VM processing using VMware snapshots, contact Veeam support to change the value of a dedicated registry key. (1)
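To see what is actually set under the registry location mentioned above, the values can be dumped with Python's standard `winreg` module. This is only a sketch: the function name is hypothetical, the subkey passed in is just the HKLM\SOFTWARE\Veeam\ path from the text, and no particular value names are assumed. On a non-Windows host, or when the key is absent, it simply returns an empty dictionary.

```python
import sys


def read_hklm_values(subkey):
    """Return {name: data} for the values directly under HKLM\\<subkey>.

    Returns an empty dict on non-Windows hosts or if the key does not exist.
    """
    if sys.platform != "win32":
        return {}
    import winreg  # available on Windows only

    try:
        key = winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, subkey)
    except FileNotFoundError:
        return {}
    with key:
        value_count = winreg.QueryInfoKey(key)[1]  # (subkeys, values, last_modified)
        values = {}
        for index in range(value_count):
            name, data, _type = winreg.EnumValue(key, index)
            values[name] = data
        return values


for name, data in sorted(read_hklm_values(r"SOFTWARE\Veeam").items()):
    print(name, "=", data)
```

This reads only the values directly under the given key. Product-specific settings normally live in subkeys, so in practice you would point it at the relevant subkey or walk the tree with `winreg.EnumKey`.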

Solution.

The solution is obvious: Hanc marginis exiguitas non caperet. There could have been a link to my blog or Telegram channel here, but that would be overkill for the closet.