Stumbling Blocks and Other Stories about the Everyday Life of Service Support in the New Reality

The market for spare parts and IT equipment is in a turbulent state. Customers, service providers, and integrators are all looking for non-standard ways to keep their systems running. The current situation regularly gives rise to, let's say, non-trivial problems that have to be solved on the go.

We've put together a selection of stories about the challenges our engineers have faced over the past two years. Just don't expect heroic tales – this is about the hard, everyday work, without embellishment or cuts.

Market issues: what is happening with service in Russia now

One of the main problems of the industry today is the lack of spare parts and the failure of outdated equipment. Customers still need service, but many vendors have left, and it is not always possible to service the equipment on your own during the migration to new solutions. As a result, companies come to us for service – no one else will help.

In 2022, the situation was relatively stable, but since 2023 the number of service requests has increased sharply. Until 2022, we serviced some of our customers' equipment ourselves, mainly the less complex devices, while the rest was handled by vendors. After the vendors left, we continued to fulfill our obligations on our own. There were few new contracts during this period, as customers took a wait-and-see attitude, hoping that the previous conditions would return.

By autumn, when it became clear that the vendors would not return, customers began actively looking for new service partners. Some would literally show up on our doorstep with lists of more than 500 pieces of equipment and ask us either to replace everything by tomorrow or to prepare a commercial offer for a thousand spare parts. By then we had accumulated significant experience of working independently and were able to help, although we had to assign dedicated specialists to handle such urgent requests.

A separate problem is unscrupulous suppliers. Sometimes they substitute a similar part without any notice. Sometimes they swap the labels in the hope that everything will work anyway. And the hardware alone is not enough: you also need up-to-date documentation, which has to be obtained from customers or partners. Even that does not protect against problems. Incidents caused by undocumented features of spare parts are resolved only thanks to accumulated experience.

Story One: Two of a Kind, Identical in Appearance – Hitachi Controllers

We were among the first in Russia to encounter this problem. Perhaps no one else knew about it at that time. We delivered a Hitachi E790 storage system and had to provide a warranty with a strict SLA. We studied the available documentation, purchased the necessary spare parts and worked for some time in full confidence that everything would be fine.

However, one evening (on a weekend, of course!) the engineer on duty suddenly wrote asking for advice on which batch number of the Hitachi E790 controller should be requested from the warehouse.

This was strange, since the documentation we had used to prepare listed only one controller option (type CTLMN). In response to our bewilderment, the engineer sent a screenshot from a version of the documentation that already listed two options.

It turned out that only half of the delivered storage systems had the CTLMN controller, while the rest had a newer CTLMNA controller. The difference between the two parts remains a mystery, but experience has shown that they are not interchangeable. What a surprise…

Apparently, the vendor had released a second revision, but we do not have direct access to Hitachi service documentation. It reaches us with a delay, so we could not prepare in advance. In addition, on this storage model the controller's part number (PN) cannot be seen from the outside: the sticker with the number is on the inside, and the storage management interface does not report it.

As a result, the on-site engineer had to coax the old controller back to life through trial and error. Since the customer had built redundancy into their infrastructure, this did not create any serious risk for the business, and the controller kept working for a while as we waited for delivery from the Chinese factory.

Given the logistics risks and the fact that we could not afford a second mistake, we ordered the controller by express air delivery. Its battery was removed in advance, since batteries may not be shipped this way. We also ordered a complete, fully assembled storage system on a different flight, just to be sure.

After this incident, we introduced a rule: engineers must verify that their documentation is current and always work from the latest version. We learned to determine the controller type remotely from indirect signs, using a database of web interface screenshots. And, of course, we now keep both controller variants in stock. It is better to learn from other people's mistakes, but for now we have to learn from our own.

Story Two: How We Adjusted Tape Drives

Large companies use and will continue to use tape storage for backup for a long time, but in 2022, difficulties arose with the purchase of many library models.

The situation with LTO tape drives is peculiar: very few companies in the world make them. Previously, they were produced by only two companies, IBM and HPE. Since the seventh generation (LTO-7), only one manufacturer remains: IBM. It produces drives for different vendors with different firmware: for Quantum, for HPE, and so on.

Previously, we, like other service providers, performed unit-level repairs. We simply replaced a part with another one bearing the same part number. Specific part numbers were defined for each model. We bought the parts or received them from the vendor, and that was all the service involved.

In 2022, it turned out that for some libraries equipment could be purchased relatively easily, while for others it was extremely difficult, and the contracts still had to be honored. That's when we remembered that all these drives de facto come off the same production line.

We found an employee of one of the vendors who agreed to help us find a way to reflash common drives for use in such “rare” libraries, for example, from an IBM TS3310 to a Quantum i500.

The task turned out to be difficult and we were unable to solve it, even with outside help. However, in the process another idea came to mind…

The thing is that a drive essentially consists of a mechanical part and an electronic part: on the one hand, a set of gears, heads, contacts, and cables, and on the other, a control board. At first the board seemed inseparable from the drive, yet that is where the main differences lie. The mechanics are roughly the same everywhere.

We bought the necessary tools and started experimenting. As a result, we learned to swap a control board from one drive to another in about 30 minutes. Now we can respond quickly to customer requests without waiting a month for the right equipment to arrive. We use this method when, for example, the warehouse has surplus IBM TS4300 spare parts but drives are needed for its twin, the HPE MSL3040 library, which is currently in short supply.

Story Three: The Tape Library Nightmare

We were delivering two libraries to a customer and had to provide a three-year warranty with a strict SLA.

In our work we always factor in the probability of equipment failure. Knowing how many units of each item the customer has, we build up an operational stock in the warehouse that covers at least six months to a year ahead. Then we gradually buy more to replenish it.

This approach applies to all equipment, but tape drives are a special case. They are mechanical devices exposed to dust, frequent cartridge changes, and the friction of the heads against the tape. Their probability of failure is therefore quite high: from 10 to 20% for older models that have been in operation for 3-5 years. For comparison, for other devices this figure is 1-2%.
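The stocking logic described above can be illustrated with a rough calculation. This is only a sketch: the function name `spare_stock` is hypothetical, the quoted percentages are treated as annual failure rates (an assumption, since the text does not state the period), and rounding up to whole units is a simplification rather than our actual procurement formula.

```python
import math

def spare_stock(units: int, annual_failure_rate: float, horizon_years: float) -> int:
    """Expected failures over the planning horizon, rounded up to whole spares.

    Assumption: failures are independent and the rate is constant per year.
    """
    if units < 0 or annual_failure_rate < 0 or horizon_years < 0:
        raise ValueError("inputs must be non-negative")
    expected = units * annual_failure_rate * horizon_years
    # Round first so floating-point noise (e.g. 3.0000000000000004) does not add a spare.
    return math.ceil(round(expected, 9))

# 36 tape drives, stock planned one year ahead, at both ends of the 10-20% range
low = spare_stock(36, 0.10, 1.0)
high = spare_stock(36, 0.20, 1.0)
print(f"Spares for one year: {low} to {high}")
```

Under these assumptions the same 36 units would need a single spare at a 2% rate, which shows why tape drives dominate the spare-parts budget.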

We delivered 36 drives and 2 libraries to the customer, and purchased 10 additional drives so that spares would always be available in the warehouse. During initial testing, the engineers found several faulty drives. They assumed the cause was either a long journey with multiple transfers or an ordinary manufacturing defect. But the fact remains: every few weeks another drive would fail, whether because of the power supply or some other component.

We knew the statistics and saw that for this particular delivery the number of failures was several times higher than the norm. In six months we used up all 10 spare drives, although we had planned for them to last three years or even longer.

The failure trend was a serious concern. We noticed it right away, but new deliveries take two to three months. We had to urgently look for replacement drives on the market, swap boards, and develop alternative solutions. Ten months have now passed since the delivery, and 17 drives have already failed. It looks like by the end of the three-year contract all of them will have to be replaced, possibly several times.
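The arithmetic behind that forecast can be sketched as a simple linear extrapolation. The function name is hypothetical, and the assumption that failures continue at a constant rate is ours for illustration only; real failure curves are rarely linear.

```python
def projected_failures(failed: int, months_elapsed: float, contract_months: float) -> float:
    """Linear extrapolation: failures so far, scaled to the full contract term."""
    if months_elapsed <= 0:
        raise ValueError("months_elapsed must be positive")
    return failed / months_elapsed * contract_months

installed_drives = 36
# 17 failures in the first 10 months of a 36-month contract
projected = projected_failures(failed=17, months_elapsed=10, contract_months=36)
replacements_per_drive = projected / installed_drives

print(f"Projected failures over the contract: about {projected:.0f}")
print(f"Average replacements per installed drive: about {replacements_per_drive:.1f}")
```

At this pace the projected number of failures exceeds the number of installed drives, which is what "all of them, possibly several times" means in numbers.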

This experience has taught us to be even more careful when choosing suppliers. Now we check each partner thoroughly, following the principle “measure seven times, cut once.” That is the main mantra of the IT procurement market in Russia!

Story Four: IBM's Storage System Has a Heart Attack

A couple of years ago, we took on support for an IBM FlashSystem 5200 with nine FCM disks of 19 TB each. These are NVMe SSDs manufactured by IBM. To be on the safe side, we purchased two spare disks. One day, we received a request to replace a faulty disk. We issued a spare and immediately ordered a new one. The next day, another request came in. We issued the second spare, leaving us without a reserve. And then requests came in for two more disks…

It turned out that IBM had flagged certain firmware versions of these disks as requiring an urgent update. While the FlashSystem was still under vendor support, the problem had not yet come to light, and afterwards the vendor did not inform Russian customers about it. At some point the disks simply started shutting down one after another.

Frankly, we did not expect such a catch, and at first we decided simply to replace all the disks in the storage system with next-generation ones as part of the service. Only later did we figure out that the affected disks could be updated and restored. Now we use the older disks for our own non-critical tasks, and we visit every customer in the risk zone to check their firmware versions. We strongly recommend updating.
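A version check of this kind can be sketched in a few lines. Everything specific here is hypothetical: the dotted version format, the threshold, and the inventory are invented for illustration and do not reflect real IBM firmware numbering or the actual affected versions.

```python
from dataclasses import dataclass
from typing import List, Tuple

def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '1.10.0' into (1, 10, 0) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in text.split("."))

@dataclass
class Drive:
    slot: int
    firmware: str  # hypothetical dotted version string

def drives_needing_update(drives: List[Drive], minimum_safe: str) -> List[Drive]:
    """Return drives whose firmware is older than the minimum safe version."""
    threshold = parse_version(minimum_safe)
    return [d for d in drives if parse_version(d.firmware) < threshold]

# Made-up inventory and threshold, for illustration only
inventory = [Drive(1, "1.2.3"), Drive(2, "1.10.0"), Drive(3, "1.9.9")]
at_risk = drives_needing_update(inventory, minimum_safe="1.10.0")
print("Slots to update:", [d.slot for d in at_risk])
```

Comparing parsed tuples rather than raw strings matters: as plain text, "1.9.9" would sort after "1.10.0" and the outdated drive would be missed.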

Story Five: Suspicious Transceivers

Another company recently needed to replace about 350 optical transceivers (32G FC SFP) on several Brocade X6-8 SAN switches. This is well above the expected failure rate, which is typically around 5%.

The transceivers had not actually failed. While testing the equipment, the customer's engineers discovered that the signal level in the transceivers was slightly below the established standard (perhaps due to wear and tear, perhaps for some other reason). Although the transceivers were still in working order, management decided to play it safe and replace several hundred devices at once.

When we contacted distributors about replacement transceivers, we received an interesting offer: alternative devices at half the usual price. According to the distributors, these transceivers had been reflashed, had no stickers, and showed up slightly differently in the system.

“A great offer!” we would have said, if we hadn't learned from the tape library incident. We had to politely decline, scour the market, overpay, but get guaranteed new and original equipment.

And numerous “set-ups” from suppliers

There are many stories like these, but the underlying problem is the same: the Russian market is now full of counterfeit products. Relabeling, re-marking, and all sorts of fraud are common.

In one case, an employee was putting together a purchase specification and made a mistake in one part number, entering a number that does not exist. By the time he discovered the mistake, the supplier had already confirmed delivery under the non-existent part number without batting an eye, and had even provided photos of the product. Soon, 10 “brand new” SSDs arrived with the non-existent part number on their labels. And this was not a batch of hundreds of units, where substitution could bring in a lot of money, but a small, targeted forgery.

Here is some of what we have seen in two years:

  • disks with 6G SAS interface instead of 12G (although the stickers are correct);

  • storage disks with incorrect sector size/firmware/identifier;

  • regular OEM RAM instead of specialized HPE Smart Memory/Lenovo TruDDR4 and memory for storage controllers;

  • Lenovo SR650 V2 servers whose firmware came not from the vendor's main update branch but from a very specific one (possibly for the Chinese market, or homemade), which made it impossible to switch from one branch to the other. We had to replace the entire board with a normal worldwide one.

Over time, the number of problematic situations grew, and they began to affect even suppliers previously considered reliable. We realized that our previous approach to accepting spare parts no longer worked: we could not keep accepting components based only on matching part numbers.

Changing suppliers did not solve the problem: sooner or later, all of them started making similar errors. In the end, we had to create a separate group of engineers who specialize in checking all deliveries. Now, goods in potentially risky categories do not go straight to the warehouse but are first sent to these specialists for thorough examination. Thanks to this approach, some spare parts are returned to suppliers within the first weeks after receipt, while the delivery warranty is still in effect.
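The idea of checking more than the part number can be sketched as a comparison between what the label claims and what the device actually reports. This is an illustration, not our inspection procedure: the field names, the catalog, and the part number `PN-EXAMPLE-001` are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Set

@dataclass(frozen=True)
class DiskSpec:
    part_number: str
    link_speed_gbps: int  # e.g. 6 or 12 for SAS
    sector_size: int      # bytes

def acceptance_issues(label: DiskSpec, measured: DiskSpec, catalog: Set[str]) -> List[str]:
    """Compare the label against the vendor catalog and against the device's own report."""
    issues = []
    if label.part_number not in catalog:
        issues.append(f"part number {label.part_number} is not in the vendor catalog")
    if measured.part_number != label.part_number:
        issues.append("reported part number differs from the label")
    if measured.link_speed_gbps != label.link_speed_gbps:
        issues.append(
            f"link speed {measured.link_speed_gbps}G does not match label {label.link_speed_gbps}G"
        )
    if measured.sector_size != label.sector_size:
        issues.append("sector size does not match the label")
    return issues

# A disk labeled 12G SAS that actually negotiates at 6G (hypothetical values)
catalog = {"PN-EXAMPLE-001"}
label = DiskSpec("PN-EXAMPLE-001", 12, 512)
measured = DiskSpec("PN-EXAMPLE-001", 6, 512)
issues = acceptance_issues(label, measured, catalog)
for issue in issues:
    print("REJECT:", issue)
```

A disk like this would pass a check based on the sticker alone, which is exactly the gap a dedicated inspection step is meant to close.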

Experience, the son of difficult mistakes…

Work has become more difficult, work has become more fun. The experience of recent years can be most easily summarized in a set of rules.

  • You have to be flexible. The market is changing quickly, and solutions that seemed ideal two or three years ago no longer work at all.

  • Build relationships with customers: don’t limit yourself to dry contracts, but share expertise and build trust.

  • Transparent and regular communication ensures that both parties understand each other even in difficult situations. This trust helps to cope with any situation.

It is important to support the customer, even if it is temporarily unprofitable. For example, free replacement of spare parts, even at a loss to the company, can help in a critical situation and thereby strengthen partnerships.

Customers share information. No matter how much companies compete within their industry, their employees communicate with each other. The market is relatively small, and negative reviews spread quickly. No company would risk entrusting its critical infrastructure to a vendor with a dubious reputation.

  • Having a warehouse with rare spare parts is a big plus, or rather, a necessity.

Recently, when holding tenders, customers have increasingly expressed a desire to personally visit the supplier's warehouse. Everyone wants to make sure that there is a sufficient stock of components in case of unforeseen situations. A warehouse is not a museum, and we have not previously practiced demonstrating it to customers. We had to adapt.

The acceptance process now includes a thorough inspection and rejection of non-conforming parts. In response to this growing problem, in May 2024 we added two new positions – dedicated parts inspection engineers. They are focused solely on quality control of incoming components.

The best approach is in-person inspection at the production site. If that is not possible, the supplied products must be checked carefully after they arrive.

Investments in quality control pay off. The cost of sending a specialist is significantly lower than the cost of regularly replacing faulty drives. Moreover, having dedicated engineers to check deliveries is more cost-effective than risking losing customers due to poor quality.

Everyone in the industry is currently experiencing unusual challenges, and it is becoming normal to share solutions. Despite the competition, there is an understanding that in order to overcome common problems, it is necessary to join forces. This is, by the way, the idea of this article. Join the discussion in the comments.
