“Congratulations on the terabit.” That very article about the 2023 DDoS – uncensored

Some time earlier

You most likely know Timeweb – and most likely as a shared hosting provider, with side services such as domain registration, a website builder and so on. There is also a cloud that grew out of that hosting, and that is what this article is about.

In 2023, we began slowly carving up the legacy hosting network and reshaping it into something closer to an IaaS provider's network. But on day X the architecture was still fairly typical for hosting: several routers, a stack of stretched VLANs, three transits and a couple of exchange points.

In the hosting business, uplink capacity is sized from the total traffic on the network plus a reserve for failures and attacks, usually not exceeding 10x the busy-hour traffic. Infrastructure and client protection is built from the same figure: you estimate the maximum expected total attack volume, add a small margin, and pick countermeasures that protect an individual client without the whole network falling over.
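For illustration only, here is a back-of-the-envelope sketch of that sizing logic in Python; the numbers are invented and the 10x cap is just the rule of thumb from the paragraph above, not a formula we actually run anywhere.

```python
# Rough uplink sizing for a hosting network, per the rule of thumb above:
# provision for busy-hour traffic plus a reserve for failures and attacks,
# with the total normally capped at ~10x the busy-hour volume.
def uplink_capacity_gbps(busy_hour_gbps: float, reserve_multiple: float = 10.0) -> float:
    if not 1.0 <= reserve_multiple <= 10.0:
        raise ValueError("the rule of thumb keeps the total within ~10x of busy-hour traffic")
    return busy_hour_gbps * reserve_multiple

# Hypothetical example: 40 Gbit/s at the busiest hour -> plan for up to 400 Gbit/s of uplinks.
print(uplink_capacity_gbps(40.0))
```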

It is important to understand that any typical hosting provider is under DDoS attack almost 24/7: at least one client is being attacked at any given moment. So for us DDoS is almost mundane. Over 17 years we have seen a great deal and, by and large, know the key patterns (and then some): what gets hit, where and how.

Good evening

On September 14, closer to 18:00, we were hit by yet another DDoS of the kind we know how to handle. We fended it off and went back to our evening. But before I had finished my tea, the attack repeated, then again, and then again. Each time the interval shrank and the volume grew.

Screenshot from StormWall at that moment

For those who like a brief summary, here is the chain of cause and effect as it unfolded on the infrastructure.

More traffic is poured at a site than it can handle → the routers are overloaded to the point of losing signaling → we block the attack out-of-band via RTBH/FlowSpec or switch the network over to a third-party traffic scrubbing center → the target of the attack changes within five minutes.
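To make the RTBH step concrete: an out-of-band block typically boils down to announcing the attacked /32 to the upstreams with the well-known BLACKHOLE community (65535:666, RFC 7999) so they drop the traffic at their edge. Below is a minimal sketch of how such an announcement could be pushed through ExaBGP's process API; the addresses, next-hop, timing and the helper itself are illustrative assumptions, not our production tooling.

```python
#!/usr/bin/env python3
# Minimal RTBH helper meant to run as an ExaBGP "process": ExaBGP reads
# whatever this script prints to stdout and turns it into BGP announcements.
# Prefix, next-hop and timing below are placeholders for illustration.
import sys
import time

BLACKHOLE_COMMUNITY = "65535:666"   # RFC 7999 well-known BLACKHOLE community
DISCARD_NEXT_HOP = "192.0.2.1"      # any next-hop the upstream routes to discard

def blackhole(prefix: str) -> None:
    # Announce the attacked host so upstreams drop traffic towards it.
    sys.stdout.write(
        f"announce route {prefix} next-hop {DISCARD_NEXT_HOP} "
        f"community [{BLACKHOLE_COMMUNITY}]\n"
    )
    sys.stdout.flush()

def unblackhole(prefix: str) -> None:
    # Withdraw the blackhole once the wave is over.
    sys.stdout.write(f"withdraw route {prefix} next-hop {DISCARD_NEXT_HOP}\n")
    sys.stdout.flush()

if __name__ == "__main__":
    blackhole("203.0.113.10/32")   # sacrifice the attacked host to save the site
    time.sleep(300)                # keep the block while the wave lasts
    unblackhole("203.0.113.10/32")
```

The obvious downside, as the chain above shows, is that RTBH completes the attack against that one host – which is exactly why the target kept changing every five minutes.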

Additional problems in St. Petersburg: the uplinks are connected through oversubscribed switches, QoS either does not work or the buffers are too small → the signaling falls apart. On top of that there are huge stretched VLANs, which means an attack on one subnet affects a huge number of clients. Monitoring and countermeasures are too slow.

Sometimes I had to prioritize

And in the other locations: we did not have our own network there, and the data centers would cut us off to protect their infrastructure. When the network is delivered straight from the data center's hardware, there is not only no way to block an attack – even identifying the pattern is hard: one moment everything is fine, the next the nodes drop off. All the information you get is a Zabbix trigger. The saddest moments were when several locations were down completely, flat out, for days. Even our upstream providers in those data centers simply said: we are not prepared to filter this, so we are switching you off; once the attacks stop, we will plug you back in.

We're putting together a plan

First: learn to block at least part of the attack at the level of the provider routers. The goal is to reduce the impact on customers and protect the infrastructure.

Second: teach our network to digest the entire uplink capacity without special effects, while expanding it at the same time.

On the hardware side: retire the pile of small routers and install a chassis, widen the channels or move them directly onto the routers, and remove the oversubscription. In parallel: improve the DDoS protection to the point where we can block an attack faster than clients not receiving the parasitic traffic even notice it.

And strategically: build our own networks in all locations. And our own protection.

First of all, we abandoned the existing DDoS suppression system, because it was doing more harm than good. The network engineers started sleeping in shifts, and we replaced the flow monitoring with sampled inline IPFIX with payload. That way we don't wait for flow records to be assembled and can make decisions in seconds. This step cut the average detection time per attack: understanding that an attack had started and how to act initially took us a couple of minutes, a little later 15 seconds, and now the automation reacts almost instantly.
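Roughly, the detection side now works on sampled records rather than waiting for complete flows to export. The sketch below shows the idea on a stream of already-decoded samples; the record fields, sampling rate, window and thresholds are assumptions for illustration, not our actual values, and the real collector does much more.

```python
# Toy detector over sampled flow records (IPFIX-style), illustrating the idea:
# aggregate per destination over a short window and flag a destination as under
# attack as soon as its extrapolated rate crosses a bps or pps threshold.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Sample:             # fields a decoded sample might carry (illustrative)
    dst: str
    bytes: int
    packets: int

SAMPLING_RATE = 1000      # 1-in-N packet sampling, assumed
WINDOW_SECONDS = 5        # decision window, assumed
BPS_THRESHOLD = 10e9      # 10 Gbit/s per destination, assumed
PPS_THRESHOLD = 2e6       # 2 Mpps per destination, assumed

def detect(window_samples: list[Sample]) -> set[str]:
    """Return destinations whose extrapolated rate exceeds the thresholds."""
    bytes_per_dst = defaultdict(int)
    pkts_per_dst = defaultdict(int)
    for s in window_samples:
        bytes_per_dst[s.dst] += s.bytes * SAMPLING_RATE
        pkts_per_dst[s.dst] += s.packets * SAMPLING_RATE

    attacked = set()
    for dst, nbytes in bytes_per_dst.items():
        bps = nbytes * 8 / WINDOW_SECONDS
        pps = pkts_per_dst[dst] / WINDOW_SECONDS
        if bps > BPS_THRESHOLD or pps > PPS_THRESHOLD:
            attacked.add(dst)     # hand off to the blocking automation
    return attacked
```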

Work environment

Initially, mitigation was manual, but a little later decision-making was automated: monitoring learned to block a DDoS immediately upon detection. As a result, between September 14 and 20 we blocked more than 20 thousand individual patterns.

All this time, through every channel – Telegram, social networks, tickets – customers were worried, swearing and asking questions. And I understand them perfectly. Speaking of which, a thing of beauty:

We keep improving the protection: making it faster and more advanced so that it makes the right decisions more often. We are dismantling the old architecture and the redundant parts of the network – all under load and ongoing attacks, and in a way that keeps the impact on clients minimal.
Around this time the attackers realize we have done something, so they change patterns. We start receiving powerful short bursts of traffic that neither our software nor most of the protection products on the market could react to in time: the flood was so diverse and fast that our upstreams and some exchange points began tripping prefix limits on our sessions. During those periods it looked like this:

Building our network

We started with St. Petersburg. The plan included upgrading the router – installing additional line cards – and connecting more channels and traffic exchange points. The goal was to increase throughput: we needed to learn to accept attack traffic and block it more precisely, rather than just throwing blackholes around and withdrawing prefixes. It also became clear that attack volumes could keep growing, so we would need to be able to expand capacity quickly, without going through the whole cycle of “find hardware → find capacity → assemble.”

Main router in St. Petersburg. The screenshot shows an MX480 with 2xSCBE2, 2xRE-S-2X00x6 and 4xMPC7E MRATE

General-purpose routers are not always effective for this kind of work: their traffic-processing pipeline is overly complex and expensive, and built for other tasks. So we decided to act comprehensively: in addition to widening channels and increasing the port capacity of the service and edge routers, we began deploying packet platforms based on Juniper PTX. They are simpler, but they offer a lot of inexpensive 100G/400G ports, which is exactly what we need.

Thanks to good relationships with suppliers, we were able to source the network equipment quickly: delivery took only a month and a half, which is very fast for hardware of this class.

As a result, in St. Petersburg we increased capacity in the main directions to 500+ Gbit/s, and across the autonomous system we now have about a terabit in total. Just two weeks later the situation in St. Petersburg stabilized: there was enough capacity, and filters were applied promptly. In the other locations – both Kazakhstan and Europe – the network was rented. So, in parallel with stabilizing St. Petersburg, we had a new priority: install our own routers in the foreign locations and reach them from St. Petersburg. We decided to go through M9.

M9 is still the largest peering point and the biggest concentration of telecom infrastructure in Russia and the CIS. Besides the main routes for the whole of Russia, channels from the CIS also land there – often the only ones.

Trunk channels between sites provide several advantages:

  1. The ability to send and receive traffic through any other TimeWeb connections in all countries.
  2. The ability to provide clients with additional services in the form of communication channels.
  3. Our control plane will not fall apart even if the external links at a location are completely saturated.

So, we started with Kazakhstan: we extended a channel to M9 and sent the traffic through our own network.

The MX204 at M9, terminating trunks and external links. We will soon replace it with an MX960 and fill it with 100G ports

By the way, we weren't lucky with the delivery to Kazakhstan the first time. The staff at Kazakh customs changed – and everything ground to a halt. The situation was solved creatively: we sent one of our employees to carry the MX204 over in person. The funny thing is that initially we were going to ship an MX104 – a rather old, long-discontinued platform, which nevertheless would have been enough for that site's needs.

MX104 from stock – a piece of telecommunications history

But because of its bulk, the MX204 went instead – and now the Kazakhstan data center has a platform sufficient for an entire cloud machine hall rather than just our few racks. All that remains as a keepsake is a photo with a sticker from Ekb airport:

By December we had reached Europe: we now have nodes in Frankfurt and Amsterdam with rented trunk capacity, Internet access through Tier-1 operators, and connections to European exchange points.

The next logical step is to move the Amsterdam and Poland sites onto our network. Now no one will disconnect us during attacks; as a bonus, there is more Internet capacity, and dedicated channels for clients' dedicated servers have appeared – soon they will be available across the whole Cloud. As a result, you can not only order a server with 10G Internet access, but also stretch your local network to any of our points of presence – with guaranteed bandwidth and whatever settings suit you.

Since we're going location by location, I'll add that this year we launched in Moscow, in IXcellerate. It's a Tier III facility with a unique “cold wall” cooling system. I went there on a tour – probably the most interesting facility I have seen in Russia and the CIS, and I have traveled a lot. Their beer was tasty too – also a plus 🙂

Moscow, by the way, launched with a proper architecture from day one: wide links to each rack, 200G to M9, a scalable aggregation layer. By default we give every virtual server a gigabit per second instead of 200 megabits; 10G/40G is available on all dedicated servers on request. As a result, if customers need it, we can provide far more capacity than we could in St. Petersburg just six months ago.

2xQFX5120-32C in IXcellerate

Why didn't we hide behind contractors?

We did in fact approach several companies, but even then it became clear that we would not be able to use them as a blanket means of protection for the entire network.

Comprehensive DDoS protection solutions essentially come in two flavors: ready-made systems installed directly on the company's network, and third-party traffic scrubbing centers that the traffic has to be passed through. By then we had our own scrubbers and experience building that kind of system. After consulting with industry peers, we decided to go exactly this way: receive the traffic → scrub the large flows at the network level, involving scrubbing centers in particular cases → do the fine cleaning with vendor solutions.

It is important to note that to use in-house DDoS protection you need a network that can carry and filter the traffic without affecting other clients or locations. Whatever cannot be filtered there should be steered to the fine-filtering systems with minimal latency and minimal impact on clean traffic.
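As a sketch, the tiered approach from the two paragraphs above can be thought of as a simple decision ladder. The thresholds, field names and action strings below are invented for illustration; the real logic is considerably more nuanced.

```python
# Illustrative decision ladder for a detected attack, mirroring the scenario above:
# 1) if the infrastructure itself is at risk, drop the target at the upstreams (RTBH);
# 2) if the attack matches a coarse pattern, filter it on our own network gear;
# 3) otherwise divert the prefix through a scrubbing center for fine (up to L7) cleaning.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Attack:
    target: str             # attacked prefix or host
    gbps: float             # estimated volume
    pattern: Optional[str]  # e.g. "udp/123 amplification", None if unclassified

UPLINK_HEADROOM_GBPS = 400.0  # assumed spare capacity before the infra suffers

def choose_mitigation(attack: Attack) -> str:
    if attack.gbps > UPLINK_HEADROOM_GBPS:
        return f"RTBH {attack.target} at the upstreams"  # sacrifice the target
    if attack.pattern is not None:
        return f"filter '{attack.pattern}' towards {attack.target} on our edge routers"
    return f"divert {attack.target} to fine scrubbing (up to L7)"

print(choose_mitigation(Attack("203.0.113.0/24", 120.0, "udp/123 amplification")))
```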

So we first had to bring the network up to that state, and only then roll out off-the-shelf solutions. In the meantime we moved attacked subnets to partner scrubbing centers, although at times this hurt normal traffic more than it helped.

That was down to the nature of the traffic itself and the fact that the attacks were short and frequent: it is unrealistic to classify “clean” traffic in a subnet whose server composition changes from week to week, if not more often. And re-routing in the gaps between attacks is often simply impossible: the targets change faster than BGP updates propagate across the Internet.

What now

Similar attacks still happen, but on the whole we have learned to filter them: we scrub them at the network-equipment level, or deprioritize/block the targeted client if the volume exceeds thresholds.

There is still plenty of work: this whole newly built network needs redundancy and further expansion. We have taken a whole rack at M9 and are going to install an MX960 chassis there – with a large margin for the future. It will serve as a backbone interchange, terminate external connections and act as the core of our Moscow data center network; we have big plans there. And let's not forget the Northern capital: a PTX10003 will become the core of a new node at Kantemirovskaya (“Rainbow”), where it will tie together the trunks and the junctions with external networks and act as part of the traffic scrubbing infrastructure in St. Petersburg.

We are in testing with Servicepipe: we will try a system for fine-grained traffic scrubbing – if an attack on a client does not threaten the infrastructure, we don't block everything outright, but take the attack traffic on ourselves and hand the client clean traffic, scrubbed up to L7.

There is a lot of peering work ahead: we expect to set up direct connections to Google, AWS, Azure and other hyperscalers soon. We want to turn this into serious product advantages for our clients: if your product needs very good connectivity with the major clouds, or you run a multicloud setup, we can provide both good connectivity over the Internet and leased lines where necessary.

On the network side we have a very interesting offer for dedicated servers. On request we can provide up to 100-200 gigabits per server, which few providers can do. Beyond plain Internet access there is a wide range of network services: more or less standard L2/L3 VPNs over MPLS, and any manipulation of external routing you like. For those who own their own AS, we will assemble transit and bring in any popular traffic exchange point. For those who want to become such an owner, we will help obtain an ASN, rent or buy address space, and announce it to the world.

Overall, we had the competencies, the resources and the willingness to invest them in the network, plus contacts and established relationships with a huge number of people. All of that helped enormously.

And finally – an awkward question from marketing

Why didn't we start doing all this earlier – why did it take being attacked to trigger it?

The expansion as a whole was planned; Timeweb has always cared about network quality, but the process was gradual. Historically we focused on our central hub in St. Petersburg, and building out the other locations was planned to follow as they grew.

Why did the attacks become the trigger? The answer is mundane. First, we realized we had started growing much faster than planned: it became clear we needed more capacity and more services – not at some point down the road, but right now. Second, new threats emerged that standard means don't counter. At some point we outgrew the “standard” approaches available to a smaller player – we had to solve it with something custom, either with the help of a third-party company or on our own. We chose the latter.

Join our community on Telegram – there you can chat with other members, ask questions of managers, CEOs and founders, and suggest ideas.
