Local TCP Anycast is really hard

Pete Lumbis and Network Ninja mentioned an interesting use case for UCMP in the data center in their comments on my UCMP post: anycast servers.

Here is a typical scenario they mentioned: a group of servers, randomly connected to multiple leaf switches, offers a service on the same IP address (hence anycast).

Before getting into details, let’s ask ourselves a simple question: Does this work outside of PowerPoint? Certainly. This is the ideal design for a scalable UDP service such as DNS, and large DNS server farms are typically built this way.

The complexities of TCP Anycast

Now for a really interesting question: does this work for TCP services? Here we get to the really tricky part: since spine and leaf switches do ECMP or UCMP toward the anycast IP address, someone has to keep track of session-to-server assignments, otherwise chaos ensues.

Please note that what we are discussing here is completely different from WAN anycast, which works very well and is widely used. It is almost impossible to end up with equal-cost paths to two different sites across the Internet.

It is easy to see that this design works in a steady state. Data center switches do 5-tuple load balancing, so every packet of a session is consistently forwarded to the same server. Problem solved … until a link or node fails.

Link and node failures

Most production ECMP implementations use hash buckets (more details). If the number of next-hops (adjacent routers) changes due to a topology change, the hash buckets are reassigned, sending most of the traffic to a server that has no idea what to do with it. Modern ECMP implementations avoid this with consistent hashing, which does not recompute all hash buckets after a topology change:

Hash buckets for valid next-hops are not touched.

Invalid hash buckets (due to invalid next-hop) are reassigned to valid next-hops.

Even with consistent hashing we will obviously get some misdirected traffic, but those sessions are hopelessly lost anyway: they were connected to a server that is no longer reachable.
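The difference between recomputing all hash buckets and repairing only the invalid ones can be sketched in a few lines of Python. This is a toy model, not any vendor's implementation: the bucket count, the server names, the round-robin bucket layout, and the use of SHA-256 as a stand-in for the hardware hash are all assumptions made for illustration.

```python
import hashlib

NUM_BUCKETS = 64  # hypothetical size of the ECMP bucket table


def bucket_of(flow):
    """Map a 5-tuple to a bucket (SHA-256 stands in for the hardware hash)."""
    digest = hashlib.sha256(repr(flow).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS


def build_table(next_hops):
    """Spread the buckets round-robin over the given next-hops."""
    return [next_hops[i % len(next_hops)] for i in range(NUM_BUCKETS)]


def repair_consistent(table, failed):
    """Reassign only the buckets that pointed to the failed next-hop."""
    survivors = sorted(set(table) - {failed})
    repaired = list(table)
    k = 0
    for i, next_hop in enumerate(table):
        if next_hop == failed:
            repaired[i] = survivors[k % len(survivors)]
            k += 1
    return repaired


def misdirected(old_table, new_table, flows, failed):
    """Count sessions to surviving servers that now reach a different server."""
    count = 0
    for flow in flows:
        b = bucket_of(flow)
        if old_table[b] != failed and old_table[b] != new_table[b]:
            count += 1
    return count


# Hypothetical servers and client flows (documentation address ranges).
servers = ["s1", "s2", "s3", "s4"]
flows = [("198.51.100.%d" % (i % 200 + 1), 20000 + i, "192.0.2.1", 443, "tcp")
         for i in range(1000)]

old_table = build_table(servers)
naive_table = build_table([s for s in servers if s != "s2"])  # full recomputation
consistent_table = repair_consistent(old_table, "s2")

print("misdirected, full recomputation:",
      misdirected(old_table, naive_table, flows, "s2"))
print("misdirected, consistent hashing:",
      misdirected(old_table, consistent_table, flows, "s2"))
```

In this model the only sessions that move under the repair approach are the ones that were pinned to the failed server, which is exactly the traffic that was lost anyway.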

Adding new servers

The fun begins when you try to add a server. To do this, the last-hop switch must take a few buckets from each valid next-hop and assign them to the new server. It is very difficult to do this without disrupting existing sessions. Even waiting for a bucket to become idle (the flowlet load balancing approach) does not help: an idle bucket does not mean that there is no active TCP session using it.
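A small Python model shows why waiting for idle buckets is not enough. Everything here is an assumption for illustration: the bucket count, the idle timeout, the session list with its timestamps, and the helper names. The new server only receives buckets that carried no packet recently, yet open-but-quiet sessions still end up on a server that has never seen them.

```python
import hashlib

NUM_BUCKETS = 64    # hypothetical size of the ECMP bucket table
IDLE_TIMEOUT = 0.5  # hypothetical flowlet gap, in seconds


def bucket_of(flow):
    """Map a 5-tuple to a bucket (SHA-256 stands in for the hardware hash)."""
    digest = hashlib.sha256(repr(flow).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS


def idle_buckets(sessions, now):
    """Buckets that carried no packet during the last IDLE_TIMEOUT seconds."""
    busy = {bucket_of(flow) for flow, last_seen in sessions
            if now - last_seen < IDLE_TIMEOUT}
    return [b for b in range(NUM_BUCKETS) if b not in busy]


def add_server(table, new_server, candidates):
    """Give the new server a fair share of buckets, taken only from candidates."""
    target = len(table) // (len(set(table)) + 1)
    new_table = list(table)
    for b in candidates[:target]:
        new_table[b] = new_server
    return new_table


servers = ["s1", "s2", "s3"]
table = [servers[i % len(servers)] for i in range(NUM_BUCKETS)]

# 200 established TCP sessions; only every fifth one sent a packet recently.
now = 100.0
sessions = []
for i in range(200):
    flow = ("198.51.100.%d" % (i % 200 + 1), 20000 + i, "192.0.2.1", 443, "tcp")
    last_seen = now - 0.01 if i % 5 == 0 else now - 30.0
    sessions.append((flow, last_seen))

new_table = add_server(table, "s4", idle_buckets(sessions, now))

# Sessions whose bucket changed owner: the new server has no TCP state for them.
broken = [flow for flow, _ in sessions
          if table[bucket_of(flow)] != new_table[bucket_of(flow)]]
print("buckets moved to s4:", new_table.count("s4"))
print("established sessions now hitting s4:", len(broken))
```

The sessions with recent traffic are untouched in this model, but the quiet ones that happened to live in an idle bucket are reset the next time the client sends a packet.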

And finally, there is ICMP: ICMP replies include the original TCP/UDP port numbers, but no hardware switch is able to dig that deep into the packet, so the ICMP reply is usually sent to some random server that has no idea what to do with it. Welcome to PMTUD chaos.
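The ICMP problem comes down to which fields end up in the hash key. The sketch below is a simplified model under stated assumptions: packets are plain dictionaries with hypothetical field names, SHA-256 stands in for the switch hash, and `deep_key` is an imaginary function showing what a device would have to do, not something a hardware switch offers.

```python
import hashlib


def pick_server(key, servers):
    """Pick a server from a hash key (SHA-256 stands in for the switch hash)."""
    digest = hashlib.sha256(repr(key).encode()).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


def switch_key(packet):
    """Fields a switch hashes on: outer addresses, protocol, and ports if any."""
    if packet["proto"] in ("tcp", "udp"):
        return (packet["src"], packet["dst"], packet["proto"],
                packet["sport"], packet["dport"])
    return (packet["src"], packet["dst"], packet["proto"])  # ICMP has no ports


def deep_key(packet):
    """Key built from the header embedded in an ICMP error (hypothetical)."""
    if packet["proto"] == "icmp" and "embedded" in packet:
        inner = packet["embedded"]  # header of the packet the server sent
        # Reverse it to get the client-to-server direction of the session.
        return (inner["dst"], inner["src"], inner["proto"],
                inner["dport"], inner["sport"])
    return switch_key(packet)


servers = ["s1", "s2", "s3", "s4"]

# Client-to-server TCP segment of an established session.
tcp_packet = {"src": "198.51.100.7", "dst": "192.0.2.1", "proto": "tcp",
              "sport": 40000, "dport": 443}

# ICMP "fragmentation needed" sent by a router on the path (203.0.113.1).
# It embeds the header of the server-to-client packet that was too big.
icmp_error = {"src": "203.0.113.1", "dst": "192.0.2.1", "proto": "icmp",
              "embedded": {"src": "192.0.2.1", "dst": "198.51.100.7",
                           "proto": "tcp", "sport": 443, "dport": 40000}}

print("TCP session goes to:      ", pick_server(switch_key(tcp_packet), servers))
print("ICMP error, switch view:  ", pick_server(switch_key(icmp_error), servers))
print("ICMP error, embedded view:", pick_server(deep_key(icmp_error), servers))
```

The switch key for the ICMP error has nothing in common with the key of the TCP session, so the two only reach the same server by luck. The key derived from the embedded header always matches the session, but building it requires parsing past the ICMP header.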

Making Local TCP Anycast work

Does this mean it is impossible to do local TCP anycast load balancing? Of course not: every hyperscaler data center uses this trick to implement scalable network load balancing. Microsoft engineers wrote about their solution in 2013, Fastly documented theirs in 2016, Google has Maglev, Facebook open-sourced Katran, and we know AWS has Hyperplane, but all we got from the re:Invent videos was that it is amazing magic. They shared a few more details at the Networking @Scale 2018 conference, but the description still stayed at the altitude of the Kármán line.

You can do something similar on a much smaller scale with a cluster of firewalls or load balancers (assuming your vendor can count beyond two active nodes), but the performance of network services clusters is usually far from linear: the more boxes you add to the cluster, the less performance you gain with each additional box, due to cluster-wide state maintenance.

There are at least a few open-source software solutions that can be used to build large-scale anycast TCP services. If you do not feel comfortable using hot new technologies like XDP, there is BalanceD from Demonware, which uses Linux IPVS.

On the more academic side, there is Cheetah… and in a bright future, we might get a pretty optimal solution resembling a session layer with Multipath TCP v1.
