By some estimates, Twitter has lost approximately 80% of its employees. Whatever the real number is, there are teams whose developers have disappeared entirely. Yet the website keeps running and tweets keep getting posted. Because of this, many people are wondering what all those developers were doing, and whether the company was simply overstaffed. I would like to talk about my own little corner of Twitter (however small it was), and about the work that went into keeping the system running.
Some background and history
I worked as a Site Reliability Engineer (SRE) at Twitter for five years, and for four of those years I was the only SRE on the cache team. There were a few people before me, and for a while I worked alongside a full team whose members came and went. But for four years I was the one responsible for the team's automation, reliability, and operations. I designed and implemented most of the tools that kept everything running, so I feel qualified to talk about it. (Probably only one or two other people have comparable experience.)
A cache can be used to speed things up, or to reduce the load on a subsystem that is more expensive to run. If you have a server that takes one second to respond, but it returns the same response every time, you can store that response on a cache server, which can return it in milliseconds. Or, if a cluster costs $1,000 to handle 1,000 requests per second, you can put a cache in front of it to store and serve those responses. You then end up with a smaller $100 cluster plus a large, cheap cache cluster for roughly another $100. (The numbers are just examples to illustrate the principle.)
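The principle above can be sketched in a few lines. This is a minimal illustration, not Twitter's actual cache: the backend function, key names, and timings are all made up, and the "cache server" is just an in-memory dictionary.

```python
import time

# Hypothetical slow backend: stands in for a server that takes ~1 second
# to compute the same response every time (sleep shortened to keep it fast).
def expensive_lookup(key):
    time.sleep(0.01)
    return f"response-for-{key}"

cache = {}  # toy "cache server": key -> stored response

def cached_lookup(key):
    # Serve from the cache when we can; fall back to the backend on a miss.
    if key not in cache:
        cache[key] = expensive_lookup(key)  # miss: pay the full cost once
    return cache[key]  # hit: returned from memory, effectively instantly

first = cached_lookup("tweet:42")   # miss, goes to the backend
second = cached_lookup("tweet:42")  # hit, served straight from the cache
```

Every hit after the first call skips the expensive backend entirely, which is where both the latency win and the cost win come from.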
Caches absorb most of the traffic the servers deal with. Tweets, user timelines, direct messages, ads, authentication: all of it is served from the cache team's servers. If something goes wrong with the cache, you, as a user, notice it immediately.
When I joined the team, my first project was to replace old machines with new ones. There was no tool or automation for this; I was simply handed a spreadsheet with server names. I am pleased to say that working on this team is completely different now!
How the cache works
The first important point in keeping caches running is that they run as Aurora tasks on Mesos. Mesos aggregates all the servers so that Aurora knows about them, and Aurora finds servers to run applications on and keeps those applications running once launched. Say a cache cluster needs 100 servers: Aurora will do its best to keep all 100 working. If a server breaks down completely, Mesos detects this and removes it from the aggregated pool; Aurora then sees that only 99 cache tasks are running and finds a new server in the pool to launch a replacement on, bringing the total back to 100. No human intervention is required.
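The Aurora/Mesos behavior described above is essentially a control loop: compare the desired count with reality and schedule replacements. Here is a heavily simplified sketch of that idea; the host names, the flat set model, and the `reconcile` function are my own illustration, not any real Aurora API.

```python
DESIRED = 100  # desired instance count for a cache cluster

def reconcile(running, pool):
    """One pass of an Aurora/Mesos-style control loop (simplified sketch).

    `running` is the set of servers currently hosting cache tasks;
    `pool` is the set of healthy servers the resource manager knows about.
    """
    # Drop tasks on servers that are no longer reported as healthy.
    running = {s for s in running if s in pool}
    # Schedule replacements from the remaining free capacity.
    free = [s for s in sorted(pool) if s not in running]
    while len(running) < DESIRED and free:
        running.add(free.pop())
    return running

pool = {f"host{i}" for i in range(120)}
running = {f"host{i}" for i in range(100)}
pool.discard("host7")               # host7 dies; it vanishes from the pool
running = reconcile(running, pool)  # the loop tops the cluster back up
```

After the pass, the dead host is gone from the cluster and the count is back at 100, with no human in the loop.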
In a data center, servers are mounted in racks. The servers in a rack are connected to the rest of the network through a rack switch, which in turn feeds a whole web of switches and routers that eventually connects to the Internet. A rack holds roughly 20 to 30 servers. A rack can fail: its switch can die, or its power supply can burn out, taking down all 20-odd servers at once. Another convenience of Aurora and Mesos is that they ensure that not too many instances of an application end up in a single rack. That way, if an entire rack suddenly goes down, Aurora and Mesos find new servers to become the new home for the applications that were running there.
The spreadsheet mentioned above also tracked how many servers were in each rack, and its maintainer tried to make sure there were never too many. With the new tools, all of this is tracked automatically as new servers are brought into service: they ensure that the team never has too many physical servers in one rack, and that everything is spread out so that random failures don't cause problems.
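The rack-diversity check the tooling performs can be sketched as a simple count per rack. The limit, server names, and `rack_violations` helper below are illustrative, a toy model of what the old spreadsheet tracked by hand.

```python
from collections import Counter

MAX_PER_RACK = 3  # illustrative limit; real limits depend on cluster size

def rack_violations(placement, limit=MAX_PER_RACK):
    """Return racks holding more of a cluster's servers than the limit allows.

    `placement` maps server name -> rack name.
    """
    per_rack = Counter(placement.values())
    return {rack: n for rack, n in per_rack.items() if n > limit}

placement = {
    "cache01": "rack-a", "cache02": "rack-a", "cache03": "rack-a",
    "cache04": "rack-a",  # fourth server in rack-a: over the limit
    "cache05": "rack-b",
}
bad = rack_violations(placement)  # flags rack-a as too concentrated
```

A check like this, run whenever servers are provisioned, is what keeps a single rack failure from taking out a meaningful slice of one cluster.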
Unfortunately, Mesos does not detect every server failure, so we needed additional monitoring for hardware problems. We watched for things like failing disks and faulty memory. Some of these problems don't crash the whole server, but they can slow it down. We had an alert dashboard that was scanned for broken servers; when one was found, we automatically created a repair ticket for a data center technician to fix the problem.
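The scan-and-ticket step might look something like the following. This is a guess at the shape of such a tool, not the team's actual code: the fault strings, report format, and `scan_for_repairs` function are all invented for illustration.

```python
def scan_for_repairs(health_reports, open_tickets):
    """File repair tickets for servers reporting hardware faults (sketch).

    `health_reports` maps server -> list of fault strings (empty = healthy);
    `open_tickets` is the set of servers that already have a ticket.
    Returns the newly filed tickets.
    """
    new_tickets = []
    for server, faults in health_reports.items():
        if faults and server not in open_tickets:
            new_tickets.append({"server": server, "faults": faults})
            open_tickets.add(server)  # avoid filing duplicates on the next scan
    return new_tickets

reports = {
    "cache01": [],                      # healthy
    "cache02": ["disk: smart errors"],  # degraded: may only be slow, not down
    "cache03": ["dimm: ecc errors"],
}
tickets = scan_for_repairs(reports, open_tickets=set())
```

Note that `cache02` gets a ticket even though it is still up; catching servers that are merely slow is exactly what Mesos alone would miss.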
Another important piece of software the team had was a service that tracked the uptime of the cache clusters. If too many servers in a cluster had gone down within a short period, new requests to drain tasks off cache servers were rejected until the situation was safe again. This is how we protected ourselves from taking down entire cache clusters at once, and from overloading the services the caches shield. We had limits for when too many servers went down quickly, when too many were under repair at the same time, and when Aurora could not find new servers to host old tasks. Before creating a repair ticket for a broken server, we first asked this service whether it was safe to drain the tasks off it; once the server was drained, it was marked so that the data center technician could safely work on it. After the technician marked the server as repaired, our tracking tools automatically brought it back into service to run tasks. The only human needed was the one repairing the server in the data center. (I wonder whether those are still humans?)
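The safety check at the heart of that service boils down to comparing current cluster state against configured limits. The thresholds and the `safe_to_drain` function here are illustrative, a minimal sketch of the idea rather than the real service.

```python
def safe_to_drain(down, in_repair, limits):
    """Decide whether one more cache server may be drained (sketch).

    Drains are refused when too many servers are already down or
    already out for repair. All thresholds are made up.
    """
    if down >= limits["max_down"]:
        return False  # too many recent failures: the cluster is not safe
    if in_repair >= limits["max_in_repair"]:
        return False  # too much capacity is already out for repairs
    return True

limits = {"max_down": 5, "max_in_repair": 10}
ok = safe_to_drain(down=1, in_repair=2, limits=limits)       # healthy: allowed
blocked = safe_to_drain(down=6, in_repair=2, limits=limits)  # unsafe: refused
```

The point of the gate is that automation which removes capacity must itself be throttled, otherwise the repair pipeline could amplify an outage instead of containing it.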
We also fixed recurring application issues. We had bugs where new cache servers weren't added back into the pool (a startup race condition), and cases where adding a server back took up to 10 minutes (O(n^n) logic). Because all this automation kept us from drowning in manual work, we were able to build a culture within the team of fixing such issues without stalling our projects. We had other automatic remediations too: for example, if some application metric went out of bounds (say, latency), we automatically restarted the task so that no troubleshooting ticket ever reached an engineer. The team received about one ticket a week, and it was almost never critical. An on-call shift often passed without a single ticket.
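The restart-on-bad-metric remediation can be sketched as a threshold check over per-task metrics. The metric name, threshold, and `auto_remediate` function are hypothetical; in a real system the restart would go through the scheduler rather than returning a list.

```python
def auto_remediate(task_metrics, p99_limit_ms=50):
    """Pick tasks whose p99 latency crosses a threshold (sketch).

    Instead of paging an engineer, the automation restarts the
    offending task first. Threshold and metric names are illustrative.
    """
    restarted = []
    for task, p99 in task_metrics.items():
        if p99 > p99_limit_ms:
            restarted.append(task)  # in production: tell the scheduler to restart
    return restarted

metrics = {"cache-task-0": 12.0, "cache-task-1": 180.0, "cache-task-2": 9.5}
to_restart = auto_remediate(metrics)  # only the slow task gets restarted
```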
One of the most important reasons the site didn't fall over was capacity planning. Twitter had two data centers, each able to carry the load of the entire site on its own. Every important service could run out of a single data center, so the total available capacity at any given time was 200% of what was needed. This was only necessary for catastrophic scenarios; most of the time, both data centers served traffic, and each was loaded to at most 50% of capacity. In practice, even that counts as a heavy load. When calculating capacity requirements, you determine what one data center would need to handle all the traffic, and then usually add headroom on top of that! As long as no emergency failover is needed, there is a large reserve of servers for extra traffic. The failure of an entire data center is quite rare; it happened only once in my five years there.
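The capacity arithmetic described above is simple to write down. All of the numbers below are invented for illustration; the point is only the shape of the calculation: size one data center for the full load plus headroom, then multiply by the number of data centers.

```python
import math

def servers_to_order(peak_rps, rps_per_server, headroom=0.5, datacenters=2):
    """Back-of-the-envelope capacity plan (all figures illustrative).

    Each data center must absorb the full site load alone (so two
    together provide ~200% of need), plus a headroom reserve on top.
    """
    per_dc = math.ceil(peak_rps * (1 + headroom) / rps_per_server)
    return per_dc * datacenters

total = servers_to_order(peak_rps=1_000_000, rps_per_server=20_000)
# 1M rps with 50% headroom at 20k rps per server -> 75 per DC, 150 in total
```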
We also kept cache clusters separated. We had no multi-tenant clusters that handled everything with isolation only at the application level. Thanks to this, if problems occurred in one cluster, the blast radius covered only that cluster and perhaps a few neighboring tasks on the same machines. Aurora helps here by spreading cache tasks out: a problem affects only a small fraction of the devices, and monitoring gives you time to fix it.
So what did I do there?
All of the above! I also talked with customers (the teams that used the cache as a service). Once I had automated something, I automated further. I also worked on interesting performance problems, experimented with technologies that could improve the system, and drove several large cost-reduction projects. I did capacity planning and determined how many servers to order. Contrary to what people might imagine, I wasn't being paid to play video games and drink coffee all day.
That is how we kept the caches serving Twitter's requests alive, and it is only part of what the daily work consisted of. It took years of effort to reach that state. And now we can give the system its due: the whole thing keeps on working!
At least for now. I'm sure there are bugs lurking in there somewhere…