The Netflix Cloud Data Engineering team is happy to open-source s3-flash-bootloader, our tool for performing in-place OS image upgrades on stateful cloud instances by substituting a new AMI for the old one. In this post, I will talk about what inspired us to develop this tool and discuss how it has made updating Cassandra and Elasticsearch an order of magnitude faster.
At the end of 2017, I was developing an AMI for Cassandra benchmarking, which runs on stateful instances. Stateless instances have fast iteration times because we can simply deploy new instances to replace the old ones. In contrast, stateful Cassandra instances were slow to deploy because each new instance needed to copy all the data from the old one. I wondered whether I could achieve stateless-like agility with instances whose data lives on ephemeral storage. I started looking for ways to bake AMIs as usual but deploy them to already-running EC2 instances. In other words: is there a way to overwrite the old AMI with the new one?
An EC2 AMI is essentially just an EBS snapshot, packaged along with some additional metadata about how to launch an instance (there are also "instance-store" AMIs backed by objects in S3, but they are less common). When an instance launches, EC2 creates a new EBS volume from the snapshot, and that EBS volume becomes the virtual machine's root filesystem. In my first attempt, I tried to find a way to swap the EBS root device of a newly launched instance onto the old instance. This approach didn't work: EC2 (reasonably) doesn't allow you to detach the EBS root device from a running instance. I also couldn't stop the instance, because stopping it would deprovision its ephemeral storage, which would then require a lengthy data transfer from other cluster members to restore. The goal here was speed, so any solution that required moving terabytes of data was unacceptable.
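As an aside, the snapshot backing an EBS AMI is easy to inspect. The sketch below is illustrative, not part of our tooling: the helper name is mine, and the sample dict mimics the shape of an `Image` record as returned by the EC2 `DescribeImages` API.

```python
def root_snapshot_id(image):
    """Return the snapshot ID backing the root device of an EC2 Image
    dict (the shape returned by DescribeImages), or None for
    instance-store AMIs, which have no EBS-backed root device."""
    root = image.get("RootDeviceName")
    for mapping in image.get("BlockDeviceMappings", []):
        if mapping.get("DeviceName") == root and "Ebs" in mapping:
            return mapping["Ebs"].get("SnapshotId")
    return None

# Illustrative Image record; field names follow the EC2 API.
ami = {
    "ImageId": "ami-0abc1234",
    "RootDeviceType": "ebs",
    "RootDeviceName": "/dev/xvda",
    "BlockDeviceMappings": [
        {"DeviceName": "/dev/xvda",
         "Ebs": {"SnapshotId": "snap-0def5678", "VolumeSize": 10}},
    ],
}
```

Flashing that snapshot's bytes onto a running instance's root device is what lets the instance "become" the new AMI without relaunching.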
I then looked for ways to change the contents of the EBS root volume to exactly match the newly built AMI. One class of approaches uses configuration management tools (e.g. Puppet, Chef, Ansible) to make the necessary changes. The problem with these tools is that they introduce drift: they cannot guarantee that the machine state they produce will exactly match the AMI that passed through our build and test pipeline.
I ended up choosing a simpler approach: a script that reboots the instance into an in-memory operating system (similar to a Linux LiveCD). On startup, the bootloader reaches out to S3, downloads an image that is byte-for-byte identical to the EBS snapshot contained in the baked AMI, and "flashes" it onto the root device. After this process, the filesystem is indistinguishable from the root device of a freshly launched AMI. Reboot once more and the machine comes up in the new OS image instead of the old one.
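At its core, the flash step is nothing more than a streaming byte copy from the S3 object onto the root block device. Here is a minimal sketch, not the actual s3-flash-bootloader code: the function name is mine, and in-memory streams stand in for the S3 download and the device.

```python
import io

CHUNK = 1024 * 1024  # copy in 1 MiB chunks to keep memory use flat

def flash(image_stream, device, chunk_size=CHUNK):
    """Stream an OS image onto a block device.

    In the real bootloader, `image_stream` would be the S3 object body
    and `device` an open handle on the root disk; any binary streams
    work here. Returns the number of bytes written.
    """
    written = 0
    while True:
        chunk = image_stream.read(chunk_size)
        if not chunk:
            break
        device.write(chunk)
        written += len(chunk)
    device.flush()
    return written

# In-memory stand-ins for the S3 image and the root device:
payload = b"\x00" * (3 * CHUNK) + b"trailing-bytes"
src = io.BytesIO(payload)
dst = io.BytesIO()
n = flash(src, dst)
```

Because the copy is byte-for-byte, whatever invariants the AMI build pipeline verified (checksums, package sets, kernel versions) hold on the flashed instance too.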
When I presented my benchmark results to the team, I mentioned how I flashed AMIs to increase iteration speed, and that this could be a useful technique for other team members when integration-testing their own AMIs. My colleague Joey Lynch replied: "Forget about using this in dev, what if we used this in prod!"
Separately from my work, Joey and our Cassandra team were evaluating ways to update the OS on our fleet of tens of thousands of instances. Netflix was moving from Ubuntu Trusty to Ubuntu Xenial (see also Ed Hunter's talk at LISA18). In addition, the team wanted to roll out a new version of the datastore itself. Given our scale, we wanted to minimize the human involvement required to make this transition, and (of course) we wanted to do it with as little risk as possible.
Before moving on to an in-place upgrade, we looked at two other classes of approaches to solving the OS upgrade problem:
We had previously used Ubuntu's built-in upgrade capabilities at much smaller scale, but the idea of using them across a fleet of our size was unappealing. We would have to divert nodes from production traffic while each one updated, and each node would take longer to upgrade in place than to simply flash a 10 GB OS image. Moreover, there was no guarantee that the upgraded servers would end up with the same configuration as freshly launched ones: each upgraded server would be its own unique snowflake.
Moving data to new instances was attractive because it would let us make software and hardware changes in one step. Unfortunately, this approach is riskier and more resource-intensive than leaving the data in place and changing the operating system underneath it. The risk increases because data movement creates the possibility of data corruption, while in-place flashing leaves the data untouched. In terms of resources, an EC2 i3.8xlarge instance can store up to 7,600 GB and has a 10 Gb/s NIC; even if we fully saturated the instance's network card, transferring the dataset would take 1 hour and 41 minutes. In the end, we concluded that data-migration tooling is desirable, but we didn't want to block OS upgrades on a separate project to enable hardware replacement.
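The 1 hour 41 minute figure falls straight out of the line-rate arithmetic; a quick sanity check (best case: decimal gigabytes over a fully saturated NIC, ignoring protocol overhead):

```python
def transfer_time_seconds(data_gb, nic_gbps):
    """Best-case transfer time for `data_gb` decimal gigabytes
    over a fully saturated `nic_gbps` link."""
    return (data_gb * 1e9 * 8) / (nic_gbps * 1e9)

secs = int(transfer_time_seconds(7600, 10))  # 6080 seconds
hours, rem = divmod(secs, 3600)
minutes = rem // 60                          # 1 hour 41 minutes
```

In practice the copy also competes with production traffic on the same NIC, so the real-world number is worse than this lower bound.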
Note: This is not a unique situation.
Upgrading a machine by flashing a new disk image over the old one is somewhat unusual in the cloud, but quite typical in other areas of computing. Many kinds of embedded devices and network equipment take all their updates this way. At least two popular operating systems, ChromeOS and CoreOS, use it as their primary update mechanism (and CoreOS runs in the cloud!).
Most systems that update by flashing use two partitions, an active one and a spare. This lets them flash the image onto the spare partition without rebooting into an in-memory operating system, but it requires extra bootloader functionality to switch between the two partitions. No one else at Netflix uses that kind of functionality, so building it would put our team on a dead-end support path. Such changes would also break our goal of byte-for-byte fidelity between the root device of an upgraded instance and the root device of a freshly launched AMI. In fact, the only clue that the upgrade happened in place is an EC2 tag on the instance that we set for our infrastructure audit systems to use.
When a machine reboots, it unfortunately loses the OS page cache it used to speed up disk reads. Our instances rely on hot page-cache data to meet baseline latencies, so starting with an empty page cache would mean minutes to hours of SLO-violating latencies. To avoid this problem, we used a tool I wrote called happycache, which dumps the locations of cached pages to disk before the reboot. As soon as the system comes back up, happycache reloads the pages that were previously cached. This significantly reduces the latency impact of a reboot.
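The reload half of this idea can be sketched with nothing more than `posix_fadvise`. To be clear, this is not happycache's actual implementation: it assumes a hypothetical dump format of `(path, offset, length)` ranges and simply asks the kernel to prefetch each one.

```python
import os
import tempfile

def warm_pages(entries):
    """Hint the kernel to prefetch previously-cached file ranges.

    `entries` is an assumed dump format: (path, offset, length)
    tuples. POSIX_FADV_WILLNEED schedules readahead into the page
    cache without blocking on the actual I/O.
    Returns how many files were successfully hinted.
    """
    warmed = 0
    for path, offset, length in entries:
        try:
            fd = os.open(path, os.O_RDONLY)
        except OSError:
            continue  # the file may have vanished since the dump
        try:
            os.posix_fadvise(fd, offset, length, os.POSIX_FADV_WILLNEED)
            warmed += 1
        finally:
            os.close(fd)
    return warmed

# Tiny self-check against a scratch file standing in for real data;
# the second entry exercises the missing-file path.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * 8192)
    scratch = f.name
warmed = warm_pages([(scratch, 0, 4096), ("/no/such/file", 0, 4096)])
os.unlink(scratch)
```

The harder half, deciding *which* ranges to dump, needs `mincore(2)` or similar to discover which pages of each file are resident before the reboot.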
Conclusion: how did it go?
Once we had polished our tooling, we could upgrade nodes at 10-minute intervals while still serving production traffic. Only about 5 minutes of that time was the upgrade itself; the rest was spent gracefully draining the host from production traffic, booting the new OS, and reloading cached pages from disk with happycache.
We completed our first upgrade within a few weeks, successfully migrating from Trusty to Xenial and rolling out the new Cassandra distribution at the same time. We now use this technique to routinely update both Cassandra and Elasticsearch. Having this tool enabled us to build an integrated pipeline that tests new versions of our AMIs and deploys them when they pass, providing true CI/CD for stateful services at Netflix. For me, though, the biggest achievement was being able to tell the security team: "whatever the patch, we can fully roll it out within 24 hours."
Thanks to Joey Lynch for his work in creating s3-flash-bootloader and for first pointing out how useful this technique could be in production.
Thanks to Harshad Fadke for adapting s3-flash-bootloader for use with Elasticsearch.