Over the past 20 years, software and hardware architects have tried various strategies to solve the problems associated with big data. As programmers painstakingly rewrote code to scale horizontally across multiple machines, hardware makers cram more and more transistors and cores into each sip to increase the amount of work that can be done on each machine.
As anyone who has ever taken a programming interview will attest, when there is an arithmetic and exponential progression, the geometric one will always prevail. With horizontal scaling, costs grow linearly (arithmetically). But according to Moore’s law, computing power grows exponentially (geometrically) over time. This means that you can do nothing for several years, and then scale the system vertically – and get orders of magnitude better. In twenty years, the density of transistors has increased 1000 times. This means that a task that would have required thousands of machines in 2002 can now be accomplished with just one.
With such a dramatic increase in hardware capacity, it is appropriate to ask: “Do the conditions still exist that in 2003 allowed us to solve scaling problems”? After all, we have greatly complicated our systems and introduced a lot of overhead into them. Is all this really needed? If you can now complete a task on a single machine, wouldn’t this be a better alternative?
In this post, we’ll take a closer look at why scaling-out has become the dominant practice, check if those rationales are still valid, and then explore some of the benefits of scaling-up architecture.
Why horizontal scaling is needed at all
First, let’s expand the context a bit. Twenty years ago, Google ran into scaling issues when they sent their spiders around the world wide web in an effort to index it. As a rule, they coped with this by buying more expensive cars. Unfortunately, at that time Google did not have much money, and, regardless of the costs, they still had to hit the limits of what was possible, as the “web scale” itself, in turn, grew exponentially.
At Google, in an effort to index any site in every corner of the Internet, they invented a new computational model. By applying functional programming and algorithms to distributed systems, Google became almost limitless without having to buy “fabulously expensive hardware.” Google didn’t buy big computers, but just bundled a lot of relatively small ones with cleverly designed software. It was “scale-out” to many machines, not “scale-up” to larger machines.
Google very quickly published three articles, one after the other, that completely changed the general views on how to build and scale software systems. These were the following articles: GFS (2003), which dealt with the topic of data warehousing, MapReduce (2004) on computing and BigTable (2006), which dealt with the rudiments of a database.
Doug Cutting, who implemented the techniques described in these articles and made them freely available, said: “Google jumped into the future for several years and sends us news from there.” (Unfortunately, Google has not succeeded in designing timeflights, and the main result achieved so far remains a pretty junk goat teleporter).
When I first read an article about MapReduce, I felt like Google had invented a whole new way of thinking. I joined Google in 2008 hoping to partake in that magic. Shortly thereafter, I connected to the output of their query engine. Dremel in production – and it needed to be scaled horizontally. Soon it was renamed to BigQuery.
It is difficult to overestimate the huge impact that the changes that were announced during the revolutionary transition to horizontal scaling had on the industry. Today, if you’re building a “serious” infrastructure, you’ll have to scale it out horizontally through a complex distributed system. This has led to the popularization of new technologies, such as consensus protocols, new ways of deploying software, and a more relaxed acceptance of loose consistency. Vertical scaling has survived only when working with legacy codebases that are strongly associated with older single-node architectures.
Perhaps my workload will not fit on one machine
The first and main driver of horizontal scaling is that people feel like they need many machines to process their data. Once I wrote a long fast, in which he argued that “big data is dead”, or rather, that data sets are now becoming smaller than commonly thought, and workloads are even smaller, so a significant part of the data is not used at all. If you don’t have big data, then you almost certainly don’t need scale-out architectures. I would not like to retell these arguments here.
You can start with a simple chart showing how large AWS instances have grown over time. Today, machines with 128 cores and terabytes of RAM are widely available. Same number of cores as a Snowflake XL instance, but four times the memory.
IN article o Dremel shows several benchmarks running a 3,000 node Dremel system on an 87 TB dataset. Today, equivalent performance can be obtained on a single machine.
At the time, it seemed that Dremel simply couldn’t work without relying on indexes or precomputed results. All the other database players tried to avoid table scans, but Google decided, “Nah, we’re just going to scan tables really fast, turn every query into a pass.” Throwing machines in packs to solve problems, it was possible to achieve such productivity, as if it could not have done without sorcery. Fifteen years later, you can achieve similar performance without resorting to any magic, not even distributed architectures.
In the appendix to this article, we will mathematically analyze how such a level of performance can be achieved on a single node. This chart shows different resources compared to those from the above article. The higher the bar, the better.
Here we see that a single machine can provide the performance of a 3000 node Dremel cluster under hot and warm cache conditions. This makes sense, given that many scale-out systems, and Snowflake in particular, use solid-state drives for caching in order to improve performance. If the data was “cold” (it was in an object storage like S3), we can still achieve the required performance, but for this we need an instance of a different type.
It would be possible to scale vertically, but the larger the machines, the more expensive they are
Vertical scaling is a sharp increase in costs. Want a car twice as powerful? It can cost you several times more powerful than the existing one.
In the cloud, everything runs on virtual machines, which are relatively small fractions of much larger servers. Most users don’t care what the exact sizes of these segments are, since there are few workloads that would require the entire machine to be used. These days, hardware is so powerful that cores are often counted in the hundreds, and memory is measured in terabytes.
In the cloud, you don’t have to pay extra for “cool hardware” because you already work on it. You just need a bigger slice. Cloud vendors don’t scale with size, so the cost of using a machine per unit of computing power doesn’t change, whether you’re running a tiny instance or a giant one.
This task is easier to imagine in terms of how much it costs to reach a given level of performance. Previously, larger servers cost more per unit of processing power. Today, in today’s AWS clouds, the price for a given amount of computing power remains constant until you get very large.
The second benefit of working with the cloud is that you don’t have to keep spare hardware on hand. Your cloud provider is responsible for this. If your server goes down, AWS will restart your workload on a new machine without you even noticing. In addition, the provider is constantly updating the hardware in the data center, and many important improvements are made completely without your participation.
Cloud architects also ensure that data stores and compute nodes are separated, which is why compute instances often store the least amount of data. Thus, in the event of a failure, you can very quickly raise a replacement instance, and no data will have to be reloaded. This reduces the need for hot standby.
Horizontal scaling is more reliable than vertical scaling
Scale-out architectures are generally considered more robust; they are designed to remain operational even under various failure conditions. However, systems with horizontal scaling have not been able to seriously improve reliability, and with vertical scaling they are usually distinguished by solid reliability.
External factors often play a determining role in cloud availability. For example, someone can hit the wrong key when configuring and reset the cluster size (a few years ago, this situation briefly developed in BigQuery). Network routing can break (this is how Google’s historic failure happened, in which many services collapsed at once). It may be an authorization service that you depend on, etc. The specific performance level specified in the SLA may depend primarily on these factors, and therefore, correlated failures are possible, occurring simultaneously in many systems and products.
As practice shows, providers of cloud scale-out databases and related analytics provide SLA level 4-9 (availability in 99.99% of cases). On the other hand, those who work with their own vertically scalable systems maintain a similar level of reliability over time. Many banks and other businesses use mission-critical Tier 5 and Tier 6-9 systems, and these systems are running on vertically scalable hardware.
Reliability also depends on both durability and availability. Critical arrows against vertically scalable systems are aimed, in particular, at the fact that in case of failure you will need to take a data replica from somewhere. In principle, this problem is solved by separating computing nodes and data storages. In a situation where the final destination for the data is not the machine on which it is computed, you no longer have to worry about how long the given computer will last. The underlying shared disk infrastructure offered by cloud providers such as EBS or Google Persistent Disk relies on extremely durable storage under the hood. Therefore, it is possible to guarantee a significant longevity of applications without interfering with their operation.
How to Stop Worrying and Love Single Systems
So, above, we considered that the three main arguments in favor of a horizontal approach – scalability, cost and reliability – are no longer as weighty as they were decades ago. Vertical scaling also has some forgotten benefits, which we will discuss below.
KISS: KEEP IT SIMPLE, STUPID
Simplicity. Horizontally scalable systems are much more difficult to build, deploy and maintain. While developers like to talk about the merits of Paxos compared to RAFTs or CRDTs, it’s hard to argue that with the introduction of these technologies, the system becomes much more complicated both in assembly and in support. It is difficult for us mere mortals to judge these systems, to understand how they work, what happens when they fail, and how to restore them.
Here is taken from Wikipedia network diagram, describing the “simple” way of Paxos (distributed consensus algorithm) – failures are not shown here. If you are building a distributed database and want to handle writes to more than one node, you should end up with something like this:
The protocol itself is not as important here as the fact that this is the simplest case of one of the most elementary distributed consensus algorithms. On a single-node system, these algorithms are not needed in principle.
Writing programs for distributed systems is stupidly more difficult than for single ones. In distributed databases, it is necessary to provide for the transfer of data between nodes for joins, about data alignment on certain nodes. Single-node systems are much simpler: to combine data in such a system, a hash table is simply created and shared pointers are set. There will be no unrelated failures from which you would have to recover.
The disadvantages of complexity are felt not only by programmers who build systems. Abstractions flow, so things like eventual consistency, storage sharding, and failure domains are the responsibility of developers and end users. The CAP theorem is a reality, so users of distributed systems have to skillfully navigate between consistency, availability, and also have a plan in case of network failures.
Deploying and maintaining single systems is usually much easier. Such a system is active or inactive. The more dynamic elements in it, the more problems can happen with a higher probability. In a single system, there is only one node to look for problems, and it is easier to diagnose them.
PERF: last frontier
If you have to choose between a fast and a slow option, you will almost certainly choose the fast one. Single-node systems seriously outperform distributed ones in performance. If you think about them out of context, then with each extra network transition, the system will definitely work more slowly than without it. If things like consistency protocols are added to the system, this trend will only get worse. In a single-node database, a transaction can be completed in a millisecond, while in a distributed database, such an operation can take hundreds of milliseconds.
Distributed solutions can indeed improve overall throughput when the system is dependent on the network, but they also place a significant overhead on the network for non-trivial operation. For example, if you want to merge two distributed sets of data (and they were not neatly evenly segmented), then there will be a significant delay when shuffling the data. In distributed systems, it is simply impossible to perform arbitrary joins as quickly as on a single system.
As an example of why a distributed architecture tends to limit performance, consider the following stylized architecture diagram of BigQuery:
A petabit network that connects all nodes would seem to be fast – but it’s still a bottleneck, because we have too many operations that require data to be transferred over the network. Most non-trivial queries are network dependent. In a single node system, much less data will need to be transferred, since there is no need to shuffle it.
Hope for Mr. Moore
We’ve looked at the rationale for scaling out and found that it’s far less relevant now than it once was.
Power. Today’s single node systems are huge and can handle almost any workload.
Reliability. A horizontally scalable system is not necessarily more reliable.
Price. Cloud vendors do not charge additional fees for escalating virtual machines, so growing is only more profitable.
We also discussed some of the advantages of vertical scaling:
Simplicity. Simple systems are easier to build, maintain, and improve.
Performance. A distributed system is always characterized by increased latency (especially tail latency) compared to a non-distributed system.
Let’s say now you’re not sure whether to go for horizontal scaling. But what will happen in five years, when the cars will be an order of magnitude larger?
We are ready for a new generation of solutions that will take the performance of single-node configurations to a new level. Innovation will accelerate, and you can focus on solving problems rather than coordinating complex distributed systems.
Application. Briefly about Dremel
In this appendix, we compare the results obtained in the Dremel article with running a similar workload on a large machine with modern hardware. While it would be nice to get and demonstrate a practical result, a lot can be judged from the benchmark data, including how fair the comparison is. We will demonstrate what modern hardware can do.
Authors articles worked with a 3000 node Dremel cluster. For reference: such an array of equipment in BigQuery would cost you over $1M per year. Compare it with instance i4i.metal on AWS, which costs $96k per year, has 128 cores and 1 terabyte of RAM. Let’s run them “on neighboring tracks” and see who wins.
Here is an excerpt from an article that describes how breakpoints were calculated for comparison with MapReduce:
Task 1: SELECT SUM(CountWords(txtField)) / COUNT
The following figure shows on a logarithmic scale how long the two MapReduce and Dremel jobs take to complete. Both MapReduce jobs are distributed among 3000 workers. At the same time, a 3,000-node Dremel instance is used to complete Task 1. Dremel and the column version of the MR each read about 0.5 TB of compressed column data – compare this figure with 87 TB read by the line version of the MR.
In the Dremel article, the main performance result showed that the system was able to scan and aggregate a query of over 85 billion records in about 20 seconds, and half a terabyte of data was read in the process. Orders of magnitude faster than systems based on MapReduce. In addition, all this far exceeds the results that could be achieved on traditional vertically scalable systems of the time.
To match the performance levels given in this article, you would need to scan 4.5 billion rows and read 25 gigabytes of data per second.
The SingleStore blog once had a post claiming that their system could scan over 3 billion rows per second per core. Accordingly, on a machine the size of our I4i, 384 billion lines per second could be processed – that is, almost two orders of magnitude more than to keep up with the Dremel. Even if it takes 50 times more processing power to count the words in a text box, the power headroom is still comfortable enough.
The data transfer bandwidth on a single server is likely to be in the terabytes per second, so no problems are expected here. Since the data is in memory “at hand”, we should easily read 500 GB in 20 seconds. The columns used in the query will take up about half the memory on the machine, so if we cache them ahead of time, we have about half a terabyte of memory available for processing data or for storing an inactive cache. But it seems that this is not entirely fair, because here it is required to know in advance exactly how many columns will be needed to store the query in memory.
But what if we store the data we need on a local SSD? In many databases, such as Snowflake, local SSDs serve as online storage for local data. I4i servers have a total of 30 TB of non-volatile express memory in SSD. Thus, 30 times more data can be cached on an SSD than in memory, and 60 times more than for the query considered here. It is quite rational to try to cache the active columns of this query on the SSD.
If capacity isn’t an issue, what about throughput? Drives with non-volatile express memory are fast, but are they fast enough? In the cases described, they allow a total of 160 thousand IOPS operations per second, where the maximum size of each operation is 256 KB. Thus, it is possible to read at a speed of 40 GB / s, that is, 25+ times more than we need. The stock is not so large, but the model is workable.
Finally, what if we wanted to work in “cold” mode, where no data is cached? After all, one of the main strengths of the Dremel was the ability to read data directly from object storage. Here we run into a limitation: the network capacity of an I4i instance is limited to 75 gigabits/s, or approximately 9 GB/s. This is about a third of what we would need to read directly from object storage.
There are instances that have much higher throughput (in terms of memory): for example, TRN1 instances have 8 100-gigabit network adapters. Thus, it is possible to work at a speed of 100 GB / s, which far exceeds our need.
Of course, the mere fact that the necessary hardware is available on the machine does not mean that it will be uniformly available, and that performance will grow linearly with the increase in the number of cores. Operating systems do not always cope well with multiple cores, locks do not scale well, and programs need to be written very smartly so as not to crash into a wall somewhere.
The point is not to compare the relative efficiency of different systems; after all, these measurements were made 15 years ago. But, I hope, I was able to show that when approaching workloads to a dataset of about 100 TB in size, it makes more sense to operate them on an undistributed system.