Choosing a database, using a URL shortener as an example

It’s a little scary to think that a few years ago, when Kubernetes adoption was growing to today’s scale, people suggested and even tried to deploy databases inside it, with volumes bolted on right next to their applications. When designing highly loaded systems, even ones with minimal business logic, you sometimes find yourself considering bare-metal deployments and comparing them with virtualization (in some companies, second-order virtualization). To get away from such thoughts, I decided to work out for myself how to organize something simple but scalable, something that a requirement of 1 request per second definitely does not cover.

URL shorteners, or how to push through more than a couple of requests per hour

Disclaimer: I will not delve into the application’s business logic; it is quite trivial, and the question here is about the principles of storing data, not processing it.

Looking at the Internet in its current form and thinking of advertising campaigns, convenient short links for chats or SMS messages that hide kilometers of UTM tags or homegrown referral parameters for data satanists (data scientists), short links definitely have their place. Services like bit.ly or tiny.url even manage to funnel passing visitors through their ad-laden front pages along the way. How can this be organized?

Let’s pretend that we are our own business and put forward our own requirements. What should the service be able to do?

  • Shorten links to an acceptable state

  • Count the number of clicks on links (keeping statistics such as IP addresses, User-Agents and click time)

  • Work fast.

Asking questions about business entities:

  • How long should a link live? Forever, until we decide to remove it by hand

  • Can I create my own custom links? Let’s say it is possible, within the ID length limit of 8 characters

  • How many link creation requests do we expect per month? Let’s say the number is 100 million

  • What should the uptime be? We measure uptime and Apdex, targeting the 99.995th percentile of users.

  • How many clicks are planned? Let’s say we have a very well-optimized campaign and a read:write ratio of 100:1 (100 redirects per link created).

Looking at the requirements, we can roughly estimate how many links clients will generate: 100 million / (30 days * 24 hours * 60 minutes * 60 seconds) ≈ 40 RPS, to which the 99.995% requirement applies. Based on the read:write ratio, the service needs to serve about 4000 redirect RPS during busy periods.
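
For the curious, here is the same back-of-the-envelope math as a tiny script; the 100 million links per month and the 100:1 ratio are the figures assumed above:

```python
# Rough capacity estimate for the URL shortener, using the numbers from the text.
SECONDS_PER_MONTH = 30 * 24 * 60 * 60        # 2,592,000
NEW_LINKS_PER_MONTH = 100_000_000
READ_TO_WRITE_RATIO = 100                    # 100 redirects per created link

write_rps = NEW_LINKS_PER_MONTH / SECONDS_PER_MONTH
read_rps = write_rps * READ_TO_WRITE_RATIO

print(f"link creation: ~{write_rps:.0f} RPS")   # ~39, round up to 40
print(f"redirects:     ~{read_rps:.0f} RPS")    # ~3858, plan for 4000
```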

By estimating roughly how much it costs to store one response to a short-link HTTP request with a 302 code and a redirect (about 300 bytes), we can calculate the backend throughput per day: 4000 RPS * 86,400 seconds = 345,600,000 requests per day. Obviously, with 346 million read operations per day, going to the database for the same record every time won’t help us much, but there is the Pareto principle: 80% of queries are always for 20% of the data. Given the simplicity of the service, the first thing I want to do is cache things in an IMDB (in-memory database), and memcached or Redis is great for such purposes. Applying the Pareto principle, we need to cache about 20% of that daily volume, so the memory requirement is in the region of 346 million * 0.2 * 300 bytes ≈ 21 GB of cache. Not even scary, and it will help us give almost 99.995% of customers a quick redirect to where they need to go.
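
As an illustration of the cache-aside idea with Redis, here is a minimal sketch; the redis-py calls are real, while load_long_url_from_db, the key layout and the TTL are hypothetical placeholders rather than part of the original design:

```python
import redis

# Assumption: a local Redis instance; the "short:<id>" key layout and the TTL
# are illustrative choices, not something prescribed by the article.
cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

CACHE_TTL_SECONDS = 24 * 60 * 60  # keep hot links around for a day


def load_long_url_from_db(short_id: str):
    """Hypothetical database lookup; stands in for the real storage layer."""
    return None  # stub


def resolve(short_id: str):
    key = f"short:{short_id}"
    long_url = cache.get(key)                   # 1. try the in-memory cache
    if long_url is not None:
        return long_url
    long_url = load_long_url_from_db(short_id)  # 2. fall back to the database
    if long_url is not None:
        cache.setex(key, CACHE_TTL_SECONDS, long_url)  # 3. warm the cache
    return long_url
```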

So, to sum up the math:

  • Number of links created per year: 1,200,000,000

  • Number of clicks on links per year: 126,144,000,000

  • Required storage space: 335.6GB

  • Required amount of memory for caching: 21GB.

It doesn’t look like big data, but it does make you think about whether classic databases are the right fit for storing the links.

Top-level architecture

Given simple User stories:

  • I, as a user, want to be able to create short links;

  • I, as a passer-by traffic generator, want to be able to get long links by clicking on short ones;

  • I, as an analyst, want to be able to see link clicks and export reports in real time.

And given the architecture we agreed on (caching database responses so that the database can breathe), it’s time to think about how to store the links without losing data, and how to let the database breathe some more.

So, what’s up with the databases?

Time to compare the characteristics of the storage patterns on which everything in the world is built:

RDBMS

The most common database architecture in the world, and for good reason. Constraints and rules rigidly defined at the schema level won’t let you insert garbage, and table indexes help you quickly assemble the data model you need. But do we need complex data models? Information about clicks is streamed to OLAP in real time, and a heavy mix of read and write requests hitting the master node will degrade it and will not give us 99.995% availability.

NoSQL databases

This is where things get interesting: let’s say we use the id of a short link as the key and the long link as the value. We need indexes on objects, so we should look towards Apache Cassandra: ZooKeeper is too slow for this, and Cassandra scales more easily and quickly than, say, MongoDB.

At the same time, Cassandra can guarantee consistency on a small number of nodes, but as the number of replicas grows, quorum questions arise (a small sketch after the list shows how this trade-off is expressed per query):

  • How many replicas must acknowledge a write before a response is returned to the client?

  • What to do with replicas that the write request did not reach?
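
Here is the promised sketch with the DataStax Python driver: the consistency level is set per statement, and QUORUM means a majority of replicas must acknowledge the write before the client gets its answer (the keyspace, table and node addresses are assumptions):

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Assumption: a "shortener" keyspace with a links(id, long_url) table already exists.
cluster = Cluster(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
session = cluster.connect("shortener")

# QUORUM: a majority of replicas must acknowledge the write before the client
# gets an answer; replicas the write did not reach are caught up later
# (hinted handoff / repair).
insert = SimpleStatement(
    "INSERT INTO links (id, long_url) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(insert, ("aB3xYz90", "https://example.com/very/long/path"))
```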

So let’s move on.

Best of both worlds

We already roughly understand what we need from the physical storage layer: fast reads and writes, plus replication and consistency so that important links are not lost. Guess where this is going? Everything falls apart into shards.

Shards and sharding patterns

I think few people have not yet heard of, or seen with their own eyes, the term “shards”. What does it mean?

Speaking of sharding, it makes sense to compare it with partitioning. In short, partitions are virtual regions of the data space within a single physical server, while shards are multiple regions of the data space spread across multiple servers. Explained in PostgreSQL terms (perhaps it will be easier for someone):

  1. I can divide one table into partitions by one attribute or another (for example, by time; see the sketch after this list)

  2. I can create new tables using the partitioning rules I created (master table slicing)

  3. The implementation of sharding in PostgreSQL is based on partitions, that is, there is one centralized (or replicated) master table and its partitions on neighboring servers.
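
As a sketch of point 1, here is declarative range partitioning by time in PostgreSQL, driven from Python via psycopg2; the connection string and the clicks table layout are illustrative assumptions:

```python
import psycopg2

# Assumption: the DSN and the clicks table layout are illustrative only.
conn = psycopg2.connect("dbname=shortener user=app")

DDL = """
CREATE TABLE IF NOT EXISTS clicks (
    short_id   text        NOT NULL,
    clicked_at timestamptz NOT NULL,
    ip         inet,
    user_agent text
) PARTITION BY RANGE (clicked_at);

-- One partition per month; new ones are created the same way as time goes on.
CREATE TABLE IF NOT EXISTS clicks_2024_01
    PARTITION OF clicks
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
"""

with conn, conn.cursor() as cur:
    cur.execute(DDL)
```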

On the other hand, let’s take Elasticsearch as an example. It implements an index pattern that lets you keep related data side by side, as in a classic RDBMS, but its sharding algorithms split the data into dozens of shards tied together by that index, distributing the data so that no single conditional master table can die, because the data is already distributed and there is no such master table. And given the requirements of our service, a link never changes once created, so there is no need to think about locking rows or objects during concurrent requests.
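
To make that concrete, a hedged sketch with the official Python client (8.x) that creates one logical index physically split into a dozen primary shards; the index name, shard counts and mappings are assumptions:

```python
from elasticsearch import Elasticsearch

# Assumption: a single local node; a real cluster would spread these shards
# across many machines.
es = Elasticsearch("http://localhost:9200")

# One logical index, physically split into 12 primary shards (plus replicas);
# Elasticsearch routes each document to a shard by hashing its routing key.
es.indices.create(
    index="links",
    settings={"number_of_shards": 12, "number_of_replicas": 1},
    mappings={
        "properties": {
            "short_id": {"type": "keyword"},
            "long_url": {"type": "keyword"},
        }
    },
)
```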

Distribution of data across shards can be either algorithmic or dynamic:

Algorithmic sharding

  • Algorithmic distribution implies predefined rules by which data is written to shards: if we have an index, we can shard it by insertion date or another parameter, which gives us separate shards that are trivial to find (see the sketch below).
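
A minimal sketch of the algorithmic approach, assuming a fixed shard count: the shard is derived from the key itself by a fixed rule, so any node can recompute where to look without asking a coordinator:

```python
import hashlib

SHARD_COUNT = 16  # assumption: a fixed, well-known number of shards


def shard_for(short_id: str) -> int:
    """Fixed rule: hash the key and take it modulo the shard count.

    Any node can recompute this, so reads know where to look without
    asking anyone first.
    """
    digest = hashlib.sha1(short_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % SHARD_COUNT


print(shard_for("aB3xYz90"))  # the same key always lands on the same shard
```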

Dynamic sharding

  • Dynamic distribution is implemented by “locators”: coordinators that decide where each piece of data will live depending on the index. There can be one or several locators, depending on the number of indexes and the need for them, but because of this, by design, a locator is a Single Point of Failure for a certain set of indexes, and the availability of that data under load is not guaranteed (see the sketch below).
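
And a contrasting toy sketch of the dynamic approach: a locator keeps a mutable key-to-shard mapping and every lookup has to go through it, which is exactly why it becomes a single point of failure (the in-memory dict stands in for a real coordinator service):

```python
# Toy locator: in reality this would be a separate, highly available service,
# and losing it makes the data it tracks unreachable.
class Locator:
    def __init__(self, shard_count: int) -> None:
        self.shard_count = shard_count
        self.placement: dict[str, int] = {}  # key -> shard id
        self._next = 0

    def place(self, key: str) -> int:
        """Decide (and remember) where a new key lives, round-robin here."""
        shard = self._next
        self.placement[key] = shard
        self._next = (self._next + 1) % self.shard_count
        return shard

    def locate(self, key: str) -> int:
        """Every read has to ask the locator first."""
        return self.placement[key]


locator = Locator(shard_count=16)
locator.place("aB3xYz90")
print(locator.locate("aB3xYz90"))
```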

The sharding story is well implemented in ClickHouse with its Distributed tables. Properly configured, ClickHouse has no single master, shards are replicated asynchronously (but very quickly, allowing for network latency), and this design lets it handle our 4000 read RPS thanks to the shards. ClickHouse also executes SELECT subqueries on all shards in the cluster, because it does not know in advance where everything is, so it is hard to lose anything.
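
A hedged sketch of what that looks like with clickhouse-driver: a local table on every shard plus a Distributed table on top, which spreads inserts by a sharding key and fans SELECTs out to all shards; the cluster name, schema and sharding key are assumptions, and replication settings are omitted:

```python
from clickhouse_driver import Client

client = Client("clickhouse-node-1")

# Local table that physically exists on every shard of the (assumed) cluster.
client.execute("""
    CREATE TABLE IF NOT EXISTS clicks_local ON CLUSTER shortener_cluster (
        short_id   String,
        clicked_at DateTime,
        ip         String,
        user_agent String
    ) ENGINE = MergeTree
    ORDER BY (short_id, clicked_at)
""")

# Distributed "umbrella" table: inserts are spread by the sharding key,
# SELECTs are fanned out to all shards and merged on the initiator node.
client.execute("""
    CREATE TABLE IF NOT EXISTS clicks ON CLUSTER shortener_cluster
    AS clicks_local
    ENGINE = Distributed(shortener_cluster, currentDatabase(), clicks_local, cityHash64(short_id))
""")
```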

Let’s now assume that we have arranged things so that the shard index is embedded in the link itself. Generating an 8-character short link from English upper- and lowercase letters plus 10 digits (62 symbols in total), we get 62^8 ≈ 2.2*10^14 possible links. Using the first two characters as a key / index, we get 62^2 = 3844 unique indexes into which links can be dynamically distributed.
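
Tying it together, a small sketch that generates a random 8-character ID from the 62-symbol alphabet and uses its first two characters as the shard/index key; the actual routing is left out:

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits  # 52 letters + 10 digits = 62 symbols


def new_short_id(length: int = 8) -> str:
    """62 ** 8 is roughly 2.2 * 10**14 possible IDs at length 8."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def shard_key(short_id: str) -> str:
    """First two characters: 62 ** 2 = 3844 possible shard / index keys."""
    return short_id[:2]


sid = new_short_id()
print(sid, "->", shard_key(sid))
```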

Is everything so good?

No, and this must be understood: the decision to use any of the databases I have talked about carries its own risks. Thinking in the sharding paradigm, you need to understand that the basic logical shard is an atomic unit tied to one node, and with replication the question of storage cost and vertical scaling arises immediately. Even with randomized data distribution (given that a short link is random every time), there can be problems with unbalanced space consumption, and then with data integrity if replication is configured poorly. With a large number of shards and a large number of read requests, SELECT subqueries can become a bottleneck without caching. Data consistency also depends on the design and on inserting things where they belong, in the correct form.

In conclusion…

In this article, I only touched on the basic ideas of data distribution and building a highload system on hardware, in a “thinking out loud” format. Replication strategies can follow shard rules or be chosen dynamically, regional replication is driven by the number of requests hitting servers in certain regions… The main thing is to understand that none of this is magic; everything is built quite logically, starting from business processes and the things we see as bottlenecks. Query performance across partitions does not even remotely resemble highload, and shards, whose whole idea is to relieve that suffering, have to be distributed more or less evenly, and finding the balance between convenience, performance and cost is not always easy. Many web services still work without sharding and get by somehow.

That’s all for part 1, good luck, guys!
