Further in the article, some facts will be marked with special symbols:
Features that can be very useful in practice.
Limitations, which should be remembered, but if you do not know about them and discover during development – not a problem, they can be circumvented.
Limitations that can become a big problem if you did not take them into account at the design stage.
Cosmos DB supports several APIs: SQL, Cassandra, MongoDB, Gremlin, Table. Here, SQL refers to a document-oriented API, which used to be called DocumentDB, and it differs significantly from the usual relational databases. This article is based on experience with the SQL API. I want to draw attention to the fact that the type of API is selected when creating the storage instance, it will not work with the data of the same instance through different APIs.
The Cosmos DB document model stores data in containers consisting of items. Earlier in the documentation, they were called collections and documents, respectively. All settings for scaling, bandwidth, and indexing are indicated at the container level (with some exceptions, which we will discuss later). A database, by and large, is a named union of containers.
And here it’s worth immediately mentioning the first limitation:
The maximum size of an element in a container is 2 MB. It should not be a problem in most cases, but remember that size is limited and storing media content in container elements is not a good idea. More information on resource limits can be found. here.
Cosmos DB allows you to manage performance metrics (bandwidth) for each container individually.
Throughput is measured in units of request per second (Request Units per second or abbreviated RU / sec). An approximate equivalent of a request unit can be considered reading an element of 1 KB in size by its identifier. For example, the required throughput of a container that allows you to make 500 reads and 100 records of one-kilobyte elements per second will be approximately equal to 500 * 1 + 100 * 5 = 1000 RU. However, in the general case, it is practically impossible to calculate the required throughput with accuracy, since the complexity of the requests and the size of the elements can be very different.
Any operation performed by the database has its own “price” in terms of RU. The throughput indicated on the container is the limit that Cosmos DB allows to “spend” per second. The base monitors the total “price” of requests within a second and if the limit has already been reached, subsequent requests will not be accepted until the amount per second returns to a value less than the specified limit.
The minimum possible throughput for a container is 400 RU and will cost about $ 25 per month. Cost increases linearly with increasing RU – a container with 800 RU will already cost about $ 50 per month.
Throughput is indicated at the container level and can be changed at any time without any additional procedures.
It is possible to specify bandwidth at the database level. In this case, all containers will use the same “capacity”.
There is also an “autopilot” mode (currently in the preview stage), which allows you to automatically increase or decrease the declared bandwidth depending on the load on the database. Its disadvantage is that the minimum and maximum values that you can configure are in the ratio of 1:10, i.e. you cannot make it so that at peak hours the metric is set to 40000 RU, and at rest – 400 RU.
In case of exceeding the declared bandwidth, Cosmos DB simply does not take new requests for execution and in response returns a special status of HTTP 429 (“Request rate too large”).
To switch between configuring bandwidth at the base level and configuring at the container level, the database must be re-created, a smoother transition has not yet been provided.
Partitioning for infinite scalability
Cosmos DB used to have two options for containers: with partitions (partitioned) and without (non-partitioned). Partitionless containers were limited by a maximum throughput of 10,000 RU and a size of 10 gigabytes. At the moment, all containers can have partitions, so when creating them, you must specify the partition key.
An important point is to understand the difference between logical and physical partitions: logical consists of a set of elements that have the same partition key value, while the physical partition is the Cosmos DB “computing unit”, a node of its physical infrastructure that can process several logical partitions. Depending on the amount of data and the distribution of records by the partition key, the system can create new physical partitions and redistribute logical partitions between them.
All the bandwidth allocated to the container is evenly distributed among its physical partitions. Imagine that initially a container with a bandwidth of 1000 RU has one section. When its size reaches the limit of 10 GB, the system will divide it into two parts. This will mean that if initially any request running on this container had at its disposal 1000 RU per second, now requests related to the same partition will already be limited by a bandwidth of 500 RU.
Although the number of partitions is not limited, the maximum amount of data for one physical partition is 10 GB and the maximum throughput is 10,000 RU. The partitioning key should be chosen so that the probability of reaching these limits is minimal. It should be noted that the same logical partition cannot be divided between several physical ones.
Instead of creating indexes for individual fields or their combinations, Cosmos DB allows you to customize indexing policy for paths inside an object. A policy is a set of attributes: which ways to include in indexing, which to exclude, what types of indexes to use, etc.
An interesting point is that, unlike most other DBMSs, Cosmos DB uses an inverted index instead of the classic B-tree, which makes it very effective when searching by several criteria and does not require the creation of composite indexes for search across multiple fields. More details on this topic can be found in the article on this link.
Indexing policy can be changed at any time.
Two indexing modes are available: consistent (consistent) and lazy (lazy). Lazy makes recording faster, but harms the consistency of writing and reading, because in this case, indexing occurs in the background after the write operation has completed. While the index is being updated, queries may return irrelevant data.
You can create compound indexes to speed up the ORDER BY operation across multiple fields; in other cases, compound indexes are useless.
There is support for spatial indices.
The indexing policy created on the default container (“index all fields”) can cause a large RU consumption when writing items with a large number of fields.
Tracking changes in Cosmos DB is possible thanks to a mechanism called Change feed. It returns documents that were changed from a certain point in time in the order in which they were changed. This initial moment of time can be flexibly controlled: it can be either the moment of the initialization of the flow of changes itself, or a fixed time stamp, or the moment the container is created.
Changes can be processed asynchronously and incrementally, and the result can be distributed among one or more consumers for parallel processing. All this gives very great flexibility in various integration scenarios.
If you use the low-level API for Change Feed, be sure to consider the partition split, which can occur when the partition reaches its maximum size.
This mechanism does not track deletions, so it is better to use soft delete (soft delete) in order to be able to respond to the removal of elements.
Stored Procedures and Transactions
There are no convenient ways to return additional error information from stored procedures. The only way to solve the problem is to add information to the error message, and then on the client “cut” this data from the message.
Transactions work only within stored procedures and functions or transactional packagesAvailable since .NET SDK version 3.4 and higher.
Only documents from the same logical partition can be included in a single transaction. Accordingly, write operations to different containers cannot be performed transactionally.
One stored procedure execution time is limited (5 seconds), and if the transaction duration goes beyond this, it will be canceled. There are ways to implement “long-lived” transactions through several server calls, but they violate atomicity.
Although Microsoft calls the document API “SQL”, its language is just similar to SQL and has many differences from what we are used to seeing in relational databases.
The API has parameters that help prevent the execution of “expensive” queries: EnableCrossPartitionQuery, EnableScanInQuery (see class Feedoptions in the documentation). Cross-partition queries are received when the condition does not contain a fixed value for the partition key. Dataset scans can occur when a query condition contains non-indexed fields. Setting both parameters to false is a good way to deal with excessive consumption of RU. However, in some cases, running a query on multiple partitions may be useful.
The GROUP BY clause is already supported (was added in November 2019).
The aggregate functions MIN and MAX do not seem to use indexes.
The JOIN keyword is present in the language, but is used to “expand” nested collections. Connecting elements from different containers in one query, like regular SQL, will fail.
Since Cosmos DB does not imply a strict data scheme, the field specified in the condition may not exist in some elements. Therefore, any query that imposes conditions on optional fields should also contain a check of their existence through the IS_DEFINED function. Checking for NULL may not be enough.
More tips about querying can be found in cheat sheetspublished by Microsoft.
Other useful features
- Element lifetime – can be specified either by default at the container level, or the application can set the lifetime for each element individually.
- Five integrity levels: bounded staleness, session (default), consistent prefix, eventual.
- Unique keys (note that to change the structure of the unique key you will have to recreate the container).
- Optimistic block – each element has a special field “_etag”, updated by the DBMS itself. The application code during the update of an element can indicate a condition – allow recording only if the value of the “_etag” field in the object transferred by the application is equal to the value stored for this element in the database.
- Geo replication – It is very easy to configure for any number of regions. In a configuration with one “master”, you can switch the main region at any time, or this switch will be automatic in case of a failure at the data center level. The client from the SDK automatically responds to these switches, without requiring any additional actions from the developers to handle such situations.
- Data encryption
- Record in several regions – allows you to scale recording operations. It should be remembered that the presence of several copies of data that allow recording almost always implies possible conflicts: the same element is changed in two regions at the same time, several elements with the same primary key are added in different regions, etc. Fortunately, Cosmos DB provides two ways to resolve conflicts: automatic (the last write operation wins) or “your own version”, in which you can implement an algorithm that suits your conditions.
As you can see from the above, Azure Cosmos DB has a large number of advantages that make it a good choice for various projects. But there is nothing ideal, and some restrictions can be a serious obstacle to applying this technology: if you need transactions consisting of actions on several containers, or you need long transactions that include hundreds and thousands of objects, if the data cannot be effectively partitioned and at the same time they can go beyond 10 GB, etc. If none of the limitations mentioned in this article seems like a big problem for your project, it makes sense to consider Azure Cosmos DB.
Thanks to l0ndra for helping me prepare this article.
P.S. This article I already posted earlier in English, but on a different resource.