14 things I wish I had known before getting started with MongoDB

This translation was prepared ahead of the start of the course “Non-relational databases”.


Highlights:

  • Design a schema, even though MongoDB does not require one.
  • Likewise, indexes must match your schema and access patterns.
  • Avoid using large objects and large arrays.
  • Be careful with MongoDB settings, especially when it comes to security and reliability.
  • MongoDB lacks the kind of query optimizer that relational databases have, so you must be careful how you order query operations.

I have been working with databases for a very long time, but only recently discovered MongoDB. There are a few things I wish I had known before getting started with it. When you already have experience in a field, you bring preconceived ideas about what databases are and what they do. In the hope of making things easier for other people, here is a list of common mistakes:

Creating a MongoDB server without authentication

Unfortunately, MongoDB is installed without authentication by default. That is fine on a workstation that is accessed only locally. But since MongoDB is a multi-user system that likes to use large amounts of memory, it is better to put it on a server with as much RAM as possible, even if you are only going to use it for development. Installing it on a server with the default port can be a problem, especially if arbitrary JavaScript can be executed in a query (for example, $where as a vector for injection).

There are several authentication methods, but the easiest is to set a user ID and password. Use that while you think about fancier authentication based on LDAP. When it comes to security, MongoDB must be kept up to date, and the logs should be checked regularly for unauthorized access. I also like to choose a port other than the default.
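
As a minimal sketch of the user ID and password approach, the snippet below builds a connection string with percent-encoded credentials and a non-default port. The user name, password, host and port are invented for illustration; only the `mongodb://` URI format and the `authSource` option are standard.

```python
from urllib.parse import quote_plus


def build_mongo_uri(user: str, password: str, host: str, port: int,
                    auth_db: str = "admin") -> str:
    """Return a mongodb:// URI with percent-encoded credentials."""
    # Characters such as '@', ':' and '/' must be escaped in the user info part.
    return (
        f"mongodb://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/?authSource={quote_plus(auth_db)}"
    )


# Hypothetical account and host; 27117 stands in for "any port but the default".
uri = build_mongo_uri("app_user", "p@ss:word/1", "db.example.internal", 27117)
print(uri)
```

A string like this can be handed to whichever driver you use. In real code the password would come from a secret store rather than a literal.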

Forgetting to reduce MongoDB’s attack surface

The MongoDB Security Checklist contains good advice for reducing the risk of network intrusion and data leakage. It is easy to dismiss it and say that a development server does not need a high level of security. It is not that simple: the advice applies to all MongoDB servers. In particular, unless there is a compelling reason to use mapReduce, group or $where, you should disable arbitrary JavaScript by setting javascriptEnabled: false in the configuration file. Since the data files are not encrypted in standard MongoDB, it also makes sense to run MongoDB under a dedicated user that has full access to the data files, with access restricted to that user alone, using the operating system’s own file access controls.
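
The checks above can be sketched as a small audit over a parsed configuration file. This assumes the YAML has already been loaded into nested dictionaries; the function name and the warning texts are hypothetical, while `security.authorization`, `security.javascriptEnabled`, `net.bindIp` and `net.port` are the usual mongod.conf option names. The port check reflects the preference stated earlier, not a hard rule.

```python
def audit_mongod_config(cfg: dict) -> list[str]:
    """Return warnings for a parsed mongod.conf (nested dicts mirroring the YAML)."""
    warnings = []
    security = cfg.get("security", {})
    net = cfg.get("net", {})
    if security.get("authorization") != "enabled":
        warnings.append("security.authorization is not enabled")
    # Server-side JavaScript stays on unless it is explicitly switched off.
    if security.get("javascriptEnabled", True):
        warnings.append("security.javascriptEnabled is not false")
    if "0.0.0.0" in str(net.get("bindIp", "")):
        warnings.append("net.bindIp listens on all interfaces")
    if net.get("port", 27017) == 27017:
        warnings.append("net.port is the default 27017")
    return warnings


hardened = {
    "security": {"authorization": "enabled", "javascriptEnabled": False},
    "net": {"bindIp": "127.0.0.1,10.0.0.5", "port": 27117},
}
print(audit_mongod_config(hardened))
print(audit_mongod_config({}))
```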

Neglecting to design a schema

MongoDB does not enforce a schema. But that does not mean a schema is not needed. If you just want to store documents without any consistent schema, saving them can be quick and easy, but retrieving them later can be damned hard.

The classic article “6 Rules of Thumb for MongoDB Schema Design” is well worth reading, and features such as Schema Explorer in the third-party tool Studio 3T are worth using for regular schema checks.
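
One way to make a designed schema explicit is a `$jsonSchema` validator attached to the collection. The article does not prescribe this, so treat the following as an illustrative sketch: the “customers” collection and every field name are invented, and the small helper only mimics the `required` check locally.

```python
# Hypothetical "customers" collection; all field names are invented.
customer_validator = {
    "$jsonSchema": {
        "bsonType": "object",
        "required": ["name", "email", "created_at"],
        "properties": {
            "name": {"bsonType": "string"},
            "email": {"bsonType": "string", "pattern": "^.+@.+$"},
            "created_at": {"bsonType": "date"},
            # Capping the array size also keeps documents small.
            "tags": {"bsonType": "array", "maxItems": 50,
                     "items": {"bsonType": "string"}},
        },
    }
}


def missing_required(doc: dict, validator: dict) -> list[str]:
    """List required fields absent from doc (a local stand-in for the server check)."""
    required = validator["$jsonSchema"].get("required", [])
    return [field for field in required if field not in doc]


print(missing_required({"name": "Ann"}, customer_validator))
```

With a driver such as PyMongo, a document like `customer_validator` would typically be passed as the `validator` option when the collection is created.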

Forgetting about collation (sort order)

Forgetting about collation can cause more frustration and wasted time than any other misconfiguration. By default MongoDB uses a binary collation, which is unlikely to be useful to anyone. Case-sensitive, accent-sensitive, binary collations were already considered curious anachronisms, along with beads, caftans and curly mustaches, back in the 1980s. Today their use is unforgivable. In real life “motorcycle” is the same as “Motorcycle”, and “Britain” and “britain” are one and the same place. A lowercase letter is simply another form of its capital. And do not get me started on accent-sensitive collation. When you create your MongoDB collections, use an accent-insensitive, case-insensitive collation that matches the language and culture of the system’s users. This will make searching string data much simpler.
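
The difference is easy to see outside the database. The sketch below compares a binary (code point) sort with a case- and accent-folded sort in plain Python; `fold` is a rough approximation of what a strength-1 collation ignores, not MongoDB’s actual ICU implementation, and the word list is invented.

```python
import unicodedata


def fold(text: str) -> str:
    """Strip accents and case, roughly what a strength-1 collation ignores."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.casefold()


words = ["britain", "Zebra", "Éclair", "apple", "Britain", "eclair"]

binary_order = sorted(words)            # capitals first, accented letters last
folded_order = sorted(words, key=fold)  # dictionary-like order

# Collation document in the shape MongoDB accepts; strength 1 compares base
# characters only, ignoring case and diacritics.
insensitive_collation = {"locale": "en", "strength": 1}

print(binary_order)
print(folded_order)
```

In the binary order “Zebra” sorts before “apple”, which is rarely what a user expects.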

Creating collections with large documents

MongoDB is happy to accommodate documents of up to 16 MB in collections, and GridFS is designed for documents larger than 16 MB. But just because large documents can be stored there does not mean it is a good idea to do so. MongoDB works best if you keep individual documents to a few kilobytes in size, treating them more like rows in a wide SQL table. Large documents will cause performance problems.

Creating documents with large arrays

Documents can contain arrays. It is best to keep the number of elements in an array well below four figures. If elements are added to the array frequently, the document will outgrow its allocated space and have to be moved, so the indexes will have to be updated as well. Re-indexing a document with a large array means rewriting many index entries, because there is an index entry for every array element. This re-indexing also happens when a document is inserted or deleted.

MongoDB has a so-called “padding factor” that leaves room for documents to grow in order to minimize this problem.
You might think that you could get by without indexing the arrays. Unfortunately, the lack of indexes can cause other problems. Since documents are scanned from start to end, finding elements near the end of an array takes longer, and most operations on such a document will be slow.

Forgetting that the order of stages in an aggregation matters

In a database system with a query optimizer, the queries you write describe what you want, not how to get it. It works like ordering in a restaurant: usually you just order a dish and do not give the chef detailed instructions.

In MongoDB, you instruct the cook. For example, you need to make sure that the data is reduced as early as possible in the pipeline using $match and $project, that sorting happens only after the data has been reduced, and that lookups happen in exactly the order you intend. Having a query optimizer that eliminates unnecessary work, orders the stages optimally, and chooses the type of join can spoil you. In MongoDB, you have more control at the cost of convenience.
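
A pipeline ordered along those lines might look like the sketch below: filter first, trim the fields, and only then group and sort the reduced data. The “orders” collection and all field names are assumptions made for the example.

```python
from datetime import datetime, timezone

# Hypothetical "orders" collection; every field name here is invented.
pipeline = [
    # 1. Reduce the number of documents as early as possible.
    {"$match": {"status": "shipped",
                "created_at": {"$gte": datetime(2024, 1, 1, tzinfo=timezone.utc)}}},
    # 2. Reduce the size of each document.
    {"$project": {"_id": 0, "customer_id": 1, "total": 1}},
    # 3. Do the expensive work on what is left.
    {"$group": {"_id": "$customer_id", "spent": {"$sum": "$total"}}},
    # 4. Sort only the reduced result.
    {"$sort": {"spent": -1}},
]


def stage_names(stages: list[dict]) -> list[str]:
    """Return the operator name of each stage, in pipeline order."""
    return [next(iter(stage)) for stage in stages]


print(stage_names(pipeline))
```

The list is a plain data structure, so it can be built, inspected and tested before it is ever passed to a driver’s aggregate call.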

Tools like Studio 3T simplify building aggregation queries in MongoDB. Its Aggregation Editor lets you apply pipeline operators one stage at a time and check the input and output data at each stage, which makes debugging easier.

Using fast writes

Never configure MongoDB writes for high speed at the cost of reliability. This “fire-and-forget” mode seems fast because the command returns before the write has actually happened. If the system crashes before the data is written to disk, the data is lost and the database may be left in an inconsistent state. Fortunately, 64-bit MongoDB has journaling enabled by default.

The MMAPv1 and WiredTiger storage engines use journaling to prevent this, although WiredTiger can recover to the last consistent checkpoint if journaling is disabled.

Journaling ensures that the database is in a consistent state after recovery and that all data up to the last journal write is preserved. The interval between journal writes is configured with the commitIntervalMs parameter.

To be sure of your writes, make sure journaling is enabled in the configuration file (storage.journal.enabled) and that the interval between journal writes corresponds to the amount of data you can afford to lose.
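
On the client side, the trade-off shows up in the write concern. The sketch below contrasts a fire-and-forget write concern with one that waits for the journal; the constant names and the helper are hypothetical, while `w`, `j` and `wtimeout` are the standard write concern fields.

```python
# Fire-and-forget: the client does not wait for any acknowledgement.
FIRE_AND_FORGET = {"w": 0}

# Wait until a majority of members have the write in their on-disk journal,
# giving up after 5000 ms.
ACKNOWLEDGED_JOURNALED = {"w": "majority", "j": True, "wtimeout": 5000}


def waits_for_journal(write_concern: dict) -> bool:
    """True if the write is acknowledged only after reaching the journal."""
    acknowledged = write_concern.get("w", 1) != 0
    return acknowledged and write_concern.get("j") is True


print(waits_for_journal(FIRE_AND_FORGET))
print(waits_for_journal(ACKNOWLEDGED_JOURNALED))
```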

Sorting without an index

Searches and aggregations often need to sort the data. Ideally this is done in one of the final stages, after the result has been filtered, to reduce the amount of data being sorted. Even then, the sort needs an index. It can be a single-field or a compound index.

If there is no suitable index, MongoDB sorts without one. In-memory sort operations are limited to 32 MB for the total size of all documents, and if MongoDB reaches this limit it will either throw an error or return an empty record set.
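
Whether an index can serve a sort depends on the order and direction of its keys. The function below is a simplified sketch of that rule, assuming no equality filter on leading keys: the sort keys must form a prefix of the index keys, with directions either all matching or all reversed. The field names are invented.

```python
Key = tuple[str, int]  # (field name, 1 for ascending or -1 for descending)


def index_supports_sort(index_keys: list[Key], sort_keys: list[Key]) -> bool:
    """Simplified check: is the sort a prefix of the index, forwards or backwards?"""
    if not sort_keys or len(sort_keys) > len(index_keys):
        return False
    prefix = index_keys[:len(sort_keys)]
    same = all(i == s for i, s in zip(prefix, sort_keys))
    reverse = all(i[0] == s[0] and i[1] == -s[1] for i, s in zip(prefix, sort_keys))
    return same or reverse


# Hypothetical compound index on an "orders" collection.
index = [("customer_id", 1), ("created_at", -1)]

print(index_supports_sort(index, [("customer_id", 1)]))
print(index_supports_sort(index, [("customer_id", -1), ("created_at", 1)]))
print(index_supports_sort(index, [("customer_id", 1), ("created_at", 1)]))
print(index_supports_sort(index, [("created_at", -1)]))
```

The last two sorts are the ones that would fall back to an in-memory sort and run into the limit described above.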

Lookups without supporting indexes

Lookups perform a function similar to a JOIN in SQL. To perform well, they need an index on the key value used as the foreign key. This is not obvious because that usage is not reported in explain(). Such indexes are in addition to the index reported in explain(), which in turn is used by the $match and $sort pipeline operators when they appear at the beginning of the pipeline. Indexes can now cover any stage of an aggregation pipeline.
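
As a sketch of which index a lookup needs, the helper below reads a `$lookup` stage and returns the collection and key that should be indexed. The collections, field names and the helper itself are hypothetical; `from`, `localField`, `foreignField` and `as` are the standard `$lookup` fields.

```python
# Hypothetical join from "orders" to "customers".
lookup_stage = {
    "$lookup": {
        "from": "customers",
        "localField": "customer_id",
        "foreignField": "customer_no",
        "as": "customer",
    }
}


def index_for_lookup(stage: dict) -> tuple[str, list[tuple[str, int]]]:
    """Return (collection, key spec) for the index that supports this $lookup."""
    spec = stage["$lookup"]
    # The foreign collection is probed once per input document, so the
    # index belongs on its foreignField.
    return spec["from"], [(spec["foreignField"], 1)]


print(index_for_lookup(lookup_stage))
```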

Forgetting to opt in to multi-updates

The db.collection.update() method modifies part of an existing document or the whole document, up to a complete replacement, depending on the update parameter you supply. What is less obvious is that it will not process all the documents in a collection unless you set the multi option to update every document that matches the query criteria.
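
The behavior can be modeled in a few lines of plain Python. This is a toy in-memory stand-in, not the driver: `update`, `make_accounts` and the sample data are invented to show that without `multi` only the first matching document changes.

```python
def update(docs: list[dict], matches, changes: dict, multi: bool = False) -> int:
    """Toy model of update semantics; returns the number of documents modified."""
    modified = 0
    for doc in docs:
        if matches(doc):
            doc.update(changes)
            modified += 1
            if not multi:
                break  # default: stop after the first match
    return modified


def make_accounts() -> list[dict]:
    return [{"id": 1, "plan": "trial"}, {"id": 2, "plan": "paid"}, {"id": 3, "plan": "trial"}]


def is_trial(doc: dict) -> bool:
    return doc["plan"] == "trial"


single = make_accounts()
n_single = update(single, is_trial, {"plan": "expired"})

many = make_accounts()
n_many = update(many, is_trial, {"plan": "expired"}, multi=True)

print(n_single, n_many)
```

Current drivers make the choice explicit with separate single-document and many-document update methods, which removes the surprise.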

Forgetting the significance of key order in an object

In JSON, an object consists of an unordered collection of zero or more name/value pairs, where a name is a string and a value is a string, number, boolean, null, object, or array.

Unfortunately, BSON attaches significance to order when searching. In MongoDB, the order of keys within embedded objects matters, i.e. { firstname: "Phil", surname: "factor" } is not the same as { surname: "factor", firstname: "Phil" }. That is, you must preserve the order of the name/value pairs in your documents if you want to be sure of finding them.
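
The contrast with an unordered comparison can be illustrated in Python, where dictionaries compare equal regardless of key order. `ordered_equal` is a hypothetical helper that mimics an order-sensitive match on an embedded document; the dotted query at the end is one common way to avoid depending on key order, with “name” as an assumed parent field.

```python
a = {"firstname": "Phil", "surname": "factor"}
b = {"surname": "factor", "firstname": "Phil"}


def ordered_equal(x: dict, y: dict) -> bool:
    """Compare two objects pair by pair, so that key order matters."""
    return list(x.items()) == list(y.items())


print(a == b)               # unordered comparison, as in JSON
print(ordered_equal(a, b))  # ordered comparison, as in an embedded-document match

# Matching on individual fields with dot notation does not depend on order.
order_independent_query = {"name.firstname": "Phil", "name.surname": "factor"}
```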

Confusing “null” and “undefined”

The value “undefined” has never been valid in JSON according to the official JSON standard (ECMA-404, Section 5), despite being used in JavaScript. Moreover, it is deprecated in BSON and is converted to $null, which is not always a happy solution. Avoid using “undefined” in MongoDB.
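
A standards-compliant JSON parser shows the difference directly. In this small sketch, Python’s `json` module accepts `null` and rejects `undefined`; the field name is invented.

```python
import json

# null is a valid JSON value and maps to None.
parsed = json.loads('{"middle_name": null}')

# undefined is not part of JSON, so a strict parser refuses it.
try:
    json.loads('{"middle_name": undefined}')
    undefined_rejected = False
except json.JSONDecodeError:
    undefined_rejected = True

print(parsed, undefined_rejected)
```

A key holding null is still present in the document, which is different from the key being absent altogether.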

Using $limit() without $sort()

Very often, when developing in MongoDB, it is useful to see just a sample of the result that a query or aggregation will return. $limit() does that job, but it should never appear in the final version of the code unless it is preceded by $sort. Otherwise you cannot guarantee the order of the result and cannot reliably page through the data. You will get different records at the top of the result depending on how it happens to be ordered. To work reliably, queries and aggregations must be deterministic, that is, produce the same results every time. Code that contains $limit() but no $sort is not deterministic and can cause bugs that are hard to track down.
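
A check like this can be automated before code ships. The sketch below is deliberately strict and simplistic: it accepts a pipeline only if every `$limit` stage is immediately preceded by a `$sort`. The function name and sample pipelines are hypothetical, and including `_id` in the sort is one way to break ties between equal sort values.

```python
def limit_is_deterministic(pipeline: list[dict]) -> bool:
    """True if every $limit stage directly follows a $sort stage."""
    previous = None
    for stage in pipeline:
        name = next(iter(stage))
        if name == "$limit" and previous != "$sort":
            return False
        previous = name
    return True


sample_only = [{"$match": {"status": "shipped"}}, {"$limit": 5}]
production = [
    {"$match": {"status": "shipped"}},
    # _id as a tie-breaker keeps the order stable when created_at repeats.
    {"$sort": {"created_at": -1, "_id": 1}},
    {"$limit": 5},
]

print(limit_is_deterministic(sample_only))
print(limit_is_deterministic(production))
```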

Conclusion

The only way to be disappointed with MongoDB is to compare it directly with another type of database, such as an RDBMS, or to come to it with particular expectations. It is like comparing an orange to a fork. Database systems serve particular purposes. It is best simply to understand and appreciate these differences. It would be a shame to pressure the MongoDB developers into following the RDBMS path. I want to see new and interesting ways of solving old problems, such as ensuring data integrity and building data systems that are resilient to failure and malicious attack.

The introduction of ACID transactions in MongoDB 4.0 is a good example of important improvements being made in an innovative way. Multi-document, multi-statement transactions are now atomic. It is also possible to adjust the time allowed to acquire locks, to expire hung transactions, and to change the isolation level.
