Have a great weekend everyone! We invite you to a free Demo lesson “Configuring a web server (Apache, Nginx, balancing Nginx)” held by Andrey Buranov, UNIX systems specialist at Mail.Ru Group. We are also publishing an article by Jonathan Corbet, Executive Editor at LWN.net.
Journaling file systems promise to relieve sysadmins of worries about on-disk corruption after a system crash, without even requiring a file system integrity check. In reality, of course, things are a little more complicated. And, as a recent discussion suggests, the situation may be even more complicated than many of us think, because keeping a journaled filesystem consistent comes at a cost in performance.
A file system like ext3 uses a separate area on disk called the journal (or log). When changes are made to the file system metadata, those changes are first written to the journal without modifying the rest of the file system. After all of the changes have been written to the journal, a “commit block” is added to mark the transaction as complete. Only after the commit block has been written is the transaction considered committed, and only then is the changed metadata written to its final place on disk. If the system crashes at some point, the information in the journal can be used to recover cleanly and avoid the damage that would result from only part of the metadata having been updated.
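The commit-block rule can be illustrated with a toy in-memory model. This is only a sketch of the idea, not the ext3 on-disk format: the `Journal` class, the record tuples, and the metadata keys are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Journal:
    """Toy journal: a list of ("data", txid, key, value) and ("commit", txid) records."""
    records: list = field(default_factory=list)

    def log_update(self, txid, key, value):
        # Metadata changes go to the journal first, not to the file system.
        self.records.append(("data", txid, key, value))

    def log_commit(self, txid):
        # The commit record marks the transaction as complete.
        self.records.append(("commit", txid))


def replay(journal, metadata):
    """Apply only the transactions whose commit record is present in the journal."""
    committed = {rec[1] for rec in journal.records if rec[0] == "commit"}
    for rec in journal.records:
        if rec[0] == "data" and rec[1] in committed:
            _, _, key, value = rec
            metadata[key] = value
    return metadata


journal = Journal()
journal.log_update(1, "inode_7.size", 4096)
journal.log_update(1, "bitmap.block_12", "used")
journal.log_commit(1)
journal.log_update(2, "inode_9.size", 8192)  # simulated crash: no commit for tx 2

recovered = replay(journal, {})
print(recovered)
```

Transaction 2 never got its commit record, so replay ignores it entirely instead of applying half of an update.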
However, there is one catch: the file system code must be absolutely sure that all of the transaction's data has reached the journal before the commit block is written. Simply issuing the writes in the correct order is not enough, because modern disks have large internal caches and reorder operations to improve performance. So, before writing the commit block, the file system must explicitly request that all of the journal data be transferred to the disk. If the commit block reaches the disk before the rest of the transaction, the journal may be corrupted. Barriers are used to address this problem. Essentially, a barrier prohibits the writing of any blocks issued after the barrier until all blocks written before the barrier have been flushed to disk. By using barriers, file systems can ensure that their on-disk structures remain consistent.
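A small simulation may help show what a barrier buys. It is a toy model under stated assumptions, not a description of any real drive: `flush_order` is a hypothetical function that lets writes queued between barriers reach the disk in any order, while a `"BARRIER"` marker forces everything queued before it to be written first.

```python
import random


def flush_order(writes, rng):
    """Return the order in which queued writes reach the disk in this toy model.

    Writes between two barriers may be reordered freely by the drive cache;
    a "BARRIER" entry forces all earlier writes out before any later ones.
    """
    on_disk, group = [], []
    for w in writes:
        if w == "BARRIER":
            rng.shuffle(group)      # the cache may reorder within a group
            on_disk.extend(group)
            group = []
        else:
            group.append(w)
    rng.shuffle(group)
    on_disk.extend(group)
    return on_disk


with_barrier = ["data1", "data2", "data3", "BARRIER", "commit"]
without_barrier = ["data1", "data2", "data3", "commit"]

# With the barrier, the commit block is always the last write to land.
always_last = all(
    flush_order(with_barrier, random.Random(seed))[-1] == "commit"
    for seed in range(100)
)

# Without it, some orderings put the commit block ahead of journal data.
sometimes_early = any(
    flush_order(without_barrier, random.Random(seed))[-1] != "commit"
    for seed in range(100)
)
print(always_last, sometimes_early)
```

A crash in the middle of one of the "early commit" orderings would leave a commit block on disk that vouches for journal data that never arrived.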
But there is another problem: the ext3 and ext4 file systems do not use barriers by default. The option exists, but unless the administrator has explicitly enabled it, these filesystems run without barriers, although some distributions (SUSE, for example) ship with different defaults. Eric Sandeen recently decided that this situation needed to change and posted a patch modifying the default settings for ext3 and ext4. And then a heated discussion began.
Andrew Morton explained in detail why the default is what it is:
The last time we tried to change this, performance on many workloads degraded by 30%, so I dropped all these patches in horror. I don't think we can go ahead and slow down everyone's machines that seriously …
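As a rough aid for checking how a filesystem was mounted, the sketch below scans a comma-separated mount option string for an explicit barrier setting. The spellings it looks for (`barrier=1`, `barrier=0`, and the ext4 forms `barrier` and `nobarrier`) are the ones I recall from the kernel's ext3/ext4 documentation of that period and should be verified against your kernel; the function name and the sample strings are hypothetical.

```python
def barrier_setting(mount_options):
    """Return True/False if the options explicitly enable/disable barriers.

    Returns None when no barrier option is present, in which case the
    filesystem's (or distribution's) default applies.
    """
    setting = None
    for opt in mount_options.split(","):
        opt = opt.strip()
        if opt in ("barrier", "barrier=1"):
            setting = True
        elif opt in ("nobarrier", "barrier=0"):
            setting = False
    return setting


# Hypothetical option strings, in the style of the fourth field of /proc/mounts.
print(barrier_setting("rw,barrier=1,data=ordered"))
print(barrier_setting("rw,nobarrier"))
print(barrier_setting("rw,relatime"))
```

A result of `None` only says that nothing was set explicitly; it does not reveal which default the running kernel uses.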
There are no perfect solutions here, and I’m leaning towards not waking this sleeping dog and leaving the default options to the discretion of the distribution developers.
Thus, by default, barriers are disabled as they seriously impact performance. In addition, file systems are used quite successfully without barriers. Reports of ext3 filesystem corruption are few and far between.
But it’s not just luck. Ted Ts’o explains that this is because the ext3 / ext4 journal is usually contiguous on disk. First, the filesystem code tries to make it contiguous. Second, the journal is usually created at the same time as the file system, when contiguous space is easy to find. A contiguous, sequentially written journal is good not only for performance but also for avoiding reordering. Typically, the commit block is placed immediately after the rest of the journal data, so the disk has no reason to reorder the writes, and the commit block naturally reaches the disk right after the rest of the transaction.
However, no one claims that this will always be the case. Disk drives can behave differently. In addition, the journal is a circular buffer, so when a transaction is written at the end of the journal, its commit block may wrap around and land in an earlier block, ahead of the other journal entries. So there is always a chance of damage. In fact, Chris Mason has a test… There is no doubt that working without barriers is less safe than working with them.
If you’re ready to take the performance hit, you can turn barriers on. That is, of course, provided that your filesystem does not sit on top of LVM (as it does by default in some distributions): it turns out that the device mapper doesn’t support barriers. Beyond that, it would be nice to reduce the performance cost, and it looks like that can be done.
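The wrap-around case is simple modular arithmetic. The sketch below uses a hypothetical eight-block journal and a hypothetical helper, `transaction_blocks`, to show where the commit block of a transaction lands.

```python
def transaction_blocks(start, n_data_blocks, journal_size):
    """Return (data block positions, commit block position) in a circular journal."""
    data = [(start + i) % journal_size for i in range(n_data_blocks)]
    commit = (start + n_data_blocks) % journal_size
    return data, commit


# A transaction in the middle of the journal: the commit block follows the data.
middle_data, middle_commit = transaction_blocks(start=0, n_data_blocks=3, journal_size=8)
print(middle_data, middle_commit)

# A transaction that fills the tail of the journal: the commit block wraps to block 0,
# so it sits at a lower block number than every data block of its own transaction.
tail_data, tail_commit = transaction_blocks(start=5, n_data_blocks=3, journal_size=8)
print(tail_data, tail_commit)
```

In the second case the commit block is no longer physically adjacent to the end of the transaction's data, so the "the disk has no reason to reorder" argument no longer applies.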
The current ext3 implementation (when barriers are enabled) performs the following sequence of operations for each transaction:
1. The transaction data is written to the journal.
2. A barrier is issued.
3. The commit block is written.
4. Another barrier is issued.
5. At some later point, the metadata is written to its final location on disk.
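The same ordering discipline can be imitated from user space. This is an analogy only: `os.fsync` is a file-level flush request, not the block-layer barrier the kernel uses, and whether it actually empties the drive's write cache depends on the filesystem and its settings. The function name, file name, and record contents are hypothetical.

```python
import os
import tempfile


def commit_transaction(journal_fd, data_blocks, commit_record):
    """Write one transaction to a journal file in the order described above."""
    for block in data_blocks:          # step 1: log the transaction data
        os.write(journal_fd, block)
    os.fsync(journal_fd)               # step 2: flush before the commit record
    os.write(journal_fd, commit_record)  # step 3: write the commit record
    os.fsync(journal_fd)               # step 4: flush again so the commit is durable
    # Step 5 (writing metadata to its final location) would happen later.


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "journal.bin")
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        commit_transaction(fd, [b"block-A\n", b"block-B\n"], b"COMMIT tx=1\n")
    finally:
        os.close(fd)
    with open(path, "rb") as f:
        contents = f.read()

print(contents)
```

Each `os.fsync` call is a point where the caller waits for the storage stack, which is a user-space view of why two flushes per transaction are expensive.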
In ext4, the first barrier (step 2) can be omitted because the ext4 filesystem supports journal checksums.
If the log data and commit block are reordered, and the operation is aborted as a result of a failure, then the log checksum will not match the one stored in the commit block, and the transaction will be rejected.
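The checksum idea can be sketched with a CRC from the Python standard library. This is an illustration only: the real jbd2 checksum algorithm and on-disk layout are not reproduced here, and the function names and block contents are hypothetical.

```python
import zlib


def journal_checksum(blocks):
    """Running CRC32 over all journal blocks of one transaction."""
    crc = 0
    for block in blocks:
        crc = zlib.crc32(block, crc)
    return crc


def transaction_is_valid(blocks_on_disk, commit_checksum):
    """Accept the transaction only if the data matches the checksum in the commit block."""
    return journal_checksum(blocks_on_disk) == commit_checksum


blocks = [b"metadata block 1", b"metadata block 2", b"metadata block 3"]
commit_checksum = journal_checksum(blocks)  # stored inside the commit block

# Simulated crash after reordering: the commit block reached the disk,
# but the last journal block still holds stale contents.
torn = [blocks[0], blocks[1], b"stale old contents"]

print(transaction_is_valid(blocks, commit_checksum))
print(transaction_is_valid(torn, commit_checksum))
```

Because a torn transaction fails the check and is discarded at replay time, the flush between the journal data and the commit block is no longer needed for correctness.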
Chris Mason believes that it would be “generally safe” to remove this barrier in ext3 as well, with the possible exception of the case where the journal reaches its end and wraps around to the beginning.
Another idea for speeding things up is to postpone barrier operations whenever possible. If there is no urgent need to flush data to disk immediately, several transactions can be accumulated in the journal and then flushed to disk with a single barrier.
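A back-of-the-envelope count shows why batching is attractive. The model below is deliberately simplistic and rests on two assumptions taken from the text: two barriers per transaction in the unbatched ext3 sequence, and one barrier per batch when transactions are accumulated. The function names and workload numbers are hypothetical.

```python
def barriers_unbatched(n_transactions, barriers_per_transaction=2):
    """Barriers issued when every transaction is flushed on its own."""
    return n_transactions * barriers_per_transaction


def barriers_batched(n_transactions, batch_size):
    """Barriers issued when each batch of transactions shares a single barrier."""
    batches = -(-n_transactions // batch_size)  # ceiling division
    return batches


n = 100
print(barriers_unbatched(n))                # 200
print(barriers_batched(n, batch_size=10))   # 10
```

The price of batching is latency: a transaction in a batch is not durable until the shared barrier has completed, which is why it only suits cases with no urgent need to flush.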
There is also some room for improvement by carefully sequencing operations so that barriers (usually implemented as “flush all pending operations to disk” requests) do not force writes to blocks that do not need to be ordered.
It seems that the time has come to think about how to make the cost of barriers acceptable. Ted Ts’o seems to think along the same lines:
I think we should enable barriers in ext3 / 4 and then work to reduce the overhead in ext4 / jbd2. Chances are, the vast majority of systems do not run under conditions similar to those Chris used to demonstrate the problem, but the safety of the filesystem by default should be a priority.
Common sense tells me that this dog is no longer asleep and will probably bark for a while. This may disturb some neighbors, but it is better than letting it bite.
Interested in developing your skills in this direction? Sign up for the free Demo lesson “Configuring a web server (Apache, Nginx, balancing Nginx)” and take part in the broadcast “Working with logs in Linux” conducted by Pavel Vikiryuk, DevOps engineer at an MVNO communications operator.