It looks like my post about MongoDB got a lot more popular than usual, and also provoked a sort-of official response from the MongoDB developer(s). It is fair to mention them together, so that people who find one part of the story can also find the other. Since my original post covers multiple issues and the comments wander through various topics, I want to summarize the durability part of the discussion here.
It is not hard to notice I'm biased toward "classical" databases - I don't mean SQL here, since I also love BerkeleyDB, but I consider the issue of reliability in databases "mostly solved" over the past decades. On the other hand I also understand that there are valid use cases where it simply isn't worth going to all that trouble, for reasons like "data not important" or "performance is #1 priority" or some variation thereof.
All of this is ok - as long as users understand the tradeoff: what is gained and what is lost. I felt that this choice was not presented adequately to users of MongoDB. The current situation is a bit better, mostly because of their blog post referenced above, but there is still some confusion of issues, and some attempted solutions that look good but do nothing in practice (the "periodic fsync" feature). Users who really do understand that no part of MongoDB in any combination is durable in the ACID sense (or rather, is about as durable as memcached) can and do use MongoDB effectively, and there is no problem with that. These users probably also know everything I will write in this post and can safely skip the rest of it - nothing new will be presented here. For the others, it might be informative to read on.
One of my interests is operating system design, and within this, file system design. It is really insightful to track the development of file systems which have both good performance and durability over time - and this ability to track and use the products of this evolution of technology is what attracted me to FreeBSD. All this is not a digression because of one simple but often overlooked fact: file systems are databases of files, plain and simple.
File systems have been in use for a long time, and over time all of them have attempted to address the conflicting requirements of performance versus durability in some way. This development resulted in many ideas on how data can be kept safe with reasonable performance:

- Fully synchronous writes
- Partially synchronous writes
- Logging
- Journalling
- Soft-updates
These solutions are of course used on some kind of hardware - it is impossible to implement any algorithm if the hardware doesn't support certain semantics. For the purpose of durable writes, the only "extra" thing the hardware needs to provide is the ability to guarantee, at a given point in time, that previously issued writes are safely committed to storage. This is achievable on *all* non-broken hardware, from desktop-class hard drives to server RAID arrays - the only difference is the performance penalty involved. Desktop drives need to pause everything until the data is written, leading to really awful performance. Server systems usually have RAID controllers with battery-backed-up (BBU) memory, which guarantees that the data will not disappear even if it is not yet written to the actual hard drives - and thus can offer much better performance by making the "SYNC" command a no-op. This is really all that BBU units do - allow better performance with synchronous writes; there is no magic here. The file system durability methods enumerated above make use of this hardware guarantee in different ways.
Fully synchronous writes
As it says in the name, everything is written synchronously - every write command is followed by a sync command or something to that effect. This in itself is not enough, though. Writes for various bits and pieces of data (e.g. the index, the data, the metadata) need to be ordered in a way that enables durability; for example: data should be written to the file before its metadata is updated, because if metadata is updated first and the system crashes, it will point to garbage data, which is worse than having data written without any metadata pointing to it.
This is understandably the slowest method, and it is available in virtually all file systems.
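The ordering rule above can be sketched at the application level. This is a minimal illustration, not MongoDB's or any file system's actual code; the `durable_append` helper and the `.idx` side file are made up for the example. The point is simply that the data is written and synced *before* the metadata that refers to it:

```python
import os

def durable_append(path: str, record: bytes) -> None:
    """Append a record, then update a side "index" file, syncing after
    each step so a crash can never leave the index pointing at data
    that was never written."""
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        os.write(fd, record)
        os.fsync(fd)            # data reaches stable storage first...
    finally:
        os.close(fd)

    idx = os.open(path + ".idx", os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        os.write(idx, b"%d\n" % len(record))
        os.fsync(idx)           # ...then the metadata that points to it
    finally:
        os.close(idx)
```

If the machine dies between the two syncs, the worst case is data with no index entry - orphaned but harmless, which is exactly the failure mode the ordering is chosen to produce.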
Partially synchronous writes
This method is really a subtype of the "synchronous" methods, where the requirements are relaxed a bit so that, for example, metadata is still always written synchronously but data is written asynchronously (i.e. put in the write cache and written later when the system is idle, if applicable).
This is a compromise, and it's implemented as the BSD "noasync" mount option for UFS. It is excellent, for example, with databases, which create large files and do their own transaction management within those files, so the operating system has no metadata operations to write synchronously.
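The same idea can be exploited from user space. Here is a minimal sketch, assuming a hypothetical database that preallocates its file up front (the `preallocate` and `write_page` helpers are invented for illustration): once the file's size is fixed, later page writes change no file-system metadata, so `fdatasync()` has only data to flush:

```python
import os

PAGE_SIZE = 4096

def preallocate(path: str, size: int) -> int:
    """Create the file at its full size up front, so subsequent page
    writes never extend it (i.e. never touch file-system metadata)."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    os.ftruncate(fd, size)
    os.fsync(fd)                # commit the size and allocation once
    return fd

def write_page(fd: int, page_no: int, page: bytes) -> None:
    """Overwrite one page in place and sync only the data."""
    os.pwrite(fd, page, page_no * PAGE_SIZE)
    # fdatasync() may skip the metadata flush entirely, because the
    # file size and layout never change after preallocation.
    os.fdatasync(fd)
```

This is roughly how databases that "do their own transaction management within the files" keep synchronous writes cheap.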
Logging

Logging adds a twist to the file system organization. Unlike all other methods, which have a more-or-less rigid structure with exactly specified on-disk locations for certain types of metadata, logging considers the drive a mostly sequential medium and tries to write everything at the "current position" on the drive, avoiding expensive seeks. The exact durability semantics here are usually similar to "fully synchronous", but the performance is better because eliminating seeks enormously improves performance on mechanical drives. The downside is that there are few structures located at predefined places, and the data from multiple files can be unpredictably scattered across the drive(s).
This method is implemented in the old BSD LFS, and in newer times, in ZFS.
Journalling

This method usually combines logging with the "partially synchronous" method. Metadata operations are written to a smallish log, making them faster than usual, and the data is written directly to the rigid file system structure. When the machine is idle or when error recovery is required, the journalled operations are "replayed" and put to permanent storage in the rigid file system structure.

Because of its good performance vs durability vs convenience compromise, this method is used in most of today's general-purpose file systems, for example ext3 and NTFS.
There is a subtype of journalling where both data and metadata are journalled (FreeBSD gjournal, ext3 data=journal mode). It is even easier to implement, but requires a large journal (several GB, depending on drive speeds) and incurs a performance penalty because all data is eventually written out twice.
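A toy write-ahead journal shows the shape of the technique (the `Journal` class and its JSON record format are invented for this sketch; real journals are binary and far more careful): an operation is durable once it is appended and synced to the log, and the log is truncated once everything in it has been applied to the main structure:

```python
import json
import os

class Journal:
    """Toy write-ahead journal. Operations become durable when synced
    to the log; replay() recovers them after a crash; checkpoint()
    empties the log once the main structure is safely updated."""

    def __init__(self, log_path: str):
        self.log_path = log_path

    def log(self, op: dict) -> None:
        fd = os.open(self.log_path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
        try:
            os.write(fd, (json.dumps(op) + "\n").encode())
            os.fsync(fd)        # the operation is durable from here on
        finally:
            os.close(fd)

    def replay(self) -> list:
        """On recovery (or when idle), re-read the logged operations
        so they can be applied to the rigid main structure."""
        try:
            with open(self.log_path) as f:
                return [json.loads(line) for line in f]
        except FileNotFoundError:
            return []

    def checkpoint(self) -> None:
        """All logged operations have been applied: empty the log."""
        fd = os.open(self.log_path, os.O_WRONLY | os.O_TRUNC)
        os.fsync(fd)
        os.close(fd)
```

The data=journal subtype corresponds to logging the full page contents in `log()` as well, which is why it doubles the write volume.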
Soft-updates

Soft-updates is the clever oddball among the various methods. It attempts to do everything asynchronously (cached), but introduces rigid dependencies between the various bits and pieces of data. For example, among a huge number of other things, it guarantees in software that metadata will never be written before the data it points to (so that it never ends up pointing to garbage). One interesting consequence is that small, short-lived temporary files may never touch the physical drive(s) at all: unless manually synced by the application, the data is never written, so the metadata is never committed, and when the file is deleted the in-memory structures are simply discarded and the parent metadata is never updated.
In many ways this is the ideal combination - it doesn't require a log, has strong algorithmic guarantees, and is almost as fast as fully asynchronous writes - but it's so complex to implement that only one common file system has ever implemented it: the BSD UFS.
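A toy model of that dependency tracking (the `Buffer` class here is invented for illustration; real soft-updates tracks far finer-grained dependencies inside the kernel, and breaks cycles that this sketch ignores): a cached buffer may only be flushed after everything it depends on has been flushed, which is exactly how "data before metadata" is enforced without any synchronous waiting:

```python
class Buffer:
    """A dirty in-memory buffer that may only be written out after
    the buffers it depends on (e.g. metadata depends on its data)."""

    def __init__(self, name: str, depends_on=()):
        self.name = name
        self.depends_on = list(depends_on)
        self.flushed = False

    def flush(self, write_order: list) -> None:
        if self.flushed:
            return
        for dep in self.depends_on:
            dep.flush(write_order)   # data is forced out first...
        write_order.append(self.name)  # ...then the dependent metadata
        self.flushed = True
```

A temporary file that is deleted before any flush simply has its buffers dropped, dependencies and all - which is why such files may never touch the disk.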
Putting it all together
The key insight here is that these techniques are exactly the same whether it's a file system writing to physical devices (drives) or an application writing to a file system, since both can issue a "sync" command guaranteeing data safety. When issued a "sync" command, all non-broken file systems that implement some kind of "smart" durability algorithm will forgo their cleverness and make sure the data is written and synced to the drives before returning.
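From the application side, this is the classic durable-create pattern (the `create_durably` helper is a made-up name for this sketch): `fsync()` on the file makes its data durable, and syncing the parent directory makes the new directory entry durable too - and the file system, whatever clever scheme it uses internally, must honour both all the way down to the drive:

```python
import os

def create_durably(path: str, data: bytes) -> None:
    """Create a file whose existence and contents survive a crash."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)            # the file's data is now on stable storage
    finally:
        os.close(fd)

    # The new directory entry is metadata of the *parent directory*,
    # so the directory must be synced as well (Unix-specific).
    dfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)
```

Any database claiming single-server durability ultimately does some variant of this - which is why "it's hard to implement" is a weak argument.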
It is also important to notice that all of these methods are more than 10 years old (soft-updates is the youngest), and that the problem of guaranteeing consistency and durability on hard drives appears to be pretty much solved, requiring no deep new thinking.
What does all this have to do with MongoDB?
Depending on the users, it might have no connection with MongoDB at all. Users who don't require absolute durability will be perfectly fine.
What is more pertinent here is that the claims about single-server durability being overrated and hard to implement are, well, not exactly true. Replication is useful and can go a long way toward some aspects of durability, but since it is also not implemented synchronously (ACID-like) in MongoDB, it's not a 100% solution - some operations can be lost. Also, in data centres, problems like water damage and fire tend to affect all or most of the servers in a single location, and creating distributed replicated systems between distant data centres is neither cheap nor popular (because of the latency, if not other problems).
In the multi-server replication case, MongoDB as it currently stands might as well power the next Facebook without major issues - users will not be very angry if an occasional picture of a cute cat or a status message is lost. Users will, on the other hand, get angry if their web shop transaction (i.e. where money is involved) is lost, their e-mail doesn't arrive, or they get lost while using on-line navigation.
In the end, it's about knowing the benefits and weaknesses of the software you use.