Tuesday, February 1, 2011

MongoDB vs. Clustrix Comparison: Part 2

Introduction

In Part 1 of my comparison, I ran some performance benchmarks to establish that relational systems can scale performance. In this post I would like to focus more on the High Availability and Fault Tolerance aspects of the two systems. The post will go over each system's approach and what it means for fault tolerance and availability. I will also put the claims to the test: I'm going to fail a node by pulling its power and see what happens.

A Primer on Clustrix Data Distribution


Clustrix takes a fine-grained approach to data distribution. The following graphic illustrates the basic concepts and terminology used by our system. Notice that unlike MongoDB (and many other systems, for that matter), Clustrix applies a per-index distribution strategy.

There are many interesting implications for query evaluation and execution in our model, and the topic deserves its own set of posts. For the curious, you can get a brief introduction to our evaluation model from our white paper on the subject. For this post, I'm going to stick to how our model applies to fault tolerance and availability.

You can find documentation for MongoDB's distribution approach on their website. In brief, MongoDB chooses a single distribution (shard) key for each collection, and secondary indexes are co-located with the shard that holds their documents.
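
For illustration, here's roughly what choosing that single shard key looks like with the Python driver (the connection string, database, collection, and field names below are placeholders I made up):

    from pymongo import MongoClient, ASCENDING

    # Connect through a mongos router (address is a placeholder).
    client = MongoClient("mongodb://mongos.example.com:27017")

    # One distribution (shard) key per collection.
    client.admin.command("enableSharding", "appdb")
    client.admin.command("shardCollection", "appdb.events",
                         key={"user_id": ASCENDING})

    # A secondary index is built per shard, co-located with that shard's
    # documents -- it gets no distribution key of its own.
    client.appdb.events.create_index([("created_at", ASCENDING)])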

Clustrix Fault Tolerance and Availability Demo
Fault Tolerance

Both Clustrix and MongoDB rely on replicas for fault tolerance. Losing a node means losing one copy of some data, and another copy can be found elsewhere in the system. The MongoDB team put together a good set of documentation describing their replication model. Perhaps the most salient difference between the two approaches is the granularity of data distribution. The unit of recovery on Clustrix is the replica (a small portion of an index), while the unit of recovery in MongoDB is a full instance (member) of a replica set.

For Clustrix, this means that the reprotection operation happens in a many-to-many fashion: many nodes copy small portions of data from their disks to many other nodes and disks. The advantages of this approach are:

  • No single disk in the system becomes overloaded with writes or reads
  • No single node hotspot for driving the reprotect work
  • Incremental progress toward full protection
  • Independent replica factors for each index (e.g. primary key 3x, indexes 2x)
  • Automatic reprotection which doesn't require operator intervention
  • All replicas are always consistent

One of the interesting aspects of the system is the complete automation of every recovery task. It's built in; I don't have to do anything to make it happen. So if I have a 10-node system and a node fails, in an hour or so I will have a completely protected 9-node system without any operator intervention at all. When the 10th node comes back, the system will simply notice a distribution imbalance and start moving data back onto that node.
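
To make the many-to-many reprotection idea concrete, here is a toy sketch in Python (purely illustrative, not Clustrix code): every slice that lost a replica on the failed node is re-copied from one of its surviving replicas, with the copies fanned out across many source and target nodes and no operator in the loop.

    import itertools

    def plan_reprotect(slices, failed_node, live_nodes):
        """Toy planner: for each slice that lost a replica on the failed node,
        pick a surviving copy as the source and spread the new copies across
        many target nodes, so no single node or disk does all the work."""
        targets = itertools.cycle(live_nodes)
        plan = []
        for slice_id, replicas in slices.items():
            if failed_node not in replicas:
                continue                          # this slice lost nothing
            survivors = [n for n in replicas if n != failed_node]
            target = next(t for t in targets if t not in replicas)
            plan.append((slice_id, survivors[0], target))
        return plan

    # Ten two-way-replicated slices on nodes 1-5; node 3 just lost power.
    slices = {i: [(i % 5) + 1, ((i + 2) % 5) + 1] for i in range(10)}
    for slice_id, src, dst in plan_reprotect(slices, 3, [1, 2, 4, 5]):
        print(f"slice {slice_id}: copy from node {src} to node {dst}")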

While the Replica Sets feature in MongoDB is nicer than replication in, say, MySQL, it's still highly manual. So in contrast with the above list for Clustrix, for MongoDB we have:
  • Manual intervention to recover from failure
  • The data is moved in a one-to-one fashion
  • All data within a Replica Set has the same protection factor
  • Failures can lead to inconsistency

Availability

Both systems rely on having multiple copies of data for Availability. I've seen a lot of interesting discussion recently about the CAP theorem and what it means for real-world distributed database systems. It's another deep topic which really deserves its own set of posts, so I'll simply link to a couple posts on the subject which I find interesting and illuminating:
At Clustrix, we think that Consistency, Availability, and Performance are much more important than Partition tolerance. Within a cluster, Clustrix remains available in the face of node loss while maintaining strong consistency guarantees. But we do require that more than half of the nodes in the cluster's group membership are online before accepting any user requests. So a cluster provides fully ACID-compliant transactional semantics while keeping a high level of performance, but you need a majority of the nodes online.
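
As a toy illustration of that majority rule (my numbers, not a statement about any particular deployment):

    def has_quorum(online, total):
        """Accept user requests only when more than half the membership is up."""
        return online > total // 2

    print(has_quorum(4, 6))   # True: a 6-node cluster can lose 2 nodes
    print(has_quorum(3, 6))   # False: exactly half is not a majority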

However, Clustrix also offers a lower level of consistency in the form of asynchronous replication between clusters. So if you want to set up a disaster recovery target in another physical location over a high-latency link, we can accommodate that mode. It simply means that your backup cluster may be out of date by some number of transactions.

MongoDB has relaxed consistency all around. The replica set itself uses an asynchronous replication model. The MongoDB guys are upfront about the kinds of anomalies they expose: the end user gets the equivalent of read-uncommitted isolation. Mongo's claim is that they do this because (1) they can achieve higher performance, and (2) "merging back old operations later, after another node has accepted writes, is a hard problem." Yes, distributed protocols are a hard problem, but that doesn't mean you should punt on them.

Availability Continued

There's also a more nuanced side to availability. One of the principal design goals of Clustrix has been to aim for lock-free operation whenever possible. We have Multi-Version Concurrency Control (MVCC) deeply ingrained in the system. It allows a transaction to see a consistent snapshot of the database without interfering with writes, so a read in our system will not block a write.
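
A stripped-down sketch of the MVCC idea (illustrative only, not Clustrix internals): each key keeps a chain of versions stamped with a commit id, and a reader only sees versions at or below its snapshot, so writers never wait for it.

    class MVCCStore:
        """Toy MVCC key-value store: writes append versions, reads use a snapshot."""
        def __init__(self):
            self.versions = {}            # key -> list of (commit_id, value)
            self.commit_id = 0

        def write(self, key, value):
            self.commit_id += 1           # never blocked by readers
            self.versions.setdefault(key, []).append((self.commit_id, value))

        def snapshot(self):
            return self.commit_id         # the reader's consistent view of "now"

        def read(self, key, snap):
            visible = [v for cid, v in self.versions.get(key, []) if cid <= snap]
            return visible[-1] if visible else None

    store = MVCCStore()
    store.write("x", "old")
    snap = store.snapshot()                   # long-running read starts here
    store.write("x", "new")                   # concurrent write is not blocked
    print(store.read("x", snap))              # -> old (consistent snapshot)
    print(store.read("x", store.snapshot()))  # -> new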

Building on top of MVCC, Clustrix has implemented a transactionally safe, lockless, and fully consistent method for moving data within the cluster without blocking any writes to that data. All of this happens completely automatically; no administrator intervention is required. So when the Rebalancer decides to move a replica from Node 1 to Node 3, the replica can continue to take writes. We have a mechanism to sync changes from the source replica to the target replica without limiting the replica's availability.
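
Conceptually (again, a sketch of the technique, not the actual Rebalancer code), the move is roughly: copy a snapshot of the replica while logging any writes that arrive mid-copy, replay that log on the target, and only then cut over.

    class Replica:
        """Toy replica: a dict of rows plus a hook fired on every write."""
        def __init__(self, rows=None):
            self.rows = dict(rows or {})
            self.on_write = None

        def write(self, key, value):
            self.rows[key] = value
            if self.on_write:
                self.on_write((key, value))

    def move_replica(source, target):
        """Copy a snapshot, then replay writes that arrived during the copy.
        The source keeps taking writes the whole time -- no read lock."""
        change_log = []
        source.on_write = change_log.append          # writes arriving now are logged
        for key, value in list(source.rows.items()):
            target.rows[key] = value                 # bulk-copy the snapshot
        while change_log:
            key, value = change_log.pop(0)           # catch up from the log
            target.rows[key] = value
        source.on_write = None                       # cut over (bookkeeping omitted)

    src, dst = Replica({"a": 1}), Replica()
    src.write("b", 2)
    move_replica(src, dst)
    print(dst.rows)                                  # -> {'a': 1, 'b': 2}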

Compare that to what many other systems do: read-lock the source to get a consistent view for a replica copy. You end up locking out writers for the duration of the data copy, so while your data is available for reads, it is not available for writes.

After a node failure (or to be more precise, a replica failure within a set), MongoDB advocates the following approach:
  1. Quiesce the master (read lock)
  2. Flush dirty buffers to disk (fsync)
  3. Take an LVM snapshot of the resulting files
  4. Unlock the master
  5. Move the data files over to the slave
  6. Let the slave catch up from the snapshot
So a couple of points: (a) MongoDB is not available for writes during steps 1 and 2, and (b) it's a highly manual process. It reminds me very much of the MySQL best practices for setting up a slave.
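
For reference, steps 1, 2, and 4 map onto MongoDB's fsync command; a rough sketch through the Python driver (the host name is a placeholder) looks something like:

    from pymongo import MongoClient

    client = MongoClient("mongodb://primary.example.com:27017")

    # Steps 1-2: flush dirty buffers and block writes until unlocked.
    client.admin.command("fsync", lock=True)

    # ... step 3: take the LVM/filesystem snapshot outside of MongoDB ...

    # Step 4: release the lock (on recent servers this is its own command).
    client.admin.command("fsyncUnlock")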

Conclusion

I've seen a lot of heated debates about Consistency, Availability, Performance, and Fault Tolerance. These issues are deeply interconnected, and it's difficult to write about any of them in isolation. Clustrix maintains a high level of performance without sacrificing consistency or a very high degree of availability. I know that it's possible to build such a system because we actually built it. And you shouldn't sacrifice these features in your application just because you believe it's the only way to achieve good performance.

10 comments:

  1. Your comments about replica sets are completely wrong. All failover is 100% automated. So your 6 step process is actually a 0 step process.

  2. I think your analysis of those two cluster availability solutions is good (unlike Sammys511), but I think there are other, much better options for fault tolerance (if you can even call a cluster a fault-tolerant option). Two servers with lockstep middleware and a built-into-the-hardware reporting/isolation/resolution service model do a better job with less hassle and IT oversight.
    Just to be transparent I work on Stratus Technologies ftservers- which is how I found your post.

  3. It seems to me that Clustrix has chosen Consistency and Partition Tolerance from CAP. Consistency because writes are guaranteed to read back the same and Partition Tolerance because it can still operate as long as there is a quorum of replicas. On the other hand it does not provide Availability in the CAP sense, once a quorum is lost the system does not function properly. Is this a correct assessment in your opinion?

  4. Just wanted to correct a few possible misunderstandings:

    You assert that failover with MongoDB replica sets is manual. This isn't true. Replica sets have been designed to automate failover. You can configure a replica set in a matter of minutes and verify this if you like.

    You state that "MongoDB has relaxed consistency all around." This is also untrue. MongoDB supports strongly consistent reads. If you're only communicating with a replica set primary, you're guaranteed to be consistent. If you plan on reading from secondary nodes as well, you have a few options. You might allow for inconsistent reads. Or, if you need strong consistency all around, then when writing to a replica set, you can alter the desired consistency level by specifying the minimum number of nodes to replicate to before returning.
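
    For example, with the Python driver that could look something like this (the database and collection names are just placeholders):

        from pymongo import MongoClient, WriteConcern

        client = MongoClient("mongodb://rs0.example.com:27017", replicaSet="rs0")

        # Require acknowledgment from at least two replica set members before
        # the write call returns, instead of the relaxed default.
        events = client.appdb.get_collection(
            "events", write_concern=WriteConcern(w=2, wtimeout=5000))
        events.insert_one({"user_id": 42, "action": "login"})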

    For future comparisons, please feel free to send any questions or drafts to the MongoDB user list (http://groups.google.com/group/mongodb-user), where we'd be happy to clarify any points.

  5. Kyle, you said the same thing that Meghan said on the Hacker News article. And it has the same flaw. Yes, there's automatic failover, but the recovery is *completely manual*.

    I'll just post the link to my response there:
    http://news.ycombinator.com/item?id=2171663

    For consistency, what happens when I read an update that succeeds on the master but later fails on the slaves?

  6. How do we download Clustrix to try it out? Do you have to run it on your specialized hardware? Do you have any plans to compare your product to Oracle? Comparing Clustrix to Oracle seems like a more worthwhile endeavor since it shares the same data model as Clustrix. People (mostly) do not choose a NoSQL database just because it is hip and trendy; they do so because that particular NoSQL database's data model fits their problem best. That is why there are so many NoSQL databases and why no one ever agrees which one is the best. There is no best one. There are different ones. Every problem has a different solution and you have to weigh the pros and cons of each DB, whether it is a NoSQL DB or a SQL DB, to determine which one suits your needs. I feel that comparing Clustrix to MongoDB is useless since they do not share the same data model. And if you continue down this path, you are going to have to compare yourself to every other NoSQL database since they all are meant to solve different problems.

  7. The data model really has no relevance to this discussion. It's a straw man argument. Nowhere in my post did I even mention the query language or model.

    Both Clustrix and MongoDB claim to be fault-tolerant and highly available distributed databases. I wrote up a survey of the two approaches.

  8. I was commenting on the entire series of posts comparing Clustrix to MongoDB. I figured I would comment on the most recent post as opposed to commenting on the previous post again. Thanks for answering my questions about how we can get Clustrix to try out.

  9. Can you elaborate more on partitioning at the index level? I don't quite see how it is different from, say, MongoDB's. Maybe I just need more detail.

  10. Let's say you create a collection in mongo with fields a and b. You choose "a" as the shard key, but you also want an index over "b." In mongo, every shard is going to have its own index over b, covering only the documents that live on that shard. For example, (1,1) and (1,2) are on shard X and (2,3) and (2,4) are on shard Y. So shard X's index on b contains the values 1 and 2, and shard Y's index on b contains the values 3 and 4.

    If you want to search only by "b" without including a constraint on "a", then you end up broadcasting the request to all the shards because any one of them can contain the value you are looking for.

    We call this a colocated index (i.e. it's located with the distribution/shard key).

    In Clustrix, the two indexes are independently placed. So when you have a query which constrains only by "b", we know which nodes/slices can contain the values of "b" that we're looking for. We can avoid broadcasts in many important cases.
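
    A tiny sketch of that example in Python (made-up data, just to show which shards a query has to touch):

        # Documents are (a, b) pairs, sharded on "a"; each shard indexes its own b values.
        shards = {
            "X": [(1, 1), (1, 2)],   # a = 1 lives here, so its b-index covers {1, 2}
            "Y": [(2, 3), (2, 4)],   # a = 2 lives here, so its b-index covers {3, 4}
        }

        def shards_to_ask(query):
            if "a" in query:                 # shard key present: route to one shard
                return ["X" if query["a"] == 1 else "Y"]
            return list(shards)              # b-only query: broadcast to every shard

        print(shards_to_ask({"a": 2, "b": 3}))   # -> ['Y']
        print(shards_to_ask({"b": 3}))           # -> ['X', 'Y']  (scatter-gather)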
