Friday, November 4, 2011

Distributed Database Architectures: Distributed Storage / Centralized Compute

In my previous post I wrote about shared disk architectures and the problems they introduce. It's common to see comparisons between shared disk and shared nothing architectures, but that distinction is too coarse to capture the differences between various shared nothing approaches.

Instead, I'm going to characterize the various "shared-nothing" style systems by their query evaluation architectures. Most systems fall into one of the following buckets:
  • Centralized compute
  • Limited distributed compute
  • Fully distributed compute

Centralized Compute: MySQL Cluster

MySQL Cluster splits query servicing across two basic node roles: a compute role and a storage/data role. The compute node is the front end which takes in the query, plans it, and executes it. The compute node communicates with the storage nodes over the network to fetch any data relevant to the query.


In the distributed storage model, data is no longer shared between the nodes at the page level. Instead, the storage nodes expose a higher level API which allows the compute node to fetch row ranges based on the available access paths (i.e. indexes).
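
To make the division of labor concrete, here's a minimal sketch of what such a storage-node API might look like. This is purely illustrative Python; MySQL Cluster's actual interface is the NDB API, and the names below (StorageNode, ComputeNode, range_scan) are made up for the example.

class StorageNode:
    """Owns a subset of the table's rows and serves index-based requests."""
    def __init__(self, rows):
        self.rows = rows  # e.g. a list of dicts for the rows this node stores

    def range_scan(self, column, value):
        # The storage node applies the indexed predicate locally and ships
        # only the matching rows back to the compute node.
        return [r for r in self.rows if r[column] == value]

class ComputeNode:
    """Parses, plans, and executes the query; holds no data of its own."""
    def __init__(self, storage_nodes):
        self.storage_nodes = storage_nodes

    def lookup(self, column, value):
        results = []
        for node in self.storage_nodes:   # fan out to every responsible node
            results.extend(node.range_scan(column, value))
        return results

nodes = [StorageNode([{"id": 1, "acol": 1}]), StorageNode([{"id": 2, "acol": 2}])]
print(ComputeNode(nodes).lookup("acol", 1))   # [{'id': 1, 'acol': 1}]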

In such a system, the storage-level locks associated with the data are managed exclusively by the storage node itself. A compute node does not cache any data; instead, it always asks the set of storage nodes responsible for the rows it needs. Revisiting the problem list from the shared disk post:

  • Cache coherence overhead -- solved, since each storage node has exclusive ownership of its data
  • Extensive data movement -- still a problem
  • Centralized query evaluation -- still a problem

MySQL Query Evaluation in Action

Consider the following example:

SELECT count(*) FROM mytable WHERE acol = 1 and bcol = 2

Assumptions:
  • an index over acol
  • 10% of the rows in the table match acol = 1
  • 3% of the rows match acol = 1 and bcol = 2
  • total table size 1 Billion rows

The arrows in the diagram represent the flow of data through the system. As you can see, most of the query evaluation in this example is done by a single compute node. The storage nodes ship the 100 million rows matching acol = 1 across the network, and a single compute node performs the additional filtering on bcol and the aggregate count.
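
Working the assumptions through (a quick back-of-the-envelope in Python; the row counts come straight from the list above):

total_rows       = 1_000_000_000
acol_selectivity = 0.10          # fraction of rows matching acol = 1
both_selectivity = 0.03          # fraction matching acol = 1 AND bcol = 2

rows_shipped  = int(total_rows * acol_selectivity)   # fetched via the acol index
rows_matching = int(total_rows * both_selectivity)   # survive the bcol filter

print(f"rows moved from storage nodes to the compute node: {rows_shipped:,}")
print(f"rows that actually match both predicates: {rows_matching:,}")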

It's an improvement over a shared disk system, but it still has some serious limitations. Such a system can be well suited for simple key access (i.e. queries which touch a few specific rows), but any more complexity will generally result in poor performance.

As with the shared disk system, adding more nodes will not help improve single query execution, and queries which operate over large volumes of data have the potential to saturate the message bus between the nodes.

Distributed Database Architectures: Shared Disk

One of the most common questions I get about Clustrix is "How is Clustrix different than Database X?" Depending on who asks the question, Database X tends to be anything from Oracle RAC to MySQL Cluster. So I decided to put together a primer on different types of Distributed Database Architectures, and what each one means for real world applications.

Shared Disk Databases

Shared disk databases fall into a general category where multiple database instances share some physical storage resource. With a shared disk architecture, multiple nodes coordinate access to a shared storage system at a block level.

Adding clustering to a stand alone database through a shared disk cache.

Several of the older generation databases fall into this category, including Oracle RAC, IBM DB2 pureScale, Sybase, and others.

These systems all started out as single instance databases. The easiest way to add clustering to an existing database stack would be to share the storage system between multiple independent nodes. All of these databases already did paged disk access through a buffer manager cache. Make the caches talk to each other to manage coherence, and bam, you have a distributed database!

Most of the database stack remains the same. You have the same planner, the same optimizer, the same query execution engine. It's a low risk way to extend the database to multiple nodes. You can add more processing power to the database, keep data access transparency in the application layer, and get some amount of fault tolerance.

But such systems also have serious limitations which prevent them from gaining widespread adoption, especially for applications which require scale. They simply don't scale for most workloads (see Scaling with Sharding below), and they are extremely complex to administer.

Almost every new distributed database system built in the last 10 years has embraced a different architecture, mainly because the shared disk model has the following problems:
  • Cache coherence overhead
  • Extensive data movement across the cluster
  • Centralized query execution (only a single node participates in query resolution)

Cache Coherence and Page Contention

Let's assume the following workload as the first example:
DB 1: UPDATE mytable SET mycol = mycol + 1 WHERE mykey = X
DB 2: UPDATE mytable SET mycol = mycol + 1 WHERE mykey = Y

Now let's further assume that some of the keys for that update statement will share a page on disk. So we don't have contention on the same key, but we do have some contention on the same disk page.

Page contention in shared-disk systems.

Both nodes receive similar update queries. Both nodes must update the same physical page. But in order to do that safely, they must ensure that all copies of the page are consistent across the cluster.

Managing such consistency comes at a cost. Consider some of the requirements which have to be satisfied:
  • Every node must acquire a page lock in order to read or write from the page. In a clustered environment, such locking results in communication between nodes.
  • If a node acquires a write lock on a page and modifies it, it must notify all other nodes to invalidate their cached copies.
  • If a node gets a page invalidation request, it must re-fetch the latest copy of the page, which also results in more network communication.
  • Each node caches the contents of the entire data set, which means the effective cache size of the cluster is only as large as the cache on a single node. Adding more nodes does not scale cache efficiency.

Now imagine adding more nodes to your cluster. You end up with a system which has non-linear scaling in message complexity and data movement.  It's common to see such systems struggle beyond 4 nodes on typical OLTP workloads.
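
To get a feel for how the messaging cost grows, here's a toy model in Python. The per-write message counts are hypothetical and deliberately simplified (real protocols such as Oracle's Cache Fusion are far more sophisticated), but the shape of the curve is the point:

def messages_per_contended_write(num_nodes):
    # One lock request/grant exchange with the lock manager, plus an
    # invalidation sent to every other node caching the page, plus a
    # re-fetch by each of those nodes before it touches the page again.
    lock_traffic  = 2
    invalidations = num_nodes - 1
    refetches     = num_nodes - 1
    return lock_traffic + invalidations + refetches

for n in (2, 4, 8, 16):
    print(f"{n:>2} nodes: ~{messages_per_contended_write(n)} messages per contended page write")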

Data Movement

Consider another example which poses problems for shared disk systems:

SELECT sum(mycol) FROM mytable WHERE something = x

Let's assume that mytable contains 1 Billion rows and that something = x leaves us with 1,000 rows. With a shared disk system, if the predicate results in a table scan, the entire contents of the 1B row table must be transferred to the node evaluating the query!

And from the previous example, we see that it's not just a matter of data movement across the SAN infrastructure. The system must also maintain cache coherence, which means lots of cache management traffic between all the nodes. Such queries on large data sets can bring the whole cluster to its knees.
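
Putting rough numbers on it (Python; the 100-byte average row size is my own assumption, since the example only fixes the row counts):

rows_in_table = 1_000_000_000
rows_needed   = 1_000
row_bytes     = 100                      # assumed average row size

shipped_gb = rows_in_table * row_bytes / 2**30
needed_kb  = rows_needed * row_bytes / 2**10
print(f"data pulled to the evaluating node: ~{shipped_gb:,.0f} GB")
print(f"data actually needed for the answer: ~{needed_kb:,.0f} KB")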

Centralized Query Resolution

It's also worthwhile to point out that only a single node within the cluster can participate in query evaluation. So even if you throw more nodes at your system, it's not generally going to help you speed up that slow query. In fact, adding more nodes may slow down the system as the cost of managing cache coherence increases in a larger cluster.

Scaling with Sharding

It's possible to scale with shared disk, but it requires an extensive engineering effort. The application must be written (a) to keep affinity of data accesses to nodes and (b) to keep each shard within its own data silo. You lose transparency of data access at the application level. The app can no longer send any query to just any node without risking a catastrophic performance problem.

While you end up with a very expensive means of implementing sharding, there is some advantage to this approach over regular sharding. In case of node failure, you can more easily bring in a hot standby without having to keep a dedicated slave for each shard. But in practice, getting fault tolerance right with a SAN based shared disk infrastructure is no easy task -- just ask anyone who manages SANs for a living.



    Monday, February 28, 2011

    Profile Driven Performance Optimization

    Recently, one of our support engineers noticed that a customer cluster got a little sluggish during a routine maintenance operation. In looking back at historical data, we noticed that the cluster saw a decrease in transaction throughput. The system should have done its cleanup in the background at a low priority, but that didn't happen. I thought it would be interesting to describe our process for addressing a performance problem like this one. I will also share the tools we developed and used.

    Reproduce, Measure, and Automate

    The symptom our support engineer noticed was a "sluggish" system. More accurately, we saw that query latencies increased and overall system throughput dropped. So our first step was to reproduce the problem under some approximation of the customer's workload. It was a bit tricky to get all of the prerequisite conditions just right, but we finally landed on the right set of circumstances.

    Once we reproduced the issue, the next step was to automate the test to ensure that we could consistently demonstrate the problem. It boiled down to using one of our standard performance harnesses with a couple of tweaks.

    The Profile

    Our most commonly used profiling tool is based on the Linux oprofile infrastructure. But we had to modify it to better suit our system. Why? Because substantial portions of our codebase are written in an event driven/continuation passing style programming model. Call graph reporting in oprofile is highly dependent on the stack, and the C stack often does not contain enough information to provide a valuable analysis in our system.

    To get around this problem, we modified oprofile to gather additional information from the system. The key concept was adding a method for tagging a series of continuation calls and getting oprofile to sample these tags. The tag itself includes enough information to map the executing code to a logical system module hierarchy. The screen shot below shows an example output from one of our reporting tools.
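
    To illustrate the idea (and only the idea -- our implementation lives in the modified oprofile daemon and our C runtime, and none of the names below are real), here is a toy Python model of tag-based sample attribution:

    import collections
    import contextlib

    samples = collections.Counter()
    current_tag = ["idle"]          # the tag the "profiler" would sample

    @contextlib.contextmanager
    def module_tag(tag):
        # Work scheduled under this tag stays attributed to the logical module
        # hierarchy, even though the C stack at sample time would only show
        # the event loop driving the continuations.
        previous = current_tag[0]
        current_tag[0] = tag
        try:
            yield
        finally:
            current_tag[0] = previous

    def take_sample():
        # Stand-in for the profiler interrupt: record the active tag.
        samples[current_tag[0]] += 1

    with module_tag("exec/storage/crc32"):
        for _ in range(63):
            take_sample()
    with module_tag("exec/storage/btree"):
        for _ in range(37):
            take_sample()

    print(samples.most_common())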


    Coming back to our original problem, the screen shot shows the output of the profile analysis tool during the dip in performance. The tool allows us to isolate the report to a specific processor. In this case we're looking at the output for our management core.

    Interpreting the results we see that:
    • The execution engine dominates performance, and most of that is in the storage engine
    • The storage engine spends 63% of all its cycles on CRC32 calculations
    • Our particular maintenance task was erroneously bound to the management core
    • The task should have a lower priority

    CRC Performance

    The root cause of the performance degradation was a scheduling issue. However, it caused me to take a closer look at our crc algorithm. We use the crc implementation from zlib, and it should be reasonably well optimized. Whipping up a quick benchmark, I saw that the zlib implementation takes about 41us to checksum a 32k block. In the test, we were reading about 224MB/s per node. That's 293ms of checksums every second. No wonder we saw a performance drop.
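
    For anyone who wants to reproduce the ballpark, here's roughly the same micro-benchmark in Python form (it calls the same zlib CRC under the hood; absolute numbers will of course differ from my C harness and from your hardware):

    import os
    import time
    import zlib

    block = os.urandom(32 * 1024)          # a 32k block, like our checksummed pages
    iterations = 20000

    start = time.perf_counter()
    for _ in range(iterations):
        zlib.crc32(block)
    elapsed = time.perf_counter() - start

    per_block_us = elapsed / iterations * 1e6
    blocks_per_sec = (224 * 2**20) / (32 * 1024)   # 224MB/s read rate from the test
    print(f"~{per_block_us:.1f} us per 32k block")
    print(f"~{per_block_us * blocks_per_sec / 1000:.0f} ms of checksumming per second at 224MB/s")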

    I looked around and found a paper from Intel describing their slice-by-eight crc approach. A quick benchmark revealed that the same 32k checksum could be done in 27us -- a 37% improvement over zlib's version.

    With Nehalem-based processors, Intel introduced an SSE instruction for hardware based crc computation. I haven't had a chance to conduct my own benchmarks, but published results suggest that we could get a 2-3x speedup over slice-by-eight.

    Source Code

    I'm releasing the source for the tools referenced in this post. The user space oprofile daemon replacement requires our database runtime which I am not making public.

      Monday, February 7, 2011

      Clustrix as a Document Store: Blending SQL and JSON Documents

      Many responses to my previous post claimed that my comparisons to MongoDB were unfair because MongoDB is a "document store" and Clustrix is a SQL RDBMS.  Somehow that distinction made almost any comparison invalid. In response, I spent my weekend coding up a prototype document store interface for Clustrix to demonstrate that a different data model is not in itself an architectural differentiator.

      At first I thought of just creating a different front end for Clustrix; our architectural model makes this easy. However, I quickly decided against it because in many ways it would be much more limiting than a SQL based interface (e.g. joins, flexible aggregation, subqueries, etc.).

      Instead, I extended our SQL syntax to support native operations on JSON objects. So now you can do the following:

      clustrix> create table files (id int primary key auto_increment, doc json);
      Query OK, 0 rows affected (0.04 sec)

      clustrix> insert into files (doc) values ('{"foo": {"bar": 1}, "baz": [1,2,3,4]}');
      Query OK, 1 row affected (0.00 sec)

      clustrix> select id, files.doc::foo.baz from files where files.doc::foo.bar = 1;
      +----+--------------------+
      | id | files.doc::foo.baz |                                
      +----+--------------------+
      |  1 | [1,2,3,4]          |                   
      +----+--------------------+
      1 row in set (0.00 sec)

      clustrix> create index foo_bar on files (doc::foo.bar);
      Query OK, 0 rows affected (0.08 sec)

      The database has native support for dealing with JSON documents. We're not simply storing text blobs inside of some column and getting them back. We're exposing the contents of the JSON document to the underlying planner and execution engine. Immediately, we get the following advantages:

      • Ability to do joins across document collections
      • Extremely powerful and flexible query language
      • Ability to index into JSON objects
      • Transactional semantics built-in

      Taking it a little further, I added a select clause modifier which instructs the database to return all row data as a JSON document, including fields which come from a relational column. The following example shows how the database can seamlessly join between our json data type and other data types in the system, and then return the result as a JSON object.

      clustrix> select _json_ f.doc, u.doc::username
          ->   from files f
          ->   join users u on f.doc::user_id = u.user_id
          ->  where f.doc::foo.bar = 1\G
      *************************** 1. row ***************************
      json: {"f.doc": {"foo": {"bar": 1}, "baz": [1,2,3,4]}, "u.doc::username": "sergei"}
      1 row in set (0.01 sec)

      We can continue to extend the syntax. For example, adding operators to manipulate lists within JSON. Or adding optional schema checking for contents of the JSON (i.e. something along the lines of DTD for XML). I'm sure you can think of more.

      In any case, one can build a system which combines the best characteristics of a document store with the power of SQL. Both models can coexist in the same database, allowing the developer to truly choose the data model which best suits his or her needs.

      Wednesday, February 2, 2011

      Sierra and the Clustrix Database Stack

      A lot of the responses to my previous posts criticized my choice of comparing a document database like MongoDB to a relational database like Clustrix. I tried to examine aspects of the database architecture which have nothing to do with the data model, but somehow the data model would always come up. There is a set of concerns that's common to all database systems, whether you are a document store or a relational database.

      But first, think about how your database would handle the following workload. Choose whatever data model you find best suited for my use case.

      get me all the records from foo where a = ? and b = ?
      • foo has an index over A and another index over  B.
      • A and B have non-uniform data distributions. A is 90% value "X" and B is 90% value "Y".
      • We have 1 billion records in foo.
      • The database has a choice: index A, index B, or scan and filter.
      • 50% of the queries are a = X and b = Z, the other 50% are a = Z and b = Y
      What does your database do?

      If the database always chooses index A or index B for all queries, then it ends up examining 900 million rows 50% of the time.
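
      Here's the arithmetic spelled out (Python; I'm assuming, to keep it simple, that the remaining 10% of each column is a single value, which is the worst case for a static plan):

      total = 1_000_000_000
      selectivity = {("a", "X"): 0.90, ("a", "Z"): 0.10,
                     ("b", "Y"): 0.90, ("b", "Z"): 0.10}

      def rows_examined(index_col, a_val, b_val):
          # An index scan touches every row matching its own predicate;
          # the other predicate is applied as a filter on those rows.
          val = a_val if index_col == "a" else b_val
          return int(total * selectivity[(index_col, val)])

      # The workload: 50% of queries are (a=X, b=Z), 50% are (a=Z, b=Y).
      for a_val, b_val in (("X", "Z"), ("Z", "Y")):
          via_a = rows_examined("a", a_val, b_val)
          via_b = rows_examined("b", a_val, b_val)
          best = "a" if via_a <= via_b else "b"
          print(f"a={a_val}, b={b_val}: index a scans {via_a:,}, "
                f"index b scans {via_b:,}, best choice is index {best}")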

      We'll come back to my question later. First, I will briefly describe what the database stack in Clustrix looks like. I need to set a foundation so that my next set of posts makes sense.

      To anyone who has experience with DBMS systems, the stack should look very familiar. We have fairly strict abstraction of interfaces between the various portions of the stack.


      The Protocol Handler and Query Parser are responsible for taking in user connections and translating SQL into our intermediate representation called Sierra. We can actually support multiple dialects at these two layers; the constraint is that we must be able to express the query language in Sierra. And Sierra is much more expressive than SQL. The value of Sierra is that it provides an extensive planner framework for reasoning about distributed database queries.

      So the Planner/Optimizer only accepts Sierra. It runs through a search space of possible plans and prunes them based on cost estimates. After coming up with the best plan candidate, it translates the plan into another intermediate representation used by the Distributed Compiler, which reasons out the physical execution plan for the query. Finally, we compile the query into machine code and execute it.

      As a performance optimization, we cache the compiled programs and plans. Clustrix does not need to optimize and compile every query, and you don't need to use prepared statements to get this behavior.

      Back to my question. What did you come up with? Well, I can tell you what happens in Clustrix. During the planning phase of Sierra, we examine the various statistics that the database keeps about the data distribution in our indexes. The statistics include tracking:
      • Number of distinct values across index columns
      • Quantile distributions
      • Hotlist tracking top n values within a column

      Clustrix correctly chooses the plan which will result in the least number of rows examined for all input parameters.

      Now, don't get me wrong. I am not claiming that data distribution statistics are unique to Clustrix. On the contrary, they are very common in any modern RDBMS. I'm using them as an example of a requirement that's independent of any data model, and they're actually a very important feature to have.

      Tuesday, February 1, 2011

      MongoDB vs. Clustrix Comparison: Part 2

      Introduction

      In Part 1 of my comparison, I ran some performance benchmarks to establish that relational systems can scale performance. In this post I would like to focus more on the High Availability and Fault Tolerance aspects of the two systems. The post will go over the approach of each system and what it means for fault tolerance and availability. I will also conduct a test of the claims: I'm going to fail a node by pulling power to see what happens.

      A Primer on Clustrix Data Distribution


      Clustrix has a fine grained approach to data distribution. The following graphic demonstrates the basic concepts and terminology used by our system. Notice that unlike MongoDB (and many other systems for that matter), Clustrix applies a per-index distribution strategy.

      There are many interesting implications for query evaluation and execution in our model, and the topic deserves its own set of posts. For the curious, you can get a brief introduction to our evaluation model from our white paper on the subject. For this post, I'm going to stick to how our model applies to fault tolerance and availability.

      You can find documentation for MongoDB's distribution approach on their website. In brief, MongoDB chooses a single distribution key for the collection. Indexes are co-located with the primary key shard.
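
      A toy illustration of the difference (Python; the hashing, slice counts, and layout below are made up for the example and are not Clustrix's actual distribution scheme, which the white paper describes):

      import hashlib

      def slice_for(key_tuple, num_slices=12):
          digest = hashlib.md5(repr(key_tuple).encode()).hexdigest()
          return int(digest, 16) % num_slices

      row = {"id": 42, "user_id": 7, "last_updated": "2011-02-28 12:00:00"}

      # Per-index distribution: each index is sliced by its own key, so the
      # same logical row can live on different slices (and nodes) per index.
      print("primary key slice:     ", slice_for((row["id"],)))
      print("user_id_updated slice: ", slice_for((row["user_id"], row["last_updated"])))

      # MongoDB-style: one shard key for the whole collection; secondary
      # indexes are co-located with the shard holding the document.
      print("single shard-key slice:", slice_for((row["user_id"],)))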

      Clustrix Fault Tolerance and Availability Demo
      Fault Tolerance

      Both Clustrix and MongoDB rely on replicas for fault tolerance. A loss of a node results in a loss of some copy of the data which we can find elsewhere in the system. The MongoDB team put together a good set of documentation describing their replication model. Perhaps one of the most salient differences between the two approaches is the granularity of data distribution.  The unit of recovery on Clustrix is the replica (a small portion of an index), while the unit of recovery in MongoDB is a full instance of a Replication Set.

      For Clustrix, this means that the reprotection operation happens in a many-to-many fashion. Several nodes copy small portions of data from each of their disks to several other nodes onto many disks. The advantages of this approach are:

      • No single disk in the system becomes overloaded with writes or reads
      • No single node hotspot for driving the reprotect work
      • Incremental progress toward full protection
      • Independent replica factors for each index (e.g. primary key 3x, indexes 2x)
      • Automatic reprotection which doesn't require operator intervention
      • All replicas are always consistent

      One of the interesting aspects of the system is the complete automation of every recovery task. It's built in. I don't have to do anything to make that happen. So if I have a 10 node system, and a node fails, in an hour or so I will have a completely protected  9 node system without any operator intervention at all. When the 10th node comes back, the system will simply perceive a distribution imbalance and start moving data back onto that node.

      While the Replica Sets feature in MongoDB is nicer than replication in say MySQL, it's still highly manual. So in contrast with the above list for Clustrix, for MongoDB we have:
      • Manual intervention to recover from failure
      • The data is moved in a one-to-one fashion
      • All data within a Replica Set has the same protection factor
      • Failures can lead to inconsistency

      Availability

      Both systems rely on having multiple copies of data for Availability. I've seen a lot of interesting discussion recently about the CAP theorem and what it means for real-world distributed database systems. It's another deep topic which really deserves its own set of posts, so I'll simply link to a couple posts on the subject which I find interesting and illuminating:
      At Clustrix, we think that Consistency, Availability, and Performance are much more important than Partition tolerance. Within a cluster, Clustrix maintains availability in the face of node loss while keeping strong consistency guarantees. But we do require that more than half of the nodes in the cluster group membership are online before accepting any user requests. So a cluster provides fully ACID compliant transactional semantics while keeping a high level of performance, but you need a majority of the nodes online.

      However, Clustrix also offers a lower level of consistency in the form of asynchronous replication between clusters. So if you want to set up a disaster recovery target in another physical location over a high-latency link, we're able to accommodate that mode. It simply means that your backup cluster may be out of date by some number of transactions.

      MongoDB has relaxed consistency all around. The Replication Set itself uses an asynchronous replication model. The MongoDB guys are upfront about the kinds of anomalies they expose. The end user gets the equivalent of read uncommitted isolation. Mongo's claim is that they do this because they (1) can achieve higher performance, and (2) "merging back old operations later, after another node has accepted writes, is a hard problem." Yes. Distributed protocols are a hard problem, but it doesn't mean you should punt on them.

      Availability Continued

      There's also a more nuanced discussion to availability. One of the principal design features of Clustrix has been to aim for lock-free operation whenever possible. We have Multi-Version Concurrency Control (MVCC) deeply ingrained in the system. It allows a transaction to see a consistent snapshot of the database without interfering with writes. So a read in our system will not block a write.

      Building on top of MVCC, Clustrix has implemented a transactionally safe, lockless, and fully consistent method for moving data in the cluster without blocking any writes to that data. All of this happens completely automatically. No administrator intervention required. So when the Rebalancer decides to move a replica from Node 1 to Node 3, the replica can continue to take writes. We have a mechanism to sync changes to the source replica with the target replica without limiting the replica availability.

      Compare that to what many other systems do: read lock the source to get a consistent view for a replica copy. You end up locking out writers for the duration of the data copy. So while your data is available for reads, it is not available for writes.
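
      Conceptually, the lockless move boils down to "copy a snapshot, log the writes that arrive during the copy, replay the log, then cut over." Here's a deliberately simplified Python sketch of that idea (not the Rebalancer's actual protocol):

      class Replica:
          def __init__(self, data=None):
              self.data = dict(data or {})
              self.move_log = None          # change log captured during a move

          def write(self, key, value):
              self.data[key] = value        # writers are never blocked by a move
              if self.move_log is not None:
                  self.move_log.append((key, value))

      def begin_move(source):
          source.move_log = []              # 1. start capturing new writes
          return Replica(source.data)       # 2. copy a consistent snapshot (MVCC)

      def finish_move(source, target):
          for key, value in source.move_log:
              target.data[key] = value      # 3. replay writes that landed mid-copy
          source.move_log = None            # 4. cut over to the new replica
          return target

      src = Replica({"row:1": "a"})
      dst = begin_move(src)
      src.write("row:2", "b")               # arrives while the copy is in flight
      print(finish_move(src, dst).data)     # {'row:1': 'a', 'row:2': 'b'}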

      After a node failure (or to be more precise, a replica failure within a set), MongoDB advocates the following approach:
      1. Quiesce the master (read lock)
      2. Flush dirty buffers to disk  (fsync)
      3. Take an LVM snapshot of the resulting files
      4. Unlock the master
      5. Move the data files over to the slave
      6. Let the slave catch up from the snapshot
      So a couple of points: (a) MongoDB is not available for writes during steps (1) and (2), and (b) it's a highly manual process. It reminds me very much of the MySQL best practices for setting up a slave.

      Conclusion

      I've seen a lot of heated debates about Consistency, Availability, Performance, and Fault Tolerance. These issues are deeply interconnected, and it's difficult to write about any of them in isolation. Clustrix maintains a high level of performance without sacrificing consistency, along with a very high degree of availability. I know that it's possible to build such a system because we actually built it. And you shouldn't sacrifice these features in your application because you believe it's the only way to achieve good performance.

      Sunday, January 30, 2011

      MongoDB vs. Clustrix Comparison: Part 1 -- Performance

      UPDATE:

      You can now download the benchmark code. As mentioned in my post, I populated both databases with 300M rows. I played around with the best client/host/thread ratio for each database to achieve peak throughput. I used MongoDB v1.6.5.

      For the MongoDB read tests, I used the read-test.cpp harness. I didn't have time to do a proper getopt parsing for it, so I would modify it by hand for test runs. But it's very straight forward.

      The C version of the MySQL/Clustrix harness is not included because I used a Clustrix internal test harness framework -- to save time. It doesn't impart Clustrix any advantage in the test, and it relies too much on our infrastructure to be of any value. You can still use the Python based test harness -- it just requires a lot of client cpu power.

      UPDATE 2:

      There's an interesting conversation over at Hacker News about this post.


      Introduction

      With all the recent buzz about NoSQL and non-relational databases, the marketing folks at Clustrix asked a question: Do we have the right solution for today's market? It's a fair question, especially since Clustrix is a fully featured RDBMS with a SQL interface. And we all heard that SQL doesn't scale, right?

      So that brings us around to the next question: What do people actually want out of their database? Surely it's not simply the absence of a SQL based interface. Because if that's the case, Berkeley DB would be a lot more popular than, say, SQLite. Over the years, we've had many conversations with people about their database needs. Over and over, the following have come up as must-have features for any modern system:

      • Incrementally Scalable Performance
      • High Availability and Fault Tolerance
      • Ease of Overall System Management

      Interestingly enough, we never heard that SQL or the relational model was the root of all their problems. It appears that the anti-SQL sentiment came around with this sort of false reasoning.
      I have a SQL based RDBMS.
      I can't seem to scale my database beyond a single system.
      Therefore SQL is the problem.

      The NoSQL movement embraced this reasoning. The NoSQL proponents began to promote all sorts of "scalable" systems at the expense of venerable DBMS features like durability. And they kept going. What else don't we need? Well, we don't need Consistency! Why? Because that's really hard to do and keep performance.  Slowly but surely, these systems would claim to have a panacea for all of your scalable database needs at the expense of cutting features we've come to expect from 40 years of database systems design.

      Well, that's just bullshit. There is absolutely nothing about SQL or the relational model preventing it from scaling out.

      Over the next set of posts, I'm going to compare MongoDB and Clustrix using the above evaluation criteria: Scalable Performance, Availability and Fault Tolerance, and Ease of Use. I am going to start with Performance because no one believes that you can grow a relational database to Internet Scale. And to put the results into context, I chose to compare Clustrix to MongoDB because (1) it doesn't support SQL, (2) it can transparently scale to multiple nodes, and (3) it seems to be the new poster child for NoSQL.

      Performance

      Conducting performance benchmarks is always challenging. First, you have to decide on a model workload. Next, you have to accurately simulate that workload. The system under test should be as close as possible to your production environment. The list gets long. In the end, no benchmark is going to be perfect. Best you can hope for is reasonable.

      So I looked at some common themes in the workloads from some of our customers, and decided that I would simulate a basic use case of keeping metadata about a collection of 1 Billion files. Whether you're a cloud based file storage provider or a photo sharing site, the use case is familiar. The test would use the appropriate access patterns for the database. Since MongoDB does not support joins, I'm not going to put it at a disadvantage by moving join logic into the application. Instead, I'm going to make full use of the native document centric interface.

      Benchmark Overview
      • 10 node cluster of dedicated hosts (SSD, 8 cores/node, 32GB ram/node)
      • 2x replication factor
      • A data set containing information about 300 Million files (scaled down from the original 1 Billion target; see below)
      • A read only performance test
      • A read/write performance test
      • Both databases will use the exact same hardware
      The test uses the following schema:

      CREATE TABLE files (
          id           bigint NOT NULL,
          path         varchar NOT NULL,
          size         bigint NOT NULL,
          server_id    int NOT NULL,
          deleted      smallint NOT NULL,
          last_updated datetime NOT NULL,
          created      datetime NOT NULL,
          user_id      bigint NOT NULL,
          PRIMARY KEY (id),
          KEY server_id_deleted (server_id, deleted),
          KEY user_id_updated (user_id, last_updated),
          KEY user_id_path (user_id, path)
      );

      Additionally, the data set has the following characteristics:
      • path is a randomly generated string between 32 and 128 characters long
      • server_id has a value distribution of 0-32
      • deleted has the value 1 for 1% of rows and 0 for the rest (to test lumpy data distributions)
      • for MongoDB, we use user_id as the shard key
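
      For reference, here's roughly what a row generator for that data set looks like (a Python sketch reconstructed from the characteristics above, not the actual harness code; the size range and user count are my own placeholder assumptions):

      import datetime
      import random
      import string

      def random_path():
          length = random.randint(32, 128)
          return "".join(random.choice(string.ascii_lowercase + "/.") for _ in range(length))

      def make_row(file_id, num_users=10_000_000):
          now = datetime.datetime.utcnow()
          return {
              "id": file_id,
              "path": random_path(),
              "size": random.randint(1, 2**30),                 # placeholder range
              "server_id": random.randint(0, 32),
              "deleted": 1 if random.random() < 0.01 else 0,    # 1% deleted
              "last_updated": now,
              "created": now,
              "user_id": random.randint(1, num_users),          # MongoDB shard key
          }

      print(make_row(1))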

      Test 1: Loading the Initial Data Set

      The test harness itself is a multi-host, multi-process, and multi-threaded python application. Because of the GIL in python, I ended up designing the test harness so that it forks off multiple processes, with each process having some number of threads. It also turns out that I needed more than 10 client machines to saturate the cluster with reads using python, so I rewrote the read tests using C++ for MongoDB and C for Clustrix.
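
      The overall harness shape was roughly the following (skeleton only; connection setup and the per-database query functions are omitted):

      import multiprocessing
      import threading

      def worker_thread(run_query, iterations):
          for _ in range(iterations):
              run_query()

      def worker_process(run_query, threads_per_process, iterations):
          threads = [threading.Thread(target=worker_thread, args=(run_query, iterations))
                     for _ in range(threads_per_process)]
          for t in threads:
              t.start()
          for t in threads:
              t.join()

      def run_harness(run_query, processes=8, threads_per_process=16, iterations=1000):
          # Multiple processes sidestep the GIL; the threads inside each process
          # keep connections busy while they wait on the network.
          procs = [multiprocessing.Process(target=worker_process,
                                           args=(run_query, threads_per_process, iterations))
                   for _ in range(processes)]
          for p in procs:
              p.start()
          for p in procs:
              p.join()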

      While populating the dataset into MongoDB, I kept running into a huge drop off in performance at around 55-60M rows. A 10 node cluster has an aggregate of 320GB of ram and 80 160GB SSD drives. That's more than enough iron to handle that much data. As I started to dig in more, I saw that 85% of the data was distributed to a single node. MongoDB had split the data into multiple chunks, but its balancer could not (would not?) move the data to other nodes. Once the database size exceeded that node's available memory, everything went to shit. The box started thrashing pretty badly. It seems that under a constant high write load, MongoDB is unable to automatically redistribute data within the cluster.



      To get the test going, I split the files collection into an even distribution. Without any load on the cluster, I watched MongoDB move the chunks onto the 10 replica sets for an even layout. Now I was finally getting somewhere.

      Immediately, I noticed that MongoDB had a highly variable write throughput. I was also surprised at how low the numbers were. Which led me to discover Mongo's concurrency control: a single mutex over the database instance. Furthermore, in tracking the insert performance along with memory utilization, I could see that getting to more than 300 million records would spill some of the data set to disk. While that's a reasonable benchmark for a database, I decided that I would keep the data set memory resident for Mongo's sake.


      The drop-off on the Clustrix side happened because not all of the load scripts finished at the same time. A couple of the client nodes were slower, so they took a bit longer to finish up their portion of the load.

      For a system which eschews consistency and durability, the write performance of MongoDB looks atrocious. Initially, I expected Mongo to completely trash Clustrix on write performance. The result was a complete surprise. Here's why I think Clustrix did so much better:
      • Highly concurrent b-tree implementation with fine grained locking
      • Buffer manager and transaction log tuned for SSD
      • Completely lock-free Split and Move operations (very cool stuff, another post)
      Conversely, Mongo did so poorly because it:
      • Has a big fat lock, severely limiting concurrency
      • It relies entirely on the kernel buffer manager
      Total time to load was 0:37 (hh:mm) on Clustrix and 4:47 on MongoDB.

      Clustrix was 775% faster for writes than MongoDB!
      And that's with fully durable and fully consistent writes on the Clustrix side.

        Test 2: Read Only

        The read test consists of the following basic workloads:
        • get the 10 latest updated files for a specific user
        • count the number of deleted files on a given server id

          I chose these queries because they are representative of the types of queries our example application would generate, and they are not simple point selects. Getting a distributed hash table working is easy. But DHTs tend to fall apart fairly quickly when queries start introducing ordering, examining multiple rows,  or other non key-value lookups. In other words, real-world use.

          MongoDB
          C++ test harness. Peak throughput at 64 concurrent client threads.
          db.files.find({user_id: user_id}).sort({last_updated: -1}).limit(10)
          55,103 queries/sec
          db.files.find({'server_id': server_id, 'deleted': 1}).count()
          675 queries/sec

          Clustrix
          C test harness. Peak throughput at 256 concurrent client threads.
          select * from benchmark.files where user_id = .. order by last_updated desc limit 10
          56,641 queries/sec 
          select count(1) from benchmark.files where server_id = .. and deleted = 1
          625 queries/sec

          So on the read-only tests, MongoDB and Clustrix are close: Clustrix is about 3% faster on test 1, and MongoDB is about 8% faster on test 2. I captured a profile of Clustrix during a test 1 run and saw that the execution engine dominates CPU time (as opposed to, say, SQL parsing or query planning). In looking at the profiles during test 2 runs on Clustrix, I saw that we had a bunch of idle time in the system, so there's room for optimization.

          But real-world loads tend to be read/write, so let's see how Mongo does when we add writes to the equation.


          Test 3: Read/Write

          My initial plan called for a combination of read-centric and write-centric loads. It seems that most web infrastructures are heavier on the reads than writes, but there are many exceptions. In Clustrix, we use Multi-Version Concurrency Control, which means that readers are never blocked by writers. We handle both read heavy and write heavy workloads equally well. Since MongoDB seems to do much better with reads than writes, I decided to stick to a read-centric workload.

          The Clustrix test shows very little drop off in performance for reads. On the Mongo side, I expected to see a drop off in performance directly proportional to the amount of write load.

          However, what I saw was mind blowing: Mongo completely starved the readers! The following graph shows the query load on one of the 10 shards during the write portion of the test. I simply started up a single write thread while letting the read test run. The write was active for all of 60 seconds, and it took Mongo an additional 15 seconds to recover after the writer stopped.


          MongoDB
          FAIL

          Clustrix
          test1: 4,425 writes/sec and 49,099 reads/sec (total 53,524 queries/sec) (92% read / 8% write)
          test2: 4,450 writes/sec and 625 reads/sec

          The test2 aggregate query is much more computationally expensive compared to an insert. So the read/write ratio for test2 became very skewed. Note that Clustrix did not drop in read throughput at all.

          Overall, you can see why every modern DBMS has chosen to go with the MVCC model for concurrency control.

          Conclusion

          The SQL relational model can clearly scale. The only place where MongoDB could compete with Clustrix was on pure read-only workloads, and that's just not representative of real world application loads.

          Building a scalable distributed system is more about good architecture and solid engineering. Now that we have scale and performance out of the way, I'm going to review the other important aspects of a DBMS in my upcoming posts.