Sunday, January 30, 2011

MongoDB vs. Clustrix Comparison: Part 1 -- Performance

UPDATE:

You can now download the benchmark code. As mentioned in my post, I populated both databases with 300M rows. I played around with the best client/host/thread ratio for each database to achieve peak throughput. I used MongoDB v1.6.5.

For the MongoDB read tests, I used the read-test.cpp harness. I didn't have time to add proper getopt parsing to it, so I modified it by hand between test runs. But it's very straightforward.

The C version of the MySQL/Clustrix harness is not included because, to save time, I used a Clustrix-internal test harness framework. It doesn't give Clustrix any advantage in the test, and it relies too heavily on our infrastructure to be of any value outside the company. You can still use the Python-based test harness -- it just requires a lot of client CPU power.

UPDATE 2:

There's an interesting conversation over at Hacker News about this post.


Introduction

With all the recent buzz about NoSQL and non-relational databases, the marketing folks at Clustrix asked a question: Do we have the right solution for today's market? It's a fair question, especially since Clustrix is a fully featured RDBMS with a SQL interface. And we've all heard that SQL doesn't scale, right?

So that brings us around to the next question: What do people actually want out of their database? Surely it's not simply the absence of a SQL-based interface; if that were the case, Berkeley DB would be a lot more popular than, say, SQLite. Over the years, we've had many conversations with people about their database needs. Over and over, the following comes up as the list of must-have features for any modern system:

  • Incrementally Scalable Performance
  • High Availability and Fault Tolerance
  • Ease of Overall System Management

Interestingly enough, we never heard that SQL or the relational model was the root of all their problems. It appears that the anti-SQL sentiment grew out of this sort of false reasoning:
  • I have a SQL-based RDBMS.
  • I can't seem to scale my database beyond a single system.
  • Therefore SQL is the problem.

The NoSQL movement embraced this reasoning. The NoSQL proponents began to promote all sorts of "scalable" systems at the expense of venerable DBMS features like durability. And they kept going. What else don't we need? Well, we don't need Consistency! Why? Because that's really hard to do while keeping performance up. Slowly but surely, these systems came to claim a panacea for all of your scalable database needs, at the cost of cutting features we've come to expect from 40 years of database systems design.

Well, that's just bullshit. There is absolutely nothing about SQL or the relational model preventing it from scaling out.

Over the next set of posts, I'm going to compare MongoDB and Clustrix using the above evaluation criteria: Scalable Performance, Availability and Fault Tolerance, and Ease of Use. I am going to start with Performance because no one believes that you can grow a relational database to Internet Scale. And to put the results into context, I chose to compare Clustrix to MongoDB because (1) it doesn't support SQL, (2) it can transparently scale to multiple nodes, and (3) it seems to be the new poster child for NoSQL.

Performance

Conducting performance benchmarks is always challenging. First, you have to decide on a model workload. Next, you have to accurately simulate that workload. The system under test should be as close as possible to your production environment. The list gets long. In the end, no benchmark is going to be perfect. The best you can hope for is reasonable.

So I looked at some common themes in the workloads from some of our customers, and decided that I would simulate a basic use case of keeping metadata about a collection of 1 Billion files. Whether you're a cloud-based file storage provider or a photo-sharing site, the use case is familiar. Each test uses the access patterns appropriate to the database: since MongoDB does not support joins, I'm not going to put it at a disadvantage by forcing join logic into the application. Instead, I'm going to make full use of its native document-centric interface.

Benchmark Overview
  • 10 node cluster of dedicated hosts (SSD, 8 cores/node, 32GB RAM/node)
  • 2x replication factor
  • A data set containing information about 300 Million files (scaled down from 1 Billion; see below)
  • A read only performance test
  • A read/write performance test
  • Both databases will use the exact same hardware
The test uses the following schema:

CREATE TABLE files (
    id           bigint NOT NULL,
    path         varchar(255) NOT NULL,
    size         bigint NOT NULL,
    server_id    int NOT NULL,
    deleted      smallint NOT NULL,
    last_updated datetime NOT NULL,
    created      datetime NOT NULL,
    user_id      bigint NOT NULL,
    PRIMARY KEY (id),
    KEY server_id_deleted (server_id, deleted),
    KEY user_id_updated (user_id, last_updated),
    KEY user_id_path (user_id, path)
);
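For the MongoDB side, I shaped the collection the same way. Roughly, the setup looks like this pymongo sketch -- the host name and exact calls are illustrative, not my actual load scripts; the secondary indexes mirror the keys above, and the collection is sharded on user_id (more on that below):

from pymongo import MongoClient, ASCENDING, DESCENDING

client = MongoClient("mongos-host", 27017)   # placeholder: connect through a mongos router
files = client["benchmark"]["files"]

# Secondary indexes mirroring the keys in the SQL schema above
files.create_index([("server_id", ASCENDING), ("deleted", ASCENDING)])
files.create_index([("user_id", ASCENDING), ("last_updated", DESCENDING)])
files.create_index([("user_id", ASCENDING), ("path", ASCENDING)])

# Shard the collection on user_id (see the data set notes below)
client.admin.command("enableSharding", "benchmark")
client.admin.command("shardCollection", "benchmark.files", key={"user_id": 1})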

Additionally, the data set has the following characteristics (a generation sketch follows the list):
  • path is a randomly generated string between 32 and 128 characters long
  • server_id values are distributed between 0 and 32
  • deleted is 1 for 1% of the rows and 0 for the rest (to test lumpy data distributions)
  • for MongoDB, we use user_id as the shard key
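To make that concrete, generating a single record looks roughly like this; the helper name and the user_id, size, and date ranges are placeholders of mine, not necessarily what the harness uses:

import random
import string
from datetime import datetime, timedelta

def make_file_record(file_id):
    # One 'files' record matching the characteristics above; the user_id,
    # size, and date ranges are illustrative placeholders.
    created = datetime(2010, 1, 1) + timedelta(seconds=random.randint(0, 365 * 24 * 3600))
    return {
        "id": file_id,
        "path": "".join(random.choice(string.ascii_letters)
                        for _ in range(random.randint(32, 128))),
        "size": random.randint(0, 2 ** 32),
        "server_id": random.randint(0, 32),
        "deleted": 1 if random.random() < 0.01 else 0,   # ~1% deleted
        "last_updated": created + timedelta(seconds=random.randint(0, 24 * 3600)),
        "created": created,
        "user_id": random.randint(0, 10000000),
    }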

Test 1: Loading the Initial Data Set

The test harness itself is a multi-host, multi-process, and multi-threaded Python application. Because of the GIL in Python, I designed the harness to fork off multiple processes, with each process running some number of threads. It also turns out that I needed more than 10 client machines to saturate the cluster with reads using Python, so I rewrote the read tests using C++ for MongoDB and C for Clustrix.
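The skeleton of the Python harness boils down to something like this (the process/thread counts and the partitioning helper are illustrative, not the real harness code):

import multiprocessing
import threading

NUM_PROCESSES = 4          # illustrative; tuned per client host in practice
THREADS_PER_PROCESS = 8

def partition_keyspace(num_workers, total_rows=300000000):
    # Divide the id space into contiguous ranges, one per worker.
    step = total_rows // num_workers
    return [(i * step, (i + 1) * step) for i in range(num_workers)]

def worker_thread(id_range):
    start, end = id_range
    # open a database connection here and insert/query ids in [start, end)
    pass

def worker_process(id_ranges):
    threads = [threading.Thread(target=worker_thread, args=(r,)) for r in id_ranges]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

if __name__ == "__main__":
    batches = partition_keyspace(NUM_PROCESSES * THREADS_PER_PROCESS)
    procs = [multiprocessing.Process(target=worker_process,
                                     args=(batches[i::NUM_PROCESSES],))
             for i in range(NUM_PROCESSES)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()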

While populating the data set into MongoDB, I kept running into a huge drop-off in performance at around 55-60M rows. A 10 node cluster has an aggregate of 320GB of RAM and 80 x 160GB SSD drives. That's more than enough iron to handle that much data. As I dug in, I saw that 85% of the data had landed on a single node. MongoDB had split the data into multiple chunks, but its balancer could not (would not?) move them to other nodes. Once the database size exceeded that node's available memory, everything went to shit. The box started thrashing pretty badly. It seems that under a constant high write load, MongoDB is unable to automatically redistribute data within the cluster.



To get the test going, I manually split the files collection into an even chunk distribution. Without any load on the cluster, I watched MongoDB move the chunks onto the 10 replica sets for an even layout. Now I was finally getting somewhere.
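If you want to reproduce the pre-split, it amounts to issuing split and moveChunk admin commands at hand-picked user_id boundaries -- roughly this pymongo sketch, with placeholder boundaries and shard names:

from pymongo import MongoClient

admin = MongoClient("mongos-host", 27017).admin     # placeholder host

split_points = range(1000000, 10000000, 1000000)    # illustrative user_id boundaries
shards = ["shard%04d" % i for i in range(10)]       # default-style shard names

for i, boundary in enumerate(split_points):
    admin.command("split", "benchmark.files", middle={"user_id": boundary})
    admin.command("moveChunk", "benchmark.files",
                  find={"user_id": boundary},
                  to=shards[(i + 1) % len(shards)])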

Immediately, I noticed that MongoDB had a highly variable write throughput. I was also surprised at how low the numbers were, which led me to discover Mongo's concurrency control: a single mutex over the entire database instance. Furthermore, tracking insert performance along with memory utilization, I could see that going beyond 300 million records would spill some of the data set to disk. While that would be a reasonable benchmark for a database, I decided to keep the data set memory resident for Mongo's sake.


The drop-off on the Clustrix side happened because not all of the load scripts finished at the same time. A couple of the client nodes were slower, so they took a bit longer to finish their portion of the load.

For a system which eschews consistency and durability, the write performance on MongoDB looks atrocious. Initially, I expected Mongo to completely trash Clustrix on write performance. The result was a complete surprise. Here's why I think Clustrix did so much better:
  • Highly concurrent b-tree implementation with fine grained locking
  • Buffer manager and transaction log tuned for SSD
  • Completely lock-free Split and Move operations (very cool stuff, another post)
Conversely, Mongo did so poorly because it:
  • Has a big fat lock, severely limiting concurrency
  • Relies entirely on the kernel's buffer manager
Total time to load was 0:37 (hh:mm) on Clustrix and 4:47 on MongoDB.

Clustrix loaded the data 7.75x faster than MongoDB!
And that's with fully durable and fully consistent writes on the Clustrix side.

Test 2: Read Only

The read test consists of the following basic workloads:
  • get the 10 latest updated files for a specific user
  • count the number of deleted files on a given server id

I chose these queries because they are representative of the types of queries our example application would generate, and they are not simple point selects. Getting a distributed hash table working is easy. But DHTs tend to fall apart fairly quickly when queries start introducing ordering, examining multiple rows, or other non key-value lookups. In other words, real-world use.

MongoDB
C++ test harness. Peak throughput at 64 concurrent client threads.
db.files.find({user_id: user_id}).sort({last_updated: -1}).limit(10)
55,103 queries/sec
db.files.find({'server_id': server_id, 'deleted': 1}).count()
675 queries/sec
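For reference, here are the same two queries expressed through pymongo (connection details and parameter values are just examples):

from pymongo import MongoClient, DESCENDING

db = MongoClient("mongos-host", 27017)["benchmark"]
user_id, server_id = 42, 7   # example parameters

# 10 most recently updated files for a user
latest = list(db.files.find({"user_id": user_id})
                      .sort("last_updated", DESCENDING)
                      .limit(10))

# count of deleted files on a given server
deleted = db.files.count_documents({"server_id": server_id, "deleted": 1})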

Clustrix
C test harness. Peak throughput at 256 concurrent client threads.
select * from benchmark.files where user_id = .. order by last_updated desc limit 10
56,641 queries/sec
select count(1) from benchmark.files where server_id = .. and deleted = 1
625 queries/sec

So on the read-only tests, the two databases are close: Clustrix is about 3% faster on test 1, and MongoDB is about 8% faster on test 2. I captured a profile of Clustrix during a test 1 run and saw that the execution engine dominates CPU time (as opposed to, say, SQL parsing or query planning). Looking at profiles during test 2 runs on Clustrix, I saw a fair amount of idle time in the system, so there's room for optimization.

But real-world loads tend to be read/write, so let's see how Mongo does when we add writes to the equation.


Test 3: Read/Write

My initial plan called for a combination of read-centric and write-centric loads. It seems that most web infrastructures are heavier on reads than writes, but there are many exceptions. Clustrix uses Multi-Version Concurrency Control, which means that readers are never blocked by writers, so we handle both read-heavy and write-heavy workloads equally well. Since MongoDB seems to do much better with reads than writes, I decided to stick to a read-centric workload.

The Clustrix test shows very little drop-off in read performance. On the Mongo side, I expected to see a drop-off in performance directly proportional to the amount of write load.

However, what I saw was mind-blowing: Mongo completely starved the readers! I simply started up a single write thread while letting the read test run. The writer was active for all of 60 seconds, and it took Mongo an additional 15 seconds to recover after the writer stopped. The graph below shows the query load on one of the 10 shards during that window.
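That write thread was nothing exotic -- roughly the following sketch, reusing the record generator from the load section; the thread handling and timing here are illustrative:

import threading
import time

def write_burst(db, seconds=60, start_id=300000001):
    # Insert continuously for `seconds` while the read test keeps running.
    stop = threading.Event()

    def writer():
        i = start_id
        while not stop.is_set():
            db.files.insert_one(make_file_record(i))   # record generator from the load sketch
            i += 1

    t = threading.Thread(target=writer)
    t.start()
    time.sleep(seconds)
    stop.set()
    t.join()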


MongoDB
FAIL

Clustrix
test1: 4,425 writes/sec and 49,099 reads/sec (total 53,524 queries/sec; 92% read / 8% write)
test2: 4,450 writes/sec and 625 reads/sec

The test2 aggregate query is much more computationally expensive than an insert, so the read/write ratio for test2 became very skewed. Note that Clustrix did not drop in read throughput at all.

Overall, you can see why every modern DBMS chooses to go with the MVCC model for concurrency control.

Conclusion

The SQL relational model can clearly scale. The only place where MongoDB could compete with Clustrix was on pure read-only workloads, and that's just not representative of real-world application loads.

Building a scalable distributed system is more about good architecture and solid engineering than it is about dropping SQL or the relational model. Now that we have scale and performance out of the way, I'm going to review the other important aspects of a DBMS in my upcoming posts.