Sunday, January 30, 2011

MongoDB vs. Clustrix Comparison: Part 1 -- Performance

UPDATE:

You can now download the benchmark code. As mentioned in my post, I populated both databases with 300M rows. I played around with the best client/host/thread ratio for each database to achieve peak throughput. I used MongoDB v1.6.5.

For the MongoDB read tests, I used the read-test.cpp harness. I didn't have time to do proper getopt parsing for it, so I would modify it by hand between test runs. But it's very straightforward.

The C version of the MySQL/Clustrix harness is not included because I used a Clustrix internal test harness framework -- to save time. It doesn't give Clustrix any advantage in the test, and it relies too much on our infrastructure to be of any value outside the company. You can still use the Python-based test harness -- it just requires a lot of client CPU power.

UPDATE 2:

There's an interesting conversation over at Hacker News about this post.


Introduction

With all the recent buzz about NoSQL and non-relational databases, the marketing folks at Clustrix asked a question: Do we have the right solution for today's market? It's a fair question, especially since Clustrix is a fully featured RDBMS with a SQL interface. And we've all heard that SQL doesn't scale, right?

So that brings us around to the next question: What do people actually want out of their database? Surely it's not simply the absence of a SQL-based interface; if that were the case, Berkeley DB would be a lot more popular than, say, SQLite. Over the years, we've had many conversations with people about their database needs. Over and over, the following has always come up as the list of must-have features for any modern system:

  • Incrementally Scalable Performance
  • High Availability and Fault Tolerance
  • Ease of Overall System Management

Interestingly enough, we never heard that SQL or the relational model was the root of all their problems. It appears that the anti-SQL sentiment grew out of this sort of false reasoning:
  1. I have a SQL-based RDBMS.
  2. I can't seem to scale my database beyond a single system.
  3. Therefore, SQL is the problem.

The NoSQL movement embraced this reasoning. Its proponents began to promote all sorts of "scalable" systems at the expense of venerable DBMS features like durability. And they kept going. What else don't we need? Well, we don't need consistency! Why? Because that's really hard to do while keeping performance up. Slowly but surely, these systems came to claim a panacea for all of your scalable database needs, at the cost of cutting features we've come to expect from 40 years of database systems design.

Well, that's just bullshit. There is absolutely nothing about SQL or the relational model preventing it from scaling out.

Over the next set of posts, I'm going to compare MongoDB and Clustrix using the above evaluation criteria: Scalable Performance, Availability and Fault Tolerance, and Ease of Use. I am going to start with Performance because no one believes that you can grow a relational database to Internet Scale. And to put the results into context, I chose to compare Clustrix to MongoDB because (1) it doesn't support SQL, (2) it can transparently scale to multiple nodes, and (3) it seems to be the new poster child for NoSQL.

Performance

Conducting performance benchmarks is always challenging. First, you have to decide on a model workload. Next, you have to accurately simulate that workload. The system under test should be as close as possible to your production environment. The list gets long. In the end, no benchmark is going to be perfect; the best you can hope for is reasonable.

So I looked at some common themes in the workloads of some of our customers and decided to simulate a basic use case: keeping metadata about a collection of 1 Billion files. Whether you're a cloud-based file storage provider or a photo sharing site, the use case is familiar. The test uses the access patterns appropriate to each database. Since MongoDB does not support joins, I'm not going to put it at a disadvantage by moving join logic into the application. Instead, I'm going to make full use of its native document-centric interface.

Benchmark Overview
  • 10 node cluster of dedicated hosts (SSD, 8 cores/node, 32GB RAM/node)
  • 2x replication factor
  • A data set containing information about 300 Million files (scaled down from the planned 1 Billion; see below)
  • A read-only performance test
  • A read/write performance test
  • Both databases will use the exact same hardware
The test uses the following schema:

CREATE TABLE files (
    id           bigint NOT NULL,
    path         varchar(128) NOT NULL,
    size         bigint NOT NULL,
    server_id    int NOT NULL,
    deleted      smallint NOT NULL,
    last_updated datetime NOT NULL,
    created      datetime NOT NULL,
    user_id      bigint NOT NULL,
    PRIMARY KEY (id),
    KEY server_id_deleted (server_id, deleted),
    KEY user_id_updated (user_id, last_updated),
    KEY user_id_path (user_id, path)
);

Additionally, the data set has the following characteristics:
  • path is a randomly generated string between 32 and 128 characters long
  • server_id is distributed over the range 0-32
  • deleted is 1 for 1% of rows and 0 for the rest (to test lumpy data distributions)
  • for MongoDB, we use user_id as the shard key (a sketch of the corresponding MongoDB setup follows this list)
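
For reference, here is a minimal sketch of what the MongoDB side of this schema might look like from Python (pymongo 1.x style). The connection host, the sample document values, and the exact index/sharding calls are illustrative assumptions, not the actual harness code:

# Minimal sketch (not the benchmark harness): one document per file, secondary
# indexes mirroring the SQL KEYs, and user_id as the shard key.
from datetime import datetime
from pymongo import Connection, ASCENDING, DESCENDING  # pymongo 1.x API

conn = Connection("mongos-host", 27017)        # placeholder mongos router host
files = conn.benchmark.files

example_doc = {
    "_id": 12345,                              # maps to the SQL id column
    "path": "/some/randomly/generated/path",
    "size": 1048576,
    "server_id": 7,
    "deleted": 0,
    "last_updated": datetime.utcnow(),
    "created": datetime.utcnow(),
    "user_id": 987654321,
}
files.insert(example_doc)

# Secondary indexes equivalent to the KEYs in the CREATE TABLE above.
files.ensure_index([("server_id", ASCENDING), ("deleted", ASCENDING)])
files.ensure_index([("user_id", ASCENDING), ("last_updated", DESCENDING)])
files.ensure_index([("user_id", ASCENDING), ("path", ASCENDING)])

# Shard the collection on user_id (commands run against the admin database).
conn.admin.command("enablesharding", "benchmark")
conn.admin.command("shardcollection", "benchmark.files", key={"user_id": 1})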

Test 1: Loading the Initial Data Set

The test harness itself is a multi-host, multi-process, and multi-threaded Python application. Because of the GIL in Python, I designed the harness to fork off multiple processes, with each process running some number of threads. It also turned out that I needed more than 10 client machines to saturate the cluster with reads using Python, so I rewrote the read tests using C++ for MongoDB and C for Clustrix.
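
Here's a rough sketch of that process/thread structure. It's illustrative only -- the process and thread counts are made up, and the actual connection setup and row generation live in the downloadable benchmark code:

# Skeleton of a fork-multiple-processes, multiple-threads-per-process loader.
import threading
from multiprocessing import Process

NUM_PROCESSES = 8          # illustrative values; tune per client host
THREADS_PER_PROCESS = 16
ROWS_PER_THREAD = 100000

def insert_worker(proc_id, thread_id):
    # Each thread would open its own connection and insert its slice of rows.
    # (Connection setup and row generation omitted; see the real harness.)
    for i in xrange(ROWS_PER_THREAD):
        pass  # build and insert the row/document for (proc_id, thread_id, i)

def run_process(proc_id):
    threads = [threading.Thread(target=insert_worker, args=(proc_id, t))
               for t in xrange(THREADS_PER_PROCESS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

if __name__ == "__main__":
    procs = [Process(target=run_process, args=(p,)) for p in xrange(NUM_PROCESSES)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()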

While populating the dataset into MongoDB, I kept running into a huge drop-off in performance at around 55-60M rows. A 10 node cluster has an aggregate of 320GB of RAM and eighty 160GB SSD drives. That's more than enough iron to handle that much data. As I started to dig in more, I saw that 85% of the data was distributed to a single node. MongoDB had split the data into multiple chunks, but its balancer could not (would not?) move the data to other nodes. Once the database size exceeded that node's available memory, everything went to shit. The box started thrashing pretty badly. It seems that under a constant high write load, MongoDB is unable to automatically redistribute data within the cluster.



To get the test going, I manually pre-split the files collection into an even distribution. Without any load on the cluster, I watched MongoDB move the chunks onto the 10 replica sets for an even layout. Now I was finally getting somewhere.
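
For the curious, a manual pre-split like this can be driven through the split and moveChunk admin commands. The sketch below assumes user_id is roughly uniform over a known range and that shard names follow the default shard0000-style naming; both are assumptions, not the exact commands I ran:

# Sketch of manually pre-splitting benchmark.files on user_id and spreading
# the chunks round-robin across 10 shards.
from pymongo import Connection

conn = Connection("mongos-host", 27017)   # placeholder mongos router host
admin = conn.admin

MAX_USER_ID = 2 ** 32                     # assumed upper bound on user_id
NUM_CHUNKS = 100

split_points = [MAX_USER_ID * i // NUM_CHUNKS for i in range(1, NUM_CHUNKS)]

for point in split_points:
    # Split the chunk containing this user_id value...
    admin.command("split", "benchmark.files", middle={"user_id": point})

for i, point in enumerate(split_points):
    # ...and move the resulting chunks round-robin across the shards.
    target = "shard%04d" % (i % 10)       # illustrative shard names
    admin.command("moveChunk", "benchmark.files", find={"user_id": point}, to=target)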

Immediately, I noticed that MongoDB had highly variable write throughput. I was also surprised at how low the numbers were, which led me to discover Mongo's concurrency control: a single mutex over the entire database instance. Furthermore, in tracking insert performance along with memory utilization, I could see that going beyond 300 million records would spill some of the data set to disk. While that's a reasonable benchmark for a database, I decided to keep the data set memory resident for Mongo's sake.


The drop-off on the Clustrix side happened because not all of the load scripts finished at the same time. A couple of the client nodes were slower, so they took a bit longer to finish their portion of the load.

For a system which eschews consistency and durability, the write performance of MongoDB looks atrocious. Initially, I expected Mongo to completely trash Clustrix on write performance, so the result was a complete surprise. Here's why I think Clustrix did so much better:
  • Highly concurrent b-tree implementation with fine grained locking
  • Buffer manager and transaction log tuned for SSD
  • Completely lock-free Split and Move operations (very cool stuff, another post)
Conversely, Mongo did so poorly because it:
  • Has a big fat lock, severely limiting concurrency
  • Relies entirely on the kernel's buffer manager
Total time to load was 0:37 (hh:mm) on Clustrix and 4:47 on MongoDB.

Clustrix was 775% faster for writes than MongoDB!
And that's with fully durable and fully consistent writes on the Clustrix side.
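For reference, loading 300 million rows in 37 minutes works out to roughly 135,000 inserts/sec in aggregate on Clustrix, versus roughly 17,500 inserts/sec over MongoDB's 4 hours and 47 minutes.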

Test 2: Read Only

The read test consists of the following basic workloads:
  • get the 10 latest updated files for a specific user
  • count the number of deleted files on a given server id

I chose these queries because they are representative of the types of queries our example application would generate, and they are not simple point selects. Getting a distributed hash table working is easy, but DHTs tend to fall apart fairly quickly when queries start introducing ordering, examining multiple rows, or other non-key-value lookups. In other words, real-world use.

MongoDB
C++ test harness. Peak throughput at 64 concurrent client threads.
db.files.find({user_id: user_id}).sort({last_updated: -1}).limit(10)
55,103 queries/sec
db.files.find({'server_id': server_id, 'deleted': 1}).count()
675 queries/sec

Clustrix
C test harness. Peak throughput at 256 concurrent client threads.
select * from benchmark.files where user_id = .. order by last_updated desc limit 10
56,641 queries/sec
select count(1) from benchmark.files where server_id = .. and deleted = 1
625 queries/sec
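
For context on how numbers like these are derived, here is a sketch of the measurement loop: N client threads each hammer the test query on their own connection, and the aggregate completed count is divided by wall-clock time. As noted earlier, the real read harnesses were C++/C precisely because Python couldn't drive enough load; this just shows the shape of the measurement, with run_query standing in for one query on a per-thread connection:

import threading
import time

def measure_qps(run_query, num_threads=64, duration=60):
    # run_query() issues a single instance of the test query on a connection
    # owned by the calling thread; it is a stand-in, not the real harness call.
    completed = [0]
    lock = threading.Lock()

    def worker():
        deadline = time.time() + duration
        local = 0
        while time.time() < deadline:
            run_query()
            local += 1
        with lock:
            completed[0] += local

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    start = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return completed[0] / (time.time() - start)   # aggregate queries/sec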

So on the read-only workload, MongoDB and Clustrix are close: Clustrix is about 3% faster on test 1, and MongoDB is about 8% faster on test 2. I captured a profile of Clustrix during a test 1 run and saw that the execution engine dominates CPU time (as opposed to, say, SQL parsing or query planning). In looking at the profiles during test 2 runs on Clustrix, I saw a bunch of idle time in the system, so there's room for optimization.

But real-world loads tend to be read/write, so let's see how Mongo does when we add writes to the equation.


Test 3: Read/Write

My initial plan called for a combination of read-centric and write-centric loads. Most web infrastructures seem to be heavier on reads than writes, but there are many exceptions. Clustrix uses Multi-Version Concurrency Control (MVCC), which means that readers are never blocked by writers, so we handle both read-heavy and write-heavy workloads equally well. Since MongoDB seems to do much better with reads than writes, I decided to stick to a read-centric workload.
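
As a quick illustration of what "readers are never blocked by writers" means in practice, here's a hypothetical two-connection example over the MySQL wire protocol (Clustrix speaks MySQL; MySQLdb, the host name, and the credentials are placeholders, not the benchmark setup):

import MySQLdb

# Two separate connections: one holds an open, uncommitted write...
writer = MySQLdb.connect(host="clustrix-host", user="bench", passwd="secret", db="benchmark")
reader = MySQLdb.connect(host="clustrix-host", user="bench", passwd="secret", db="benchmark")

wcur = writer.cursor()
wcur.execute("BEGIN")
wcur.execute("UPDATE files SET size = size + 1 WHERE id = 42")  # uncommitted write

# ...while the other reads the same row. Under MVCC the SELECT returns
# immediately with the last committed value instead of waiting on the writer.
rcur = reader.cursor()
rcur.execute("SELECT size FROM files WHERE id = 42")
print rcur.fetchone()

writer.rollback()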

The Clustrix test shows very little drop-off in read performance. On the Mongo side, I expected to see a drop-off in performance roughly proportional to the amount of write load.

However, what I saw was mind blowing: Mongo completely starved the readers! The following graph shows the query load on one of the 10 shards during the write portion of the test. I simply started up a single write thread while letting the read test run. The writer was active for all of 60 seconds, and it took Mongo an additional 15 seconds to recover after the writer stopped.
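
To make that setup concrete, the sketch below shows the shape of the run: a pool of reader threads runs continuously, a single writer thread is started partway through for 60 seconds, and read throughput is sampled once per second so any starvation window shows up in the time series. run_query and run_insert are stand-ins for one read/one insert on a per-thread connection, and the thread counts and timings are illustrative:

import threading
import time

def run_mixed_test(run_query, run_insert, readers=64,
                   write_start=30, write_secs=60, total_secs=120):
    counts = [0]
    lock = threading.Lock()
    stop = threading.Event()

    def reader():
        while not stop.is_set():
            run_query()
            with lock:
                counts[0] += 1

    def writer():
        deadline = time.time() + write_secs
        while time.time() < deadline:
            run_insert()

    threads = [threading.Thread(target=reader) for _ in range(readers)]
    for t in threads:
        t.start()

    for sec in range(total_secs):
        if sec == write_start:
            threading.Thread(target=writer).start()   # single write thread
        time.sleep(1)
        with lock:
            sample, counts[0] = counts[0], 0
        print sec, sample                              # reads completed this second

    stop.set()
    for t in threads:
        t.join()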


MongoDB
FAIL

Clustrix
test 1: 4,425 writes/sec and 49,099 reads/sec (53,524 total queries/sec; 92% read / 8% write)
test 2: 4,450 writes/sec and 625 reads/sec

The test 2 aggregate query is much more computationally expensive than an insert, so the read/write ratio for test 2 became very skewed. Note that Clustrix did not drop in read throughput at all.

Overall, you can see why every modern DBMS chooses the MVCC model for concurrency control.

Conclusion

The SQL relational model can clearly scale. The only place where MongoDB could compete with Clustrix was on pure read-only workloads, and that's just not representative of real-world application loads.

Building a scalable distributed system is more about good architecture and solid engineering than about dropping SQL or the relational model. Now that we have scale and performance out of the way, I'm going to review the other important aspects of a DBMS in my upcoming posts.

      27 comments:

      1. Could you post the actual benchmarking code so that we could try it out ourselves (and maybe compare with other DB's as well)?

      2. Just curious, did you contact the Mongo team to give them a chance at engineering their side of the solution? It seems a bit odd that you're not sharing the code to allow peer review and that you're fairly biased on the Clustrix side, being the founder.

        I think this is really interesting, and would agree in general that some people seem to think that NoSQL is the best solution for every problem (which it's clearly not), but at the same time I think it would be in the community's best interest for you to present the code for review.

      3. Your math doesn't add up.

        Using the clustrix numbers, at 140k inserts/sec for 37 seconds you get 5.1 M records. Even if that is per server, you get 51M records.

        Please explain why you cannot add?

      4. I'm going to post the benchmarks later today.

        Sammy, it's 37 minutes, not seconds.

      5. This comment has been removed by the author.

      6. Also, what version of MongoDB are you running? They are making huge improvements all the time.

        Sorry about minute vs second.

      7. Ok, I updated the post with benchmark source.

      8. This comment has been removed by the author.

      9. This comment has been removed by the author.

      10. I would suggest contacting MongoDB and telling them about the benchmarks so they can review your configuration to make sure it was properly set up. I can guarantee your Clustrix configuration is set up properly since you are an original founder. Perhaps they can shed some light on a 775% write margin.

      11. Scott,

        I think if you reread the article, then you will see that:

        1. I did manually pre-split the shard space because if I didn't do that, mongo would not actually rebalance the data under heavy load.

        2. The very first interaction that our customers have with clustrix is the initial data import. We've spent a lot of time optimizing for that case because while it's a rare workload, it's extremely important.

        3. I partitioned the space by user_id because you can see how that would be the dominant constraint on queries. Mongo allows only one shard key. If I split by say server_id (which would help the second query), then the user_id query would become a broadcast and performance would suffer.

      12. How can one come to the conclusion, "The SQL relational model can clearly scale" when the presented experiment doesn't exercise any relational features of the Clustrix database?

      13. Sergei,

        1.) Yes, I did miss that you said that things balanced after the data load, but you had to help. Did you (want to) test with pre-splitting, before you did the initial data load?

        2.) That might be true, but initial data loading is a small part of the data lifetime, in most systems.
        3.) You can specify a shard key as more than one field, just like an index can be compound.

      14. Wouldn't you need to include at least one join to have a reasonable comparison? If you choose MongoDB, then you have a document oriented data model. I'm thinking of situations like line-items on an invoice. They are totally owned by the invoice. When the invoice goes away, the line items go away. Using a traditional DB, you'd have two tables and use a join in your queries. In MongoDB, you'd include the invoice line items in the invoice document.

        To be fair, you would NOT store the customer information in the invoice document even if you were using MongoDB. It would take two round trips to get it in MongoDB but you could get it in one with a join using SQL.

      15. This is not a fair comparison. First of all, you are the founder of Clustrix. Second, from what you write, you were clearly learning how MongoDB works on the fly, which leads me to believe that you have not optimized MongoDB for your particular situation. Also, you did not bother to utilize any relational features of Clustrix vs the document oriented structure of MongoDB. Since the initial data import is one of your major concerns, did you bother to compare the mongoimport tool to the Clustrix initial data import?

      16. Your whole premise misses the point. Mostly, see Jared's complaint above, but I'll elaborate.

        No one with a brain thinks SQL, at its core, is a bad idea. However, one can easily shoot themselves in the foot. The promise of NoSQL is that you can scale your cluster without having a database admin of 10 years experience and multiple certifications around to audit your developers' poor schema choices and badly written queries. You can put a limited SQL language in front of such a system if you want, but why? No one wants to learn a separate language to query their database anyway. None of this means all the great ideas of the last 25 years should be thrown out. NoSQL developers should still be writing good concurrent code.

        Your results are still interesting. Probably most interesting to the MongoDB team. But you shouldn't pretend for a second this has anything to do with NoSQL vs SQL. In fact, realize that what you're doing here is adding fuel to the fire and stoking confusion. You, sir, are contributing to the misconception that there's a rabid, anti-SQL camp. There isn't. Stop, please.

      17. Thanks for this benchmark.
        Although I'm a MongoDB fan, I have to admit I sometimes run into the issues you describe, mostly around R/W.

        BUT, I'm wondering something: did you preallocate your files? That really matters when performing these tests...

        Thanks for answering :)

      18. So how does this test answer the features for any modern system?

        "Over and over, the following has always come up a list of must have features for any modern system:

        Incrementally Scalable Performance
        High Availability and Fault Tolerance
        Ease of Overall System Management
        "

      19. @robo: the system scaled to 10 nodes with a query throughput rate that's higher than you can get on any single node system. That's scalability. Furthermore, we performed better than Mongo in many aspects of the test. That's performance.

        This is part 1 of a series of posts that focused on performance. I'm going to address the other aspects. Stay tuned.

      20. I look forward to the next post. Thanks for providing so many details.

      21. The performance and scalability question is certainly interesting, and I don't think this comparison is trying to say that either way of approaching the problem fundamentally runs into scalability issues (obviously MongoDB has been doing work, and has more work to do, to remove performance bottlenecks). And setting aside the question of free vs. paid products, which is probably the major reason why MySQL et al. are so much more dominant than Oracle/DB2/SQL Server in the web space, it seems like a decision between a traditional RDBMS (like MySQL or Clustrix), a document store (like MongoDB or CouchDB) or a key value store (like Redis and the Memcache DB types) depends on your data set and usage characteristics.

        One thing the document stores provide are schema-less data stores. Depending on your data set, sometimes this sucks (eg. no typing of your field types and thus your ID field sometimes being a string and sometimes being a number because your client didn't cast the field appropriately; and having to waste space storing your field names in the document), but sometimes this is great (eg. adding a new field to your user object doesn't require altering your table structure which involves sync'ing code and DB ops; plus performance issues when expanding your table; plus having an unwieldy schema when you have limitless different types of fields).

        If you need to do joins, obviously key value stores and document stores suck, unless their performance is so great that multiple lookups are fast enough. However, most web applications try to stay away from joins because it's easier to scale up the number of client machines than the database machines.

        If you're using your data store to store transactional data, then obviously you're going to want a RDBMS. But again, a number of usage scenarios for web applications are not going to care about transactions or ACID compliance..

        I think the proliferation of NoSQL is really just an extension of the database movement (which even SQL database vendors like VoltDB are following) that making specialized data store systems for specialized purposes can open up a realm of undiscovered possibilities and applications..

      22. @Sergei, That's great that you found some bugs in Mongod and Clustrix sounds fantastic, but I don't see any published pricing for your setup and it seems to be a closed source, proprietary appliance.

        I don't have a big budget (actually I don't want to spend anything on software unless I get into trouble) and I want to run on commodity, general purpose boxes, with the option of cloud bursting out onto EC2 if it becomes necessary.

        Could you explain to me how your Clustrix product will make it possible for ME to run an MPP OLTP db?

        With respect, Clustrix looks like a system that can't pick on IBM DB2 MPP or Oracle 11g RAC, so it's beating up the little guys instead.

      23. Rob, if you want something for free, then Clustrix is not for you.

        IBM DB2 MPP is a data warehousing solution for analytics -- not OLTP. It competes with the likes of Teradata and Netezza.

        Oracle RAC costs at least 10x more than Clustrix, and its shared disk architecture prevents it from scaling out. Clustrix wins at scale. You don't have to take my word for it. Go talk to folks who have actually tried to scale RAC for OLTP.

        And 10gen (maker of mongo) raised over $11M so far.

      24. Sergei, I'm actually in the market for a midrange MPP DBMS system and MySQL language compliance is appealing.

        I'd love to get something for free, but I haven't yet. IBM DB2 has the PureScale option for scale-out OLTP workloads, and Netezza has actually been under IBM's ownership for the last two months.

        I have actually had to deal with maxed out 50 node RAC infrastructure in the past, so you're preaching to the converted, thanks.

        Happy to take the discussion offline - convince me.

      25. Hi Sergei

        I checked out your test harness and ran the tests myself in my lab. This is what I found:

        You have just a single shard per host on the mongodb side. It seems unlikely that this single process will be able to make the best possible use of all that iron.

        I don't have the budget you got, so I ran on:

        3x DC Xeon 2.7GHz, 2GB RAM, 1x 40GB SCSI HDD, 1x 1GbE
        Ubuntu 10.10 32bit
        MongoDB 1.6.5 GA, same as you

        One of my systems is running Ganglia and is running the C++ test harness, in addition to MongoDB.

        I ran your tests with two shards per system (one for each core :D) for around 8 hours.

        I got a steady 2000 reads/sec and 400 writes/sec per shard. Aggregated, that's 12000 reads/sec and 2400 writes/sec on my $200 setup.

        I am quite happy with that, even if it is artificial workloads.

        I wonder what the Mongod performance profile would look like if you set up your 8-way SSD systems with eight shards per host instead of one?

        TIA

      26. I think it was actually a good comparison. Mongo collapsed under certain situations. This is something the mongo team should address.
        I wonder how cassandra compares.
        I have no idea what clustrix is, but I'm sure it's a nice sqlproxy with some memcached in between :-)
        My motivation for using "nosql" (hate the word) is the flexibility of the data model / containment, the new "join models", and the built-in map-reduce.
        Also, I've always thought it's weird to write sql for most cases. ORMs are here for a reason.

        What I would also like to know is what happens on all three platforms (mongo, cassandra, clustrix) when something goes wrong (master or slave dies, a disk failure happens, etc).

        Just read the clustrix site.. My assumptions were correct. I think it's a viable product, which reduces complexity you see often when trying to scale mysql (replication, sharding, and eventually moving to memcached)

      27. To quote you: "Once the database size exceeded that node's available memory, everything went to shit."

        This is a limit that *most* NoSQL solutions have today. They work great as long as the data fits in memory.

        -Darpan
        http://darpanetwork.blogspot.com/
