Monday, February 28, 2011

Profile Driven Performance Optimization

Recently, one of our support engineers noticed that a customer cluster got a little sluggish during a routine maintenance operation. In looking back at historical data, we noticed that the cluster saw a decrease in transaction throughput. The system should have done its cleanup in the background at a low priority, but that didn't happen. I thought it would be interesting to describe our process for addressing a performance problem like this one. I will also share the tools we developed and used.

Reproduce, Measure, and Automate

The symptom our support engineer noticed was a "sluggish" system. More accurately, we saw that query latencies increased and overall system throughput dropped. So our first step was to reproduce the problem under some approximation of the customer's workload. It was a bit tricky to get all of the prerequisite conditions just right, but we finally landed on the right set of circumstances.

Once we reproduced the issue, the next step was to automate the test to ensure that we could consistently demonstrate the problem. It boiled down to using one of our standard performance harnesses with a couple of tweaks.

The Profile

Our most commonly used profiling tool is based on the Linux oprofile infrastructure, but we had to modify it to better suit our system. Why? Because substantial portions of our codebase are written in an event-driven, continuation-passing style. Call graph reporting in oprofile is highly dependent on the stack, and the C stack often does not contain enough information to provide a valuable analysis in our system.

To get around this problem, we modified oprofile to gather additional information from the system. The key concept was adding a method for tagging a series of continuation calls and getting oprofile to sample these tags. The tag itself includes enough information to map the executing code to a logical system module hierarchy. The screen shot below shows an example output from one of our reporting tools.

Coming back to our original problem, the screen shot shows the output of the profile analysis tool during the dip in performance. The tool allows us to isolate the report to a specific processor. In this case we're looking at the output for our management core.

Interpreting the results we see that:
  • The execution engine dominates CPU time, and most of that time is spent in the storage engine
  • The storage engine spends 63% of all its cycles on CRC32 calculations
  • Our particular maintenance task was erroneously bound to the management core
  • The task should have a lower priority

CRC Performance

The root cause of the performance degradation was a scheduling issue. However, it caused me to take a closer look at our CRC algorithm. We use the CRC implementation from zlib, and it should be reasonably well optimized. Whipping up a quick benchmark, we see that the zlib implementation takes about 41us to checksum a 32k block. In the test, we were reading about 224MB/s per node. That's roughly 294ms of checksum work every second. No wonder we saw a performance drop.
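That back-of-the-envelope math is easy to reproduce. Here's a rough benchmark sketch in Python using the zlib binding (absolute numbers will differ from our C harness, but the arithmetic is the same):

```python
import timeit
import zlib

BLOCK = bytes(range(256)) * 128   # one 32 KiB block of repeating data
N = 1000

elapsed = timeit.timeit(lambda: zlib.crc32(BLOCK), number=N)
us_per_block = elapsed / N * 1e6

# At 224 MB/s per node we checksum 224 * 1024 / 32 = 7168 blocks per second;
# at ~41 us per block that is roughly 294 ms of CPU spent on CRCs each second.
blocks_per_sec = 224 * 1024 // 32
crc_ms_per_sec = blocks_per_sec * 41 / 1000

print(f"{us_per_block:.1f} us per 32 KiB block")
print(f"{blocks_per_sec} blocks/s -> {crc_ms_per_sec:.0f} ms of CRC per second")
```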

I looked around and found a paper from Intel describing their slice-by-eight CRC approach. A quick benchmark revealed that the same 32k checksum could be done in 27us -- about a 34% improvement over zlib's version.
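For the curious, the structure of slice-by-eight is simple enough to prototype. The sketch below is my own Python reconstruction of the idea from the Intel paper (table names are mine), checked against zlib's answer. In Python it is of course slower than byte-at-a-time C; the point is only the table layout that lets C code fold eight input bytes per iteration.

```python
import zlib

POLY = 0xEDB88320  # reflected CRC-32 polynomial, same one zlib uses

# Table 0 is the classic byte-at-a-time table; table n extends a byte by
# n additional zero bytes, which is what allows folding 8 bytes per step.
T = [[0] * 256 for _ in range(8)]
for i in range(256):
    c = i
    for _ in range(8):
        c = (c >> 1) ^ (POLY if c & 1 else 0)
    T[0][i] = c
for n in range(1, 8):
    for i in range(256):
        T[n][i] = (T[n - 1][i] >> 8) ^ T[0][T[n - 1][i] & 0xFF]

def slice8_crc32(data: bytes) -> int:
    crc = 0xFFFFFFFF
    i, n = 0, len(data)
    while n - i >= 8:
        # Fold the first 4 bytes into the running CRC, then combine all
        # 8 bytes with one lookup in each of the 8 tables.
        crc ^= int.from_bytes(data[i:i + 4], "little")
        crc = (T[7][crc & 0xFF] ^ T[6][(crc >> 8) & 0xFF] ^
               T[5][(crc >> 16) & 0xFF] ^ T[4][crc >> 24] ^
               T[3][data[i + 4]] ^ T[2][data[i + 5]] ^
               T[1][data[i + 6]] ^ T[0][data[i + 7]])
        i += 8
    while i < n:  # byte-at-a-time tail
        crc = T[0][(crc ^ data[i]) & 0xFF] ^ (crc >> 8)
        i += 1
    return crc ^ 0xFFFFFFFF

assert slice8_crc32(b"123456789") == zlib.crc32(b"123456789")
```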

With Nehalem-based processors, Intel introduced an SSE4.2 instruction for hardware-based CRC computation. I haven't had a chance to conduct my own benchmarks, but published results suggest that we could get a 2-3x speedup over slice-by-eight. One caveat: the instruction computes CRC-32C (the Castagnoli polynomial) rather than the polynomial zlib uses, so adopting it would change our checksum format.

Source Code

I'm releasing the source for the tools referenced in this post. The user-space oprofile daemon replacement requires our database runtime, which I am not making public.

    Monday, February 7, 2011

    Clustrix as a Document Store: Blending SQL and JSON Documents

    Many responses to my previous post claimed that my comparisons to MongoDB were unfair because MongoDB is a "document store" and Clustrix is a SQL RDBMS.  Somehow that distinction made almost any comparison invalid. In response, I spent my weekend coding up a prototype document store interface for Clustrix to demonstrate that a different data model is not in itself an architectural differentiator.

    At first I thought of just creating a different front end for Clustrix; our architectural model makes this easy. However, I quickly decided against it because in many ways it would be much more limiting than a SQL-based interface (e.g. joins, flexible aggregation, subqueries, etc.).

    Instead, I extended our SQL syntax to support native operations on JSON objects. So now you can do the following:

    clustrix> create table files (id int primary key auto_increment, doc json);
    Query OK, 0 rows affected (0.04 sec)

    clustrix> insert into files (doc) values ('{"foo": {"bar": 1}, "baz": [1,2,3,4]}');
    Query OK, 1 row affected (0.00 sec)

    clustrix> select id, files.doc::foo.baz from files where doc::foo.bar = 1;
    +----+--------------------+
    | id | files.doc::foo.baz |
    +----+--------------------+
    |  1 | [1,2,3,4]          |
    +----+--------------------+
    1 row in set (0.00 sec)

    clustrix> create index foo_bar on files (doc::foo.bar);
    Query OK, 0 rows affected (0.08 sec)

    The database has native support for dealing with JSON documents. We're not simply storing text blobs inside of some column and getting them back. We're exposing the contents of the JSON document to the underlying planner and execution engine. Immediately, we get the following advantages:

    • Ability to do joins across document collections
    • Extremely powerful and flexible query language
    • Ability to index into JSON objects
    • Transactional semantics built-in
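To make the semantics concrete, here is a toy sketch (mine, not the Clustrix implementation) of how a doc::a.b.c path expression can be evaluated against a stored document, returning SQL NULL when a step is missing:

```python
import json

def eval_path(doc_text: str, path: str):
    """Evaluate a doc::a.b.c style path against a stored JSON document.
    Returns None (SQL NULL) when any step along the path is missing."""
    node = json.loads(doc_text)
    for step in path.split("."):
        if not isinstance(node, dict) or step not in node:
            return None
        node = node[step]
    return node

row = '{"foo": {"bar": 1}, "baz": [1,2,3,4]}'
assert eval_path(row, "foo.bar") == 1
assert eval_path(row, "baz") == [1, 2, 3, 4]
assert eval_path(row, "foo.missing") is None
```

The real engine compiles these accesses rather than reparsing the document per row, which is what makes them usable in where clauses and index definitions.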

    Taking it a little further, I added a select clause modifier which instructs the database to return all row data as a JSON document, including fields that come from a relational column. The following example shows how the database can seamlessly join between our json data type and other data types in the system, and then return the result as a single JSON object.

    clustrix> select _json_ f.doc, u.doc::username
        ->   from files f
        ->   join users u on f.doc::user_id = u.user_id
        ->  where f.doc::foo.bar = 1\G
    *************************** 1. row ***************************
    json: {"f.doc": {"foo": {"bar": 1}, "baz": [1,2,3,4]}, "u.doc::username": "sergei"}
    1 row in set (0.01 sec)
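The _json_ modifier itself is just a formatting rule: key each output value by the text of its select expression and emit one object per row. A minimal sketch of that rule (my own illustration, not the server code):

```python
import json

def as_json_row(exprs, values):
    """Mimic the _json_ modifier: key each output value by the text of
    its select expression and emit one JSON object per result row."""
    return json.dumps(dict(zip(exprs, values)), separators=(", ", ": "))

# The join above produces one row with a document column and a path value.
exprs = ["f.doc", "u.doc::username"]
values = [{"foo": {"bar": 1}, "baz": [1, 2, 3, 4]}, "sergei"]
print(as_json_row(exprs, values))
```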

    We can continue to extend the syntax. For example, adding operators to manipulate lists within JSON. Or adding optional schema checking for contents of the JSON (i.e. something along the lines of DTD for XML). I'm sure you can think of more.

    In any case, one can build a system which combines the best characteristics of a document store with the power of SQL. Both models can coexist in the same database, allowing the developer to truly choose the data model which best suits his or her needs.

    Wednesday, February 2, 2011

    Sierra and the Clustrix Database Stack

    A lot of the responses to my previous posts criticized my choice of comparing a document database like MongoDB to a relational database like Clustrix. I tried to examine aspects of the database architecture which have nothing to do with the data model, but somehow the data model would always come up. There is a set of concerns that's common to all database systems, whether you are a document store or a relational database.

    But first, think about how your database would handle the following workload. Choose whatever data model you find best suited for my use case.

    get me all the records from foo where a = ? and b = ?
    • foo has an index over A and another index over B.
    • A and B have non-uniform data distributions: A is 90% value "X" and B is 90% value "Y".
    • We have 1 billion records in foo.
    • The database has a choice: index A, index B, or scan and filter.
    • 50% of the queries are a = X and b = Z, the other 50% are a = Z and b = Y
    What does your database do?

    If the database always chooses index A or index B for all queries, then it ends up examining 900 million rows 50% of the time.

    We'll come back to my question later. First, I will briefly describe what the database stack in Clustrix looks like. I need to set a foundation so that my next set of posts makes sense.

    To anyone who has experience with database systems, the stack should look very familiar. We have fairly strict abstraction of interfaces between the various portions of the stack.

    The Protocol Handler and Query Parser are responsible for taking in user connections and translating SQL into our intermediate representation called Sierra. We can actually support multiple dialects at these two layers; the constraint is that we must be able to express the query language in Sierra. And Sierra is much more expressive than SQL. The value of Sierra is that it provides an extensive planner framework for reasoning about distributed database queries.

    So the Planner/Optimizer only accepts Sierra. It runs through a search space of possible plans and prunes them based on cost estimates. After coming up with the best plan candidate, it translates the plan into another intermediate representation used by the Distributed Compiler, which reasons out the physical execution plan for our query. Finally, we compile the query into machine code and execute it.

    As a performance optimization, we cache the compiled programs and plans. Clustrix does not need to optimize and compile every query, and you don't need to use prepared statements to get this behavior.

    Back to my question. What did you come up with? Well, I can tell you what happens in Clustrix. During the planning phase of Sierra, we examine the various statistics that the database keeps about the data distribution in our indexes. The statistics include tracking:
    • Number of distinct values across index columns
    • Quantile distributions
    • Hotlist tracking top n values within a column

    Clustrix correctly chooses the plan which will result in the least number of rows examined for all input parameters.
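As an illustration of how such statistics drive the choice (this is a toy model of my own, not our planner), consider estimating the rows each access path would examine from hotlist-style frequencies and taking the cheapest:

```python
ROWS = 1_000_000_000

# Hotlist-style stats: fraction of the table holding each frequent value.
# Values not on the hotlist are assumed rare and uniform over the rest.
stats = {
    "a": {"X": 0.90},
    "b": {"Y": 0.90},
}

def est_rows(column, value, distinct_rare=1_000_000):
    """Estimate rows matched by column = value using the hotlist."""
    freq = stats[column].get(value)
    if freq is None:  # not a hot value: spread the remainder uniformly
        freq = (1 - sum(stats[column].values())) / distinct_rare
    return freq * ROWS

def choose_plan(a_val, b_val):
    """Pick the access path that examines the fewest rows."""
    candidates = {
        "index_a": est_rows("a", a_val),
        "index_b": est_rows("b", b_val),
        "table_scan": ROWS,
    }
    return min(candidates, key=candidates.get)

assert choose_plan("X", "Z") == "index_b"  # a is hot, b is selective
assert choose_plan("Z", "Y") == "index_a"  # b is hot, a is selective
```

With the stats in hand, each half of the workload gets the selective index; without them, a fixed choice of either index scans 900 million rows half the time.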

    Now, don't get me wrong. I am not claiming that data distribution statistics are unique to Clustrix. On the contrary, they are very common in any modern RDBMS. I'm using it as an example of a requirement that's independent of any data model, and it's actually a very important feature to have.

    Tuesday, February 1, 2011

    MongoDB vs. Clustrix Comparison: Part 2


    In Part 1 of my comparison, I ran some performance benchmarks to establish that relational systems can scale performance. In this post I would like to focus more on the High Availability and Fault Tolerance aspects of the two systems. The post will go over the approach of each system and what it means for fault tolerance and availability. I will also conduct a test of the claims: I'm going to fail a node by pulling power to see what happens.

    A Primer on Clustrix Data Distribution

    Clustrix has a fine grained approach to data distribution. The following graphic demonstrates the basic concepts and terminology used by our system. Notice that unlike MongoDB (and many other systems for that matter), Clustrix applies a per-index distribution strategy.

    There are many interesting implications for query evaluation and execution in our model, and the topic deserves its own set of posts. For the curious, you can get a brief introduction to our evaluation model from our white paper on the subject. For this post, I'm going to stick to how our model applies to fault tolerance and availability.

    You can find documentation for MongoDB's distribution approach on their website. In brief, MongoDB chooses a single distribution key for the collection. Indexes are co-located with the primary key shard.

    Clustrix Fault Tolerance and Availability Demo
    Fault Tolerance

    Both Clustrix and MongoDB rely on replicas for fault tolerance. The loss of a node is the loss of some copies of data, which can be found elsewhere in the system. The MongoDB team put together a good set of documentation describing their replication model. Perhaps one of the most salient differences between the two approaches is the granularity of data distribution. The unit of recovery in Clustrix is the replica (a small portion of an index), while the unit of recovery in MongoDB is a full instance of a Replica Set.

    For Clustrix, this means that the reprotection operation happens in a many-to-many fashion. Several nodes copy small portions of data from each of their disks to several other nodes onto many disks. The advantages of this approach are:

    • No single disk in the system becomes overloaded with writes or reads
    • No single node hotspot for driving the reprotect work
    • Incremental progress toward full protection
    • Independent replica factors for each index (e.g. primary key 3x, indexes 2x)
    • Automatic reprotection which doesn't require operator intervention
    • All replicas are always consistent
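The many-to-many idea can be sketched in a few lines. The following toy planner (my own illustration; node numbers and replica names are made up) assigns each lost copy a surviving source and a distinct target, rotating through the cluster so no single disk drives the reprotect:

```python
import itertools

def plan_reprotect(replicas, failed, nodes):
    """For each replica that lost a copy on the failed node, pick a
    surviving source and a new target node, rotating through the
    cluster so no single disk becomes the bottleneck."""
    targets = itertools.cycle(sorted(nodes - {failed}))
    moves = []
    for name, copies in replicas.items():
        if failed not in copies:
            continue
        source = sorted(n for n in copies if n != failed)[0]
        target = next(t for t in targets if t not in copies)
        moves.append((name, source, target))
    return moves

# 4 nodes, each replica held on 2 of them; node 2 fails.
replicas = {
    "idx_a.r1": {1, 2}, "idx_a.r2": {2, 3},
    "idx_b.r1": {2, 4}, "idx_b.r2": {1, 3},
}
moves = plan_reprotect(replicas, failed=2, nodes={1, 2, 3, 4})
print(moves)  # sources and targets are spread across surviving nodes
```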

    One of the interesting aspects of the system is the complete automation of every recovery task. It's built in; I don't have to do anything to make it happen. So if I have a 10 node system and a node fails, in an hour or so I will have a completely protected 9 node system without any operator intervention at all. When the 10th node comes back, the system will simply perceive a distribution imbalance and start moving data back onto that node.

    While the Replica Sets feature in MongoDB is nicer than replication in, say, MySQL, it's still highly manual. So in contrast with the above list for Clustrix, for MongoDB we have:
    • Manual intervention to recover from failure
    • The data is moved in a one-to-one fashion
    • All data within a Replica Set has the same protection factor
    • Failures can lead to inconsistency


    Both systems rely on having multiple copies of data for Availability. I've seen a lot of interesting discussion recently about the CAP theorem and what it means for real-world distributed database systems. It's another deep topic which really deserves its own set of posts, so I'll simply link to a couple posts on the subject which I find interesting and illuminating:
    At Clustrix, we think that Consistency, Availability, and Performance are much more important than Partition tolerance. Within a cluster, Clustrix maintains availability in the face of node loss while keeping strong consistency guarantees, but we do require that more than half of the nodes in the cluster group membership are online before accepting any user requests. So a cluster provides fully ACID-compliant transactional semantics while keeping a high level of performance, but you need a majority of the nodes online.
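The membership rule itself is strict majority. A sketch, with the group size and online count as the only inputs:

```python
def quorum(online: int, total: int) -> bool:
    """Accept user requests only while a strict majority of the cluster
    group membership is online; exactly half is not enough."""
    return 2 * online > total

assert quorum(6, 10)        # 6 of 10 nodes online: keep serving traffic
assert not quorum(5, 10)    # exactly half is not a majority: stop
assert quorum(2, 3)         # a 3-node cluster tolerates one node loss
```

Requiring a strict majority is what prevents two halves of a partitioned cluster from both accepting writes and diverging.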

    However, Clustrix also offers a lower level of consistency in the form of asynchronous replication between clusters. So if you want to set up a disaster recovery target in another physical location over a high-latency link, we can accommodate that mode. It simply means that your backup cluster may be out of date by some number of transactions.

    MongoDB has relaxed consistency all around. The Replica Set itself uses an asynchronous replication model. The MongoDB guys are upfront about the kinds of anomalies they expose: the end user gets the equivalent of read uncommitted isolation. Mongo's claim is that they do this because (1) they can achieve higher performance, and (2) "merging back old operations later, after another node has accepted writes, is a hard problem." Yes, distributed protocols are hard, but that doesn't mean you should punt on them.

    Availability Continued

    There's also a more nuanced discussion to availability. One of the principal design features of Clustrix has been to aim for lock-free operation whenever possible. We have Multi-Version Concurrency Control (MVCC) deeply ingrained in the system. It allows a transaction to see a consistent snapshot of the database without interfering with writes. So a read in our system will not block a write.
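A toy version of MVCC makes the "reads never block writes" property concrete. The sketch below (an illustration only, not our storage engine) keeps a version list per key; a reader pinned at a snapshot sees the last version at or before it while writers keep appending:

```python
import bisect

class MVCCStore:
    """Toy multi-version store: writers append new versions, and a reader
    pinned at snapshot S sees the latest version with txn id <= S."""

    def __init__(self):
        self.versions = {}   # key -> sorted list of (txn_id, value)
        self.clock = 0       # monotonically increasing transaction id

    def write(self, key, value):
        self.clock += 1
        self.versions.setdefault(key, []).append((self.clock, value))
        return self.clock

    def read(self, key, snapshot):
        vs = self.versions.get(key, [])
        i = bisect.bisect_right([txn for txn, _ in vs], snapshot)
        return vs[i - 1][1] if i else None

store = MVCCStore()
store.write("k", "v1")
snap = store.clock           # a long-running read pins this snapshot
store.write("k", "v2")       # the write proceeds; the reader never blocks
assert store.read("k", snap) == "v1"          # snapshot still sees v1
assert store.read("k", store.clock) == "v2"   # a new reader sees v2
```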

    Building on top of MVCC, Clustrix has implemented a transactionally safe, lockless, and fully consistent method for moving data in the cluster without blocking any writes to that data. All of this happens completely automatically. No administrator intervention required. So when the Rebalancer decides to move a replica from Node 1 to Node 3, the replica can continue to take writes. We have a mechanism to sync changes to the source replica with the target replica without limiting the replica availability.

    Compare that to what many other systems do: read lock the source to get a consistent view for a replica copy. You end up locking out writers for the duration of the data copy. So while your data is available for reads, it is not available for writes.

    After a node failure (or to be more precise, a replica failure within a set), MongoDB advocates the following approach:
    1. Quiesce the master (read lock)
    2. Flush dirty buffers to disk (fsync)
    3. Take an LVM snapshot of the resulting files
    4. Unlock the master
    5. Move the data files over to the slave
    6. Let the slave catch up from the snapshot
    So a couple of points: (a) MongoDB is not available for writes during steps (1) and (2), and (b) it's a highly manual process. It reminds me very much of the MySQL best practices for setting up a slave.


    I've seen a lot of heated debates about Consistency, Availability, Performance, and Fault Tolerance. These issues are deeply interconnected, and it's difficult to write about any of them in isolation. Clustrix maintains a high level of performance while preserving consistency and a very high degree of availability. I know that it's possible to build such a system because we actually built it. And you shouldn't sacrifice these features in your application because you believe it's the only way to achieve good performance.