Friday, November 4, 2011

Distributed Database Architectures: Distributed Storage / Centralized Compute

In my previous post I wrote about shared disk architectures and the problems they introduce. It's common to see comparisons between shared disk and shared nothing architectures, but that distinction is too coarse to capture the differences between various shared nothing approaches.

Instead, I'm going to characterize the various "shared-nothing" style systems by their query evaluation architectures. Most systems fall into one of the following buckets:
  • Centralized compute
  • Limited distributed compute
  • Fully distributed compute

Centralized Compute: MySQL Cluster

MySQL Cluster consists of two basic roles used for servicing user queries: a compute role and a storage/data role. The compute node is the front end which takes in the query, plans it, and executes it. The compute node will communicate with the storage nodes remotely to fetch any data relevant to the query.

In the distributed storage model, data is no longer shared between the nodes at the page level. Instead, the storage nodes expose a higher level API which allows the compute node to fetch row ranges based on the available access paths (i.e. indexes).
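As a rough illustration, here is a minimal sketch of what such a storage-node API might look like. The class and method names are hypothetical, not MySQL Cluster's actual (NDB) API:

```python
# Hypothetical sketch of the storage/compute split. StorageNode and
# ComputeNode are illustrative names, not MySQL Cluster's real API.

class StorageNode:
    def __init__(self, rows):
        # rows: list of dicts; build a toy index on acol
        self.rows = rows
        self.index = {}
        for row in rows:
            self.index.setdefault(row["acol"], []).append(row)

    def fetch_by_index(self, value):
        # Higher-level API: return whole rows matching an indexed
        # predicate, instead of exposing raw disk pages.
        return list(self.index.get(value, []))

class ComputeNode:
    def __init__(self, storage_nodes):
        self.storage_nodes = storage_nodes

    def count_where(self, a, b):
        # SELECT count(*) WHERE acol = a AND bcol = b:
        # the indexed predicate is pushed to storage; the remaining
        # filtering and the aggregation happen here, on one node.
        total = 0
        for node in self.storage_nodes:
            for row in node.fetch_by_index(a):
                if row["bcol"] == b:
                    total += 1
        return total
```

Note that every row matching the indexed predicate crosses the network, even though the compute node only needs a count.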

In such a system, storage-level locks associated with the data are managed exclusively by the storage node itself. A compute node does not cache any data; instead, it always asks the set of storage nodes responsible for the data. This solves the cache coherence overhead problem, but the architecture still suffers from extensive data movement and centralized query evaluation.

Measured against the three shared-disk problems:

  • Cache coherence overhead: solved
  • Extensive data movement: still a problem
  • Centralized query evaluation: still a problem

MySQL Query Evaluation in Action

Consider the following example:

SELECT count(*) FROM mytable WHERE acol = 1 AND bcol = 2

Assume:
  • an index over acol
  • 10% of the rows in the table match acol = 1
  • 3% of the rows match acol = 1 and bcol = 2
  • a total table size of 1 billion rows

In the diagram above, the arrows represent the flow of data through the system. As you can see, most of the query evaluation in the example is done by a single compute node. The index on acol selects 100 million rows, all of which must be shipped to the compute node, where a single machine performs the bcol = 2 filtering and the aggregate count.
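The arithmetic behind those numbers, using the assumptions stated above (1 billion rows, 10% index selectivity, 3% combined selectivity):

```python
# Data-movement arithmetic for the example query.
TABLE_ROWS = 1_000_000_000

rows_shipped = TABLE_ROWS * 10 // 100   # selected by the acol index,
                                        # all shipped to the compute node
rows_matching = TABLE_ROWS * 3 // 100   # survive the bcol = 2 filter

print(rows_shipped)    # rows that cross the network
print(rows_matching)   # rows that actually count toward the result
```

One hundred million rows move across the cluster to produce a single integer.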

It's an improvement over a shared disk system, but it still has some serious limitations. Such a system can be well suited for simple key access (i.e. queries that touch a few specific rows), but any more complexity will generally result in poor performance.

As with the shared disk system, adding more nodes will not help improve single query execution, and queries which operate over large volumes of data have the potential to saturate the message bus between the nodes.

Distributed Database Architectures: Shared Disk

One of the most common questions I get about Clustrix is "How is Clustrix different than Database X?" Depending on who asks the question, Database X tends to be anything from Oracle RAC to MySQL Cluster. So I decided to put together a primer on different types of Distributed Database Architectures, and what each one means for real world applications.

Shared Disk Databases

Shared disk databases fall into a general category where multiple database instances share some physical storage resource. With a shared disk architecture, multiple nodes coordinate access to a shared storage system at a block level.

Adding clustering to a stand-alone database through a shared disk cache.

Several of the older generation databases fall into this category, including Oracle RAC, IBM DB2 pureScale, Sybase, and others.

These systems all started out as single instance databases. The easiest way to add clustering to an existing database stack was to share the storage system between multiple independent nodes. All of these databases already did paged disk access through a buffer manager cache. Make the caches talk to each other to manage concurrency, and bam, you have a distributed database!

Most of the database stack remains the same. You have the same planner, the same optimizer, the same query execution engine. It's a low risk way to extend the database to multiple nodes. You can add more processing power to the database, keep data access transparency in the application layer, and get some amount of fault tolerance.

But such systems also have serious limitations which prevent them from gaining widespread adoption, especially for applications which require scale. They simply don't scale for most workloads (see Scaling with Sharding below), and they are extremely complex to administer.

Almost every new distributed database system built in the last 10 years has embraced a different architecture, mainly because the shared disk model has the following problems:
  • Cache coherence overhead
  • Extensive data movement across the cluster
  • Centralized query execution (only a single node participates in query resolution)

Cache Coherence and Page Contention

Let's assume the following workload as the first example:
DB 1: UPDATE mytable SET mycol = mycol + 1 WHERE mykey = X
DB 2: UPDATE mytable SET mycol = mycol + 1 WHERE mykey = Y

Now let's further assume that some of the keys for that update statement will share a page on disk. So we don't have contention on the same key, but we do have some contention on the same disk page.

Page contention in shared-disk systems.

Both nodes receive similar update queries. Both nodes must update the same physical page. But in order to do so safely, they must ensure that all copies of the page are consistent across the cluster.

Managing such consistency comes at a cost. Consider some of the requirements which have to be satisfied:
  • Every node must acquire a page lock in order to read from or write to a page. In a clustered environment, such locking results in communication between nodes.
  • If a node acquires a write lock on a page and modifies it, it must notify all other nodes so they can invalidate their cached copies.
  • If a node receives a page invalidation request, it must re-fetch the latest copy of the page, which results in still more network communication.
  • Each node caches the contents of the entire data set, so the effective cache size of the cluster is no larger than the cache of a single node. Adding more nodes does not increase effective cache capacity.

Now imagine adding more nodes to your cluster. You end up with a system whose message complexity and data movement grow non-linearly with cluster size. It's common to see such systems struggle beyond 4 nodes on typical OLTP workloads.
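A toy model makes the growth concrete. Assume each write to a shared hot page costs one lock round-trip plus an invalidation and a re-fetch for every peer node; this is a deliberate simplification, not any real coherence protocol:

```python
# Toy model of cache-coherence traffic on a hot page. The cost model
# (1 lock round-trip + 1 invalidation + 1 re-fetch per peer) is an
# illustrative assumption, not a real protocol.

def coherence_messages(nodes, writes_per_node):
    per_write = 1 + 2 * (nodes - 1)   # lock + (invalidate + re-fetch) per peer
    return nodes * writes_per_node * per_write

for n in (2, 4, 8):
    print(n, coherence_messages(n, 1000))
```

Doubling the node count roughly quadruples the traffic in this model: each node adds its own writes, and each write touches one more peer.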

Data Movement

Consider another example which poses problems for shared disk systems:

SELECT sum(mycol) FROM mytable WHERE something = x

Let's assume that mytable contains 1 billion rows and that the predicate something = x matches only 1,000 of them. With a shared disk system, if the predicate results in a table scan, the entire contents of the 1-billion-row table must be transferred to the node evaluating the query!

And from the previous example, we see that it's not just a matter of data movement across the SAN infrastructure. The system must also maintain cache coherence, which means lots of cache management traffic between all the nodes. Such queries on large data sets can bring the whole cluster to its knees.
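To put rough numbers on the scan above, assume an average row size of 100 bytes (an assumption for illustration, not from the text):

```python
# Rough data movement for the table scan, under an assumed ~100-byte row.
ROW_BYTES = 100
TABLE_ROWS = 1_000_000_000
RESULT_ROWS = 1_000

scan_bytes = TABLE_ROWS * ROW_BYTES     # whole table crosses the SAN
needed_bytes = RESULT_ROWS * ROW_BYTES  # what the query actually uses

print(scan_bytes // 10**9, "GB moved by the table scan")
print(needed_bytes // 10**3, "KB actually needed")
```

Under these assumptions the cluster moves a million times more data than the query needs.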

Centralized Query Resolution

It's also worthwhile to point out that only a single node within the cluster can participate in query evaluation. So even if you throw more nodes at your system, it's not generally going to help you speed up that slow query. In fact, adding more nodes may slow down the system as the cost of managing cache coherence increases in a larger cluster.

Scaling with Sharding

It's possible to scale with shared disk, but it requires an extensive engineering effort. The application must be written (a) to keep affinity of data accesses to nodes and (b) to keep each shard within its own data silo. You lose transparency of data access at the application level: the app can no longer send any query to any node without risking a catastrophic performance problem.
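A minimal sketch of the key-affinity routing the application has to take on; the names and the hash-based placement are illustrative, not any particular product's API:

```python
import zlib

# Deterministic key -> shard affinity. CRC32 is used because it is
# stable across runs (Python's built-in hash() is salted for strings).
def shard_for(key, num_shards):
    return zlib.crc32(str(key).encode()) % num_shards

class ShardedCluster:
    def __init__(self, num_shards):
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, key, value):
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key):
        # The application must route every access to the owning shard;
        # sending it anywhere else would miss (or, in shared disk,
        # thrash the page caches).
        return self.shards[shard_for(key, len(self.shards))].get(key)
```

Every query path in the application now has to know the sharding function, which is exactly the lost transparency described above.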

While you end up with a very expensive means of implementing sharding, there is some advantage to this approach over regular sharding. In case of node failure, you can more easily bring in a hot standby without having to keep a dedicated slave for each shard. But in practice, getting fault tolerance right with SAN-based shared disk infrastructure is no easy task -- just ask anyone who manages SANs for a living.