Friday, November 4, 2011

Distributed Database Architectures: Shared Disk

One of the most common questions I get about Clustrix is "How is Clustrix different than Database X?" Depending on who asks the question, Database X tends to be anything from Oracle RAC to MySQL Cluster. So I decided to put together a primer on different types of Distributed Database Architectures, and what each one means for real world applications.

Shared Disk Databases

Shared disk databases fall into a general category where multiple database instances share some physical storage resource. With a shared disk architecture, multiple nodes coordinate access to a shared storage system at a block level.

Adding clustering to a stand alone database through a shared disk cache.

Several of the older generation databases fall into this category, including Oracle RAC, IBM DB2 pureScale, Sybase, and others.

These systems all started out as single instance databases. The easiest way to add clustering to an existing database stack would be to share the storage system between multiple independent nodes. All of these databases already did paged disk access through a buffer manager cache. Make the caches talk to each to manage concurrence, and bam, you have a distributed database!

Most of the database stack remains the same. You have the same planner, the same optimizer, the same query execution engine. It's a low risk way to extend the database to multiple nodes. You can add more processing power to the database, keep data access transparency in the application layer, and get some amount of fault tolerance.

But such systems also have serious limitations which prevent them from getting very wide spread adoption, especially for applications which require scale. They simply don't scale for most workloads (footnote: see Scale w/ Sharing below), and they are extremely complex to administer.

Almost every new distributed database system built in the last 10 years has embraced a different architecture, mainly because the shared disk model has the following problems:
  • Cache coherence overhead
  • Extensive data movement across the cluster
  • Centralized query execution (only a single node participates in query resolution)

Cache Coherence and Page Contention

Let's assume the following workload as the first example:
DB 1: UPDATE mytable SET mycol = mycol + 1 WHERE mykey = X
DB 2: UPDATE mytable SET mycol = mycol + 1 WHERE mykey = Y

Now let's further assume that some of the keys for that update statement will share a page on disk. So we don't have contention on the same key, but we do have some contention on the same disk page.

Page contention in shared-disk systems.

Both nodes receive similar update queries. Both nodes must update the same physical page. But in order to do it safely, they must insure that all copies of the page are consistent across the cluster.

Managing such consistency comes at a cost. Consider some of the requirements which have to be satisfied:
  • Every node must acquire a page lock in order to read or write from the page. In a clustered environment, such locking results in communication between nodes.
  • If a node acquires a write lock on a page and modifies it, it must notify any other node to invalidate their caches.
  • If a node gets a page invalidation request, it must re-fetch the latest copy of the page, which also results in more network communication.
  • Each node caches the contents of the entire data set. It means that the effective cache size of the cluster is as large as the cache on any of the nodes. Adding more nodes does not scale cache efficiency.

Now imagine adding more nodes to your cluster. You end up with a system which has non-linear scaling in message complexity and data movement.  It's common to see such systems struggle beyond 4 nodes on typical OLTP workloads.

Data Movement

Consider another example which poses problems for shared disk systems:

SELECT sum(mycol) FROM mytable WHERE something = x

Let's assume that mytable contains 1 Billion rows and that the something = x  leaves us 1,000 rows. With a shared disk system, if the predicate results in a table scan, the entire contents of the 1B row table must be transferred to the node evaluating the query!  

And from the previous example, we see that it's not just a matter of data movement across the SAN infrastructure. The system must also maintain cache coherence, which means lots of cache management traffic between all the nodes. Such queries on large data sets can bring the whole cluster to its knees.

Centralized Query Resolution

It's also worth while to point out that only a single node within the cluster can participate in query evaluation. So even if you throw more nodes at your system, it's not generally going to help you speed up that slow query. In fact, adding more nodes may slow down the system as the cost of managing cache coherence increases in a larger cluster.

Scaling with Sharding

It's possible to scale with shared disk, but it requires an extensive engineering effort. The application is written (a) to keep affinity of data accesses to nodes and (b) to keep each shard within its own data silo. You lose transparency of data access at the application level. The app can no longer send any query to any one of the nodes, or it will result in a catastrophic performance problem.

While you end up with a very expensive means of implementing sharding, there is some advantage to this approach over regular sharding. In case of node failure, you can more easily bring in a host standby without having to keep a dedicated slave for each shard. But in practice, getting fault tolerance right with a SAN based shared disk infrastructure is no easy task -- just ask anyone who manages SANs for a living.

    No comments:

    Post a Comment