Monday, February 7, 2011

Clustrix as a Document Store: Blending SQL and JSON Documents

Many responses to my previous post claimed that my comparisons to MongoDB were unfair because MongoDB is a "document store" and Clustrix is a SQL RDBMS.  Somehow that distinction made almost any comparison invalid. In response, I spent my weekend coding up a prototype document store interface for Clustrix to demonstrate that a different data model is not in itself an architectural differentiator.

At first I thought of just creating a different front end for Clustrix; our architectural model makes this easy. However, I quickly decided against it because in many ways, it would be much more limiting than a SQL based interface (e.g. joins, flexibile aggregation, subqueries, etc.)

Instead, I extended our SQL syntax to support native operations on JSON objects. So now you can do the following:

clustrix> create table files (id int primary key auto_increment, doc json);
Query OK, 0 rows affected (0.04 sec)

clustrix> insert into files (doc) values ('{"foo": {"bar": 1}, "baz": [1,2,3,4]}');
Query OK, 1 row affected (0.00 sec)

clustrix> select id, files.doc::foo.baz from files where files.doc::foo.bar = 1;
+----+--------------------+
| id | files.doc::foo.baz |                                
+----+--------------------+
|  1 | [1,2,3,4]          |                   
+----+--------------------+
1 row in set (0.00 sec)

clustrix> create index foo_bar on files (doc::foo.bar);
Query OK, 0 rows affected (0.08 sec)

The database has native support for dealing with JSON documents. We're not simply storing text blobs inside of some column and getting them back. We're exposing the contents of the JSON document to the underlying planner and execution engine. Immediately, we get the following advantages:

  • Ability to do joins across document collections
  • Extremely powerful and flexible query language
  • Ability to index into JSON objects
  • Transactional semantics built-in

Taking it alittle further, I added a select clause modifier which instructs the database to return all row data as a JSON document, including fields which come from "a relational column." The following example shows how the database can seamlessly join between our json data type and other data types in the system, and then return the result as a JSON objects.

clustrix> select _json_ f.doc, u.doc::username
    ->   from files f
    ->   join users u on f.doc::user_id = u.user_id
    ->  where f.doc::foo.bar = 1\G
*************************** 1. row ***************************
json: {"f.doc": {"foo": {"bar": 1}, "baz": [1,2,3,4]}, "u.doc::username": "sergei"}
1 row in set (0.01 sec)

We can continue to extend the syntax. For example, adding operators to manipulate lists within JSON. Or adding optional schema checking for contents of the JSON (i.e. something along the lines of DTD for XML). I'm sure you can think of more.

An any case, one can build a system which combines the best characteristics of a document store with the power of SQL. Both models can coexist in the same database, allowing the devloper to trully choose the data model which best suits his or her needs.

4 comments:

  1. Sergei, interesting stuff!

    Just curious regarding your last sql statement. What happens if it would return more than 1 row? would it simply return 1 document per row? It would be really useful to be able to aggregate the entire resultset into one single document.

    ReplyDelete
  2. Yes, it returns a row per document. What you propose is possible, but it seems impractical. For large result sets, you end up throwing around very large documents.

    ReplyDelete
  3. Great stuff for us in oracle universe, Thanks Sergei

    ReplyDelete
  4. Very interesting, Sergei. For Roland and people with similar needs, maybe an optional AS clause can be specified to output multiple rows as a single document?

    ReplyDelete