NoSQL Databases: Key-Value and Document Stores

Both key-value and document data stores have arisen to support object and document storage and retrieval on a massive scale. In these databases, the schema is either non-existent or highly dynamic, allowing the developer to remove and modify fields that are stored as in a singular container or document. This article examines the major functionality of key-value stores and document stores by exploring several popular key-value and document stores including Membase, SimpleDB, CouchDB, MongoDB, and Terrastore. This article is the second part in a three part series providing a general overview of NoSQL databases.
This is part two of a three part series:

Introduction
Key-Value and Document Stores
Extensible Record Stores (planned)

The databases covered in this article:

Key-Value stores: Membase
Document stores: SimpleDB, CouchDB, MongoDB, Terrastore

Key-Value Stores: Membase

The most trivial type of NoSQL data store is the key-value data store. Key-value stores store data values into a system that can later be recalled using a key. The values are uninterpreted arbitrary data that the application can interpret as it sees fit. This simple and completely schemaless data model allows for easy scaling and very simple APIs that can easily enhance the functionality of existing systems.

Membase is an open source data store developed to be as simple as memcached but with the ability to offer disk persistence for storing data behind interactive web applications. Membase offers a very simple key-value store that features data consistency and sustained high-throughput performance with replication and failover.

Data Model

The data model of Membase is a simple key-value pair that is schemaless using the same API supported by memcached. The client can store arbitrary data into the system and recall data from the system using a client-defined key. The data is schemaless and completely uninterpreted by the data store.

Architecture

Membase can be thought of as an extension to memcached. It offers the same API as memcached and includes memcached as a subset of its abilities. A Membase cluster allows nodes to be added and removed during live operation and provides client-initiated mechanisms for rebalancing data. The data is naturally balanced through client-side logic that evenly distributes new data.

A Membase cluster features nodes that are all identical in software, however one node is elected to operate as a cluster manager. The cluster manager performs aggregation, consensus building, and cross-node decisions.

The distribution of data to nodes is determined by a vBucket value of 1 to 1024. The vBucket value is automatically generated by a client library hashing algorithm. Each node in a Membase cluster will host one or more vBuckets and a map of vBuckets to specific nodes is maintained. When the cluster is resized, these vBuckets can easily be redistributed to the available nodes.

When a client performs a write to a key, the vBucket of the key is used to reference a map of the nodes. The node responsible for the vBucket is sent the write and the data is immediately replicated from the owner to any replicas of that node. At the same time, the value is cached in memory.

Document Stores: SimpleDB

Document data stores feature the ability to store documents into a database. Documents consist of one or more named fields that are self-contained in each document. Most document stores offer different types of fields as well as special nested fields such as lists or arrays. The structure of documents (their fields) is dynamic and can be freely modified by the client with the ability to add or remove fields of existing documents. Since the documents are self-contained, their fields are usually stored and distributed together as individual documents in the backend storage system.

SimpleDB was introduced early by Amazon as a Cloud Service in 2007. It is part of their cloud services platform that features many different cloud-related services including the very popular Amazon Elastic Compute Cloud (EC2) and Simple Storage Service (S3). SimpleDB was introduced to offer a simple scaling data store that is more powerful than S3 by being document-aware.

Data Model

SimpleDB features basic support for multiple domains of documents that contain multiple fields. Each domain is a collection of documents that is capable of supporting different forms of indexing. Domains can be independently queried for the documents and an appropriate index can be applied based on constraint conditions. The proprietary system manages automatic indexing.

Architecture

Each domain in SimpleDB may be stored on a different Amazon node. Furthermore, each domain is limited to one billion attributes and each item in a domain is limited to maximum of 256 attributes. These are very significant limitations since the maximum size of a given attribute for a given item is 1024 bytes. These limitations can be quickly exceeded by many commercial websites that create billions of new data items per month.

The limitations of SimpleDB require many clients to consider a data sharding strategy as their data grows. Unfortunately, this eliminates a major advantage that most data stores offer of a hands-off approach by requiring careful data partition strategies. Furthermore, writes to SimpleDB do not scale well as they are distributed to node replicas. Amazon also states that usage of SimpleDB will be limited and each query itself can only have a maximum of 2500 items taking no longer than 5 seconds to complete.

CouchDB

The CouchDB project was initially written in C++ and later in Erlang to be a fault-tolerant scalable database for the emerging web services. Since 2008, the project has been adopted and well-funded by the Apache Software Foundation. The CouchDB data store can be queried using javascript and the data models and views can be represented as simple JSON objects.

Data Model

The documents in a CouchDB database are part of a flat collection. Each document in CouchDB is identified by a unique id. A document can contain multiple named fields of different data type values such as strings, numbers, and dates. Complex data types of lists and associative maps are also supported.

Since the data is only semi-structured, CouchDB offers the ability to construct views which are themselves documents. A view offers a way for the client to filter and organize the documents for easier application usage. The views in CouchDB can be dynamically constructed and are considered strictly virtual in nature as they do not impact the storage or replication of underlying documents. Since views are virtual, many different views can be constructed for the same set of data, allowing for CouchDB to be flexible to the structural requirements of applications.

The views of CouchDB can be very dynamic as they are client-defined javascript functions that act as a map function in the map-reduce framework model. A view receives a document as input and can generate zero to multiple rows per document as output. The client has complete control over this view since the function itself is provided by the client.

Architecture

CouchDB describes itself as a peer-based distributed shared-nothing database system where nodes bi-directionally replicate changes to the data. The replication process involves the incremental exchange of only the data pieces that have been modified since the last replication process. If replication of a document fails, it is restarted on that document, ensuring that all data is replicated.

The data can also be replicated for offline (disconnected) database usage where connections are not possible or unreliable. Furthermore, partial replicas of data may be created through the use of selective javascript functions for additional offline processing, such as data mining of a large database, while at the same time maintaining some partial connection to the main database.

The replication of data will naturally create many conflicts for different documents. These conflicts are expected and are resolved by picking a single winning document deterministically where the document with the longest revision history wins. If a conflict of the conflict resolution exists, the same deterministic formula is used until all conflicts are resolved. Only winning documents are served through views while losing documents are eventually purged from the system.

View Indexing

The complete control of views creates significant overhead when using a view, especially when considering databases that have millions of documents. Since views are rarely modified in an application after they have been designed, CouchDB maintains fast querying of views by maintaining indexes of its views. Views are kept in a special type of document in the CouchDB system called a design document. An index of documents that are applicable to the view is maintained and entries for stale documents, documents that have been updated since the last view refresh, are updated in the order in which they appear on physical disk to minimize hardware head seeking and latency. When new documents are evaluated, they are appropriately indexed in the view.

Multiple clients may use the same view at the same time without problems. When the client uses the view, documents are fetched by the view and freshly evaluated. The old row values of the document in the view indexes are removed and the fresh evaluation will insert new values in their place depending on the view's evaluation of the document.

ACID Properties

Unlike many data stores that sacrifice some properties of ACID, CouchDB seeks to feature all Atomic, Consistent, Isolated, and Durable properties. All document insertions, updates, and deletions are serialized with the exception of binary blobs. CouchDB follows the Mutli-Version Concurrency Control (MVCC) model to ensure that Client reads are never blocked regardless of whether they are reading the same document. Furthermore, MVCC ensures that the client will see a consistent snapshot of the database from the beginning to the end of the read operation.

Each document has a unique DocID name as well as a Sequence ID. The sequence id is a sequential number that is incremented on each update and can be used to identify the incremental changes to the database. B-tree indexes are kept and updated through append-only updates as documents are saved.

The commit of CouchDB documents occurs in two stages. In the first stage, the document data and its associated index updates are synchronously flushed to the disk. In the second stage, the updated database header is written to two consecutive identical chunks (to make up the first 4kb of a file) and then flushed synchronously to the disk. If a crash occurs in the first stage, the partially flushed data is simply forgotten. If a crash occurs in the second stage, a surviving copy of the previous identical headers remain and the header area can be recovered.

MongoDB

MongoDB is a C++ data store that claims to be a scalable, high-performance, open source, document-oriented database, sponsored by the company 10gen. It features collections of documents that can be indexed and retrieved through lockless queries. MongoDB shares similarities with CouchDB but differs significantly in its usage of distributed data.

Data Model

Unlike CouchDB and its MVCC model, MongoDB offers a more traditional update-in-place method of storing data. Since MongoDB offers in-place updates, higher rates of update are possible and complexities involving conflict resolution are simpler. Documents are stored in MongoDB in BSON format, a format similar to JSON with more typed support, and MongoDB offers similar nested attributes as CouchDB.

The update-in-place nature of MongoDB also allows for partial data update and insertion. For example, a document can have an array of values that can be incremented, appended, and removed in-place without having to send the entire data set to the server. All operations on fields in MongoDB are atomic and local consistency is offered on the most up-to-date primary document copy. Conditional updates are also possible, where an update is only performed if the expected value of a field still exists.

The query model of MongoDB is far more traditional compared to other document stores. All attributes can be indexed and those indices are used to serve traditional dynamic queries using a query optimizer, similar to a traditional database management system. Unlike CouchDB, MongoDB does not necessitate the use of mapping functions in the map-reduce model, but the functionality is still available for additional data processing if desired by the client.

Architecture

MongoDB Architecture Diagram

MongoDB features replication as a means of providing failover and reliability instead of scaling as found in CouchDB. Since replication is not used for scaling, a data sharding strategy is automatically implemented by MongoDB that resembles the partition strategies of Bigtable. Data is chunked and evenly distributed to a cluster of nodes that can be dynamically expanded. Cluster metadata is stored in a two-phase commit database on config servers that have a custom replication model. Each server has a complete copy of all shard metadata and if a config server fails, the cluster goes into a read-only mode until it is recovered. It is important to note that data can still be read and written from the MongoDB during a config server failure, but additional sharding will be unavailable.

The figure illustrates the sharding architecture of MongoDB. Client requests come into Mongo routing processes that access config servers to determine the location of data and fulfil the client request. Shard keys are used to indicate where data resides and the data itself is stored order-preserving with adjacent data of the same shard key stored on the server.

Atomicity & Durability

Atomicity is provided for individual documents to support concurrent modification of single documents. Durability is provided through a traditional journaling system that can be enabled on initialization.

Terrastore

Terrastore is a document store that utilizes the Terracotta clustering technologies to build a scalable Java-based database. Data is automatically partitioned over several nodes in a master-slave arrangement with hot failover servers and automated re-balancing of nodes.

Data Model

Similar to CouchDB and MongoDB, Terrastore features JSON-based documents that can be placed into buckets to operate as collections. The documents are partitioned with a hash distributed to different Terrastore nodes. If a node becomes unbalanced or has increased load, the data can be redistributed automatically. Dynamic queries are supported in a manner similar to MongoDB and a map-reduce mechanism is available for more advanced operations on documents.

The data itself is partially structured (as a document) and guarantees single-document consistency but does not provide ACID transactions. Thus, a read will always fetch the latest version of the document from the database.

Architecture

Terrastore provides automatic partitioning of data to the many Terrastore servers in a Terrastore cluster. A client request comes into a Terrastore server where it is handled between the servers in the cluster. If the server does not own the document being requested, the request is routed to the proper server. The management of the servers in the cluster and the storage of a durable copy of the data is handled by a master server. Multiple clusters, each containing at least one master, may be combined into an ensemble for better scaling.

When a document is initially read, a copy of it is retrieved from the master and locally cached in memory on the servers. This allows for higher throughput of reads. However, every write must be performed by both the server that owns the document and the master. This potential bottleneck is mitigated by automated data redistribution and load balancing.

What's Next?

In the next article, I will examine several Extensible Record Stores (also known as Wide-Column databases) including BigTable, Cassandra, HBase.

Brownie Thoughts

Sunday, February 17, 2013