approach where you can logically partition the data-write process from access and analytics, and use a separate database for each of the two tasks.
If scalability implies large data becoming available at an incredibly fast pace, for example stock market tick data or advertisement click tracking data, then column-family stores alone may not provide a complete solution. It's prudent to store the massively growing data in these stores and manipulate it using MapReduce operations for batch querying and data mining, but you may need something more nimble for fast writes and real-time manipulation. Nothing is faster than manipulating data in memory, so NoSQL options that keep data in memory and flush it to disk as the available capacity fills are probably good choices. Both MongoDB and Redis follow this strategy. Currently, MongoDB uses mmap and Redis implements a custom mapping from memory to disk. However, both MongoDB and Redis have actively been re-engineering their memory-mapping features, and things will continue to evolve. Using MongoDB or Redis with HBase or Hypertable is a good choice for a system that needs fast real-time data manipulation alongside a store for extensive data mining. Memcached and Membase can be used in place of MongoDB or Redis. Memcached and Membase act as a fast and efficient cache layer, and therefore complement column-family stores well. Membase has been used effectively with Hadoop-based systems for such use cases. With the merger of Membase and CouchDB, a well-integrated NoSQL product with both fast cache-centric features and distributed, scalable storage-centric features is likely to emerge.
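
To make the pairing concrete, here is a minimal sketch of the pattern just described, assuming the redis-py and happybase client libraries, a local Redis server, and an HBase Thrift gateway. The key name ticks:pending, the ticks table, and its d column family are hypothetical choices for illustration: fast writes land in an in-memory Redis list, and a periodic batch job drains them into HBase, where MapReduce-style jobs can mine them.

    import json

    import happybase
    import redis

    r = redis.Redis(host="localhost", port=6379)
    hbase = happybase.Connection(host="localhost", port=9090)
    ticks = hbase.table("ticks")  # assumed to exist with column family "d"

    def record_tick(symbol, price, ts):
        # Fast path: append the tick to an in-memory Redis list.
        r.rpush("ticks:pending",
                json.dumps({"symbol": symbol, "price": price, "ts": ts}))

    def drain_to_hbase(batch_size=1000):
        # Batch path: copy the oldest pending ticks into HBase for mining.
        pending = r.lrange("ticks:pending", 0, batch_size - 1)
        with ticks.batch() as b:
            for raw in pending:
                tick = json.loads(raw)
                row_key = "{0}:{1}".format(tick["symbol"], tick["ts"]).encode()
                b.put(row_key, {b"d:price": str(tick["price"]).encode()})
        # New ticks are appended on the right, so trimming from the left
        # removes exactly the items that were just copied.
        r.ltrim("ticks:pending", len(pending), -1)

The same division of labor holds if Memcached or Membase replaces Redis on the fast path; only the client calls change.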
Although scalability is very important if your data grows to the size of Google's or Facebook's, not all applications become that large. Scalability is relevant at far smaller scales too, but an attempt to make everything scalable can become an exercise in over-engineering. You certainly want to avoid unnecessary complexity.
In many systems, data integrity and transactional consistency matter more than any other requirement. Is NoSQL an option for such systems?
Transactional Integrity and Consistency
Transactional integrity is relevant only when data is modified, that is, created, updated, or deleted. Therefore, the question of transactional integrity is not pertinent in pure data warehousing and mining contexts. This means that batch-centric, Hadoop-based analytics on warehoused data is also not subject to transactional requirements.
Many data sets like web traffic log files, social networking status updates (including tweets or buzz), advertisement click-through imprints, road-traffic data, stock market tick data, and game scores are primarily, if not completely, written once and read multiple times. Data sets that are written once and read multiple times have limited or no transactional requirements.
Some data sets are updated and deleted, but often these modifications are limited to a single item and not a range within the data set. Sometimes, updates are frequent and involve a range operation. If range operations are common and the integrity of updates is required, an RDBMS is the best choice. If atomicity at the individual item level is sufficient, then column-family databases, document databases, and a few distributed key/value stores can guarantee it. If a system needs transactional integrity but can accommodate a window of inconsistency, eventual consistency is a possibility.
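
As an illustration of item-level atomicity, the following sketch uses MongoDB through the pymongo driver; the database, collection, and field names are hypothetical. A single update_one call is atomic on one document, which matches the guarantee described above, but it does not provide a multi-document or range transaction.

    from pymongo import MongoClient

    client = MongoClient("localhost", 27017)
    scores = client["gamedb"]["scores"]

    # Atomically increment one player's score. Concurrent writers can never
    # observe or produce a half-applied update on this single document.
    scores.update_one({"_id": "player-42"}, {"$inc": {"score": 10}},
                      upsert=True)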