approach where you can logically partition the data-write process from access and analytics, and use a separate database for each of the two tasks.
If scalability implies large data becoming available at an incredibly fast pace, for example stock market tick data or advertisement click tracking data, then column-family stores alone may not provide a complete solution. It's prudent to store the massively growing data in these stores and manipulate it using MapReduce operations for batch querying and data mining, but you may need something more nimble for fast writes and real-time manipulation. Nothing is faster than manipulating data in memory, so NoSQL options that keep data in memory and flush it to disk as the available capacity fills are probably good choices. Both MongoDB and Redis follow this strategy. Currently, MongoDB uses mmap and Redis implements a custom mapping from memory to disk. However, both MongoDB and Redis have actively been re-engineering their memory-mapping features, and things will continue to evolve. Using MongoDB or Redis with HBase or Hypertable is a good choice for a system that needs fast real-time data manipulation alongside a store for extensive data mining. Memcached and Membase can be used in place of MongoDB or Redis. Memcached and Membase act as a fast and efficient cache layer, and therefore complement column-family stores well. Membase has been used effectively with Hadoop-based systems for such use cases. With the merger of Membase and CouchDB, a well-integrated NoSQL product with both fast cache-centric features and distributed, scalable storage-centric features is likely to emerge.
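
To make the pairing concrete, here is a minimal sketch of the pattern just described, assuming the redis-py and happybase client libraries, a local Redis server, and an HBase Thrift gateway. The key name ticks:pending, the ticks table, and its d column family are hypothetical choices for illustration: fast writes land in an in-memory Redis list, and a periodic batch job drains them into HBase, where MapReduce-style jobs can mine them.

    import json

    import happybase
    import redis

    r = redis.Redis(host="localhost", port=6379)
    hbase = happybase.Connection(host="localhost", port=9090)
    ticks = hbase.table("ticks")  # assumed to exist with column family "d"

    def record_tick(symbol, price, ts):
        # Fast path: append the tick to an in-memory Redis list.
        r.rpush("ticks:pending",
                json.dumps({"symbol": symbol, "price": price, "ts": ts}))

    def drain_to_hbase(batch_size=1000):
        # Batch path: copy the oldest pending ticks into HBase for mining.
        pending = r.lrange("ticks:pending", 0, batch_size - 1)
        with ticks.batch() as b:
            for raw in pending:
                tick = json.loads(raw)
                row_key = "{0}:{1}".format(tick["symbol"], tick["ts"]).encode()
                b.put(row_key, {b"d:price": str(tick["price"]).encode()})
        # New ticks are appended on the right, so trimming from the left
        # removes exactly the items that were just copied.
        r.ltrim("ticks:pending", len(pending), -1)

The same division of labor holds if Memcached or Membase replaces Redis on the fast path; only the client calls change.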
Although scalability is very important if your data grows to the size of Google's or Facebook's, not all applications become that large. Scalability is relevant at far smaller scales too, but an attempt to make everything scalable can become an exercise in over-engineering. You certainly want to avoid unnecessary complexity.
In many systems, data integrity and transactional consistency matter more than any other requirement. Is NoSQL an option for such systems?
Transactional Integrity and Consistency
Transactional integrity is relevant only when data is modified, that is, created, updated, or deleted. Therefore, the question of transactional integrity is not pertinent in pure data warehousing and mining contexts. This means that batch-centric, Hadoop-based analytics on warehoused data is also not subject to transactional requirements.
Many data sets like web traffic log files, social networking status updates (including tweets or buzz), advertisement click-through imprints, road-traffic data, stock market tick data, and game scores are primarily, if not completely, written once and read multiple times. Data sets that are written once and read multiple times have limited or no transactional requirements.
Some data sets are updated and deleted, but often these modifications are limited to a single item and not a range within the data set. Sometimes, updates are frequent and involve a range operation. If range operations are common and the integrity of updates is required, an RDBMS is the best choice. If atomicity at the individual item level is sufficient, then column-family databases, document databases, and a few distributed key/value stores can guarantee it. If a system needs transactional integrity but can accommodate a window of inconsistency, eventual consistency is a possibility.
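
As an illustration of item-level atomicity, the following sketch uses MongoDB through the pymongo driver; the database, collection, and field names are hypothetical. A single update_one call is atomic on one document, which matches the guarantee described above, but it does not provide a multi-document or range transaction.

    from pymongo import MongoClient

    client = MongoClient("localhost", 27017)
    scores = client["gamedb"]["scores"]

    # Atomically increment one player's score. Concurrent writers can never
    # observe or produce a half-applied update on this single document.
    scores.update_one({"_id": "player-42"}, {"$inc": {"score": 10}},
                      upsert=True)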