Storage
For map-reduce to work, the data needs to be node aware. In other words, the data must
be distributed so that each processing node where map and reduce jobs run can reach the
data it needs. Data for map-reduce is not stored the way relational data normally is
(the entire record in one place); instead, related data is grouped together and stored
in chunks, which are then divided among nodes. Each such data set is identified through
key-value pairs.
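As a rough illustration of grouping related data by key and dividing the groups among nodes, the following minimal sketch uses hash partitioning. The function name `partition_by_key`, the sample records, and the choice of CRC32 as the hash are assumptions made for this example; real systems such as HDFS place data differently (for instance, by splitting files into blocks).

```python
import zlib
from collections import defaultdict


def partition_by_key(records, num_nodes):
    """Group (key, value) pairs by key and assign each group to one node."""
    nodes = defaultdict(lambda: defaultdict(list))
    for key, value in records:
        # A stable hash, so the same key always maps to the same node.
        node_id = zlib.crc32(key.encode("utf-8")) % num_nodes
        nodes[node_id][key].append(value)
    return nodes


# Hypothetical sample data: events keyed by user.
records = [("user:1", "login"), ("user:2", "view"), ("user:1", "purchase")]
layout = partition_by_key(records, num_nodes=3)
```

All values that share a key end up in a single group on a single node, so a map or reduce job running on that node can process the group without fetching data from elsewhere.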
The standard storage mechanism is a distributed file system having the following
characteristics:
Fault tolerance: Because data is distributed across many nodes, any of
which may fail, the storage system should be highly fault tolerant.
Extreme scalability: To accommodate big data scale
considerations, the storage system should be highly scalable.
Write once, read many times: Big data workloads are less
transaction oriented and more analysis oriented. The storage
system can therefore assume that data remains unchanged once
written and be designed for high data throughput.
Locality of computation: Moving voluminous data around to do
computations introduces severe drags on performance. Instead,
moving computation (map-reduce) to data results in faster
performance. The file system should have features to facilitate this.
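Fault tolerance and locality of computation can be pictured together with a small sketch. It assumes each chunk of data is replicated on several nodes and that a scheduler sends the computation for a chunk to a live node that already holds a copy. The names `pick_node_for_block`, `block_locations`, and the node and block identifiers are hypothetical and do not correspond to any Hadoop API.

```python
def pick_node_for_block(block_id, block_locations, live_nodes):
    """Return a live node that holds a replica of the block, or None."""
    for node in block_locations[block_id]:
        if node in live_nodes:
            return node
    return None


# Hypothetical layout: each block is replicated on three nodes.
block_locations = {
    "block-0": ["node-a", "node-b", "node-c"],
    "block-1": ["node-b", "node-c", "node-d"],
}
live_nodes = {"node-b", "node-c", "node-d"}  # node-a has failed

# Send each computation to a node that already stores the data.
assignments = {
    block_id: pick_node_for_block(block_id, block_locations, live_nodes)
    for block_id in block_locations
}
```

Even with node-a down, every block still has a live replica, and no block has to be moved across the network before its computation starts.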
HDFS (the distributed file system in Hadoop-based architectures) provides all of the
characteristics listed above. Unlike a database, HDFS can store and retrieve data but
cannot index it, so simple random access to data is not possible through HDFS alone.
HBase, another component of the Hadoop-based architecture, uses HDFS as its storage
system and provides a column-oriented database designed to store massive amounts of
data. Because it creates indexes, HBase offers fast random access to its contents, though
only for simple queries. For complex operations, HBase acts as both a source and a sink
(destination for computed data) for Hadoop map-reduce.
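The difference between unindexed storage and indexed storage can be sketched as follows. Without an index, finding one record means reading through the data sequentially; with an index, the record is reached directly by key. This is only an analogy: the helper names and the tab-separated sample lines are assumptions for illustration, and neither function uses the HDFS or HBase APIs.

```python
def scan_lookup(lines, wanted_key):
    """Find a record by reading every line in order (no index)."""
    for line in lines:
        key, value = line.split("\t", 1)
        if key == wanted_key:
            return value
    return None


def build_index(lines):
    """Build a key -> value index once, so later lookups are direct."""
    index = {}
    for line in lines:
        key, value = line.split("\t", 1)
        index[key] = value
    return index


# Hypothetical file contents: one "key<TAB>value" record per line.
lines = ["row1\talpha", "row2\tbeta", "row3\tgamma"]

scanned = scan_lookup(lines, "row3")   # reads all three lines
index = build_index(lines)
indexed = index.get("row3")            # single dictionary lookup
```

Both calls return the same value; the cost differs. The scan grows with the size of the data, while the indexed lookup does not, which is why an index matters for random access over massive data sets.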
Hive is another component of the Hadoop-based architecture; it provides a data
warehouse-style store for analysis. Built on top of Hadoop, Hive offers a table-based
abstraction over HDFS, which makes it easy to load structured data.
NoSQL databases are important components of the SMAQ stack: they offer built-in
map-reduce features that allow computation to be parallelized over distributed data
nodes. Hadoop-based systems are most often used for batch-oriented data collection,
whereas NoSQL stores are better suited to providing fast query responses to live
applications.
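To make the idea of parallelizing computation over distributed data nodes concrete, here is a minimal single-process sketch of map-reduce as a word count. Each inner list stands in for the data held by one node; the map step runs independently per partition and the reduce step merges the partial results. The function names and sample data are assumptions for this example, not the API of any particular NoSQL store.

```python
from collections import Counter
from functools import reduce


def map_partition(lines):
    """Map step: count words within a single node's partition."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial word counts."""
    return left + right


# Hypothetical data, one inner list per data node.
partitions = [["big data big"], ["data store"], ["big store store"]]

# In a real cluster each map call would run on the node holding the partition.
partials = [map_partition(partition) for partition in partitions]
totals = reduce(reduce_counts, partials, Counter())
```

Because each map call touches only its own partition, the calls could run on separate nodes at the same time; only the small partial counts need to travel to the reduce step.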
In Chapters 5 and 6 we will discuss several of these NoSQL data stores and data
modeling approaches at length.
 