The New Information Management Paradigm - Big Data Imperatives

Storage

For map-reduce to work, the data needs to be node aware. In other words, the data must be distributed so that it is available to each processing node where map and reduce jobs are executed. The data expected by map-reduce is not stored the way relational data normally is (the entire record in one place); instead, related data is grouped together and stored in chunks, which are then divided among the nodes. Each such data set is identified through key-value pairs.
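The grouping described above can be illustrated with a minimal Python sketch. It assumes hash partitioning on the key, which is only one possible way to divide data among nodes (a real distributed file system such as HDFS splits files into fixed-size blocks instead). The function name `partition_by_key` and the sample records are hypothetical.

```python
from collections import defaultdict
from zlib import crc32


def partition_by_key(records, num_nodes):
    """Assign each (key, value) pair to a node by hashing its key.

    Hash partitioning is an assumption made for this sketch; it guarantees
    that all values for one key end up in the same node's chunk.
    """
    nodes = [defaultdict(list) for _ in range(num_nodes)]
    for key, value in records:
        node_id = crc32(key.encode("utf-8")) % num_nodes
        nodes[node_id][key].append(value)
    return nodes


# Hypothetical sample data: (key, value) pairs.
records = [("alice", 3), ("bob", 5), ("alice", 7), ("carol", 1), ("bob", 2)]
nodes = partition_by_key(records, num_nodes=3)

# Each node reduces only the chunk it holds; no data moves between nodes.
local_totals = [{key: sum(values) for key, values in chunk.items()}
                for chunk in nodes]

# Keys never span nodes, so the per-node results can simply be combined.
merged = {}
for totals in local_totals:
    merged.update(totals)
```

Because every key lives on exactly one node, each node can compute its part of the result independently, which is the property map-reduce relies on.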

The standard storage mechanism is a distributed file system with the following characteristics:

• Fault tolerance: Because data is distributed across many nodes, any of which can fail, the storage system should be highly fault tolerant.
• Extreme scalability: To accommodate big data volumes, the storage system should be highly scalable.
• Write once, read many times: Big data workloads are less transaction oriented and more analysis oriented. The storage system can therefore assume that data remains unchanged after it is written and optimize for high data throughput.
• Locality of computation: Moving voluminous data around to perform computations is a severe drag on performance. Moving the computation (map-reduce) to the data is faster, and the file system should have features that facilitate this.
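Two of these characteristics, fault tolerance and locality of computation, can be sketched together in Python. This is a simplified illustration, not how any particular file system is implemented: the round-robin placement, the function names `place_blocks` and `schedule`, and the node names are all assumptions. The replication factor of 3 matches the HDFS default.

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Place each block on `replication` distinct nodes (round-robin).

    Round-robin placement is an assumption for this sketch; it requires
    replication <= len(nodes) so the replicas land on distinct nodes.
    """
    placement = {}
    for block in range(num_blocks):
        placement[block] = [nodes[(block + r) % len(nodes)]
                            for r in range(replication)]
    return placement


def schedule(placement, live_nodes):
    """Run each block's map task on a live node that already holds a replica."""
    load = {node: 0 for node in live_nodes}
    assignment = {}
    for block, replicas in placement.items():
        candidates = [node for node in replicas if node in load]
        if not candidates:
            raise RuntimeError(f"block {block} has no live replica")
        # Among the nodes holding the data, pick the least loaded one.
        chosen = min(candidates, key=lambda node: load[node])
        load[chosen] += 1
        assignment[block] = chosen
    return assignment


nodes = ["node-a", "node-b", "node-c", "node-d"]
placement = place_blocks(num_blocks=6, nodes=nodes, replication=3)

# Simulate the failure of node-c: every block still has a live replica,
# and every task runs where its data already resides.
assignment = schedule(placement, live_nodes=["node-a", "node-b", "node-d"])
```

Replication is what lets the scheduler tolerate a failed node, and scheduling tasks onto replica holders is what avoids moving the data.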

HDFS (the distributed file system in Hadoop-based architecture) provides all of the above-mentioned characteristics. Unlike a database, HDFS can store and retrieve data but cannot index it, so simple random access to data is not possible through HDFS. HBase, another component of the Hadoop-based architecture, uses HDFS as its storage system and provides a column-oriented database designed to store massive amounts of data. Because it creates indexes, HBase offers fast random access to its contents, though only through simple queries. For complex operations, HBase acts as both a source and a sink (destination for computed data) for Hadoop map-reduce.
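The difference between unindexed storage and indexed random access can be sketched in Python. The classes `FlatFile` and `SortedKeyStore` are hypothetical stand-ins written for this illustration, not the HDFS or HBase client APIs. The sketch relies on one fact about HBase, that rows are kept sorted by row key, and models that with a sorted list and binary search.

```python
import bisect


class FlatFile:
    """Unindexed store: finding a key means reading records until it turns up."""

    def __init__(self, records):
        self.records = list(records)

    def get(self, key):
        reads = 0
        for k, v in self.records:
            reads += 1
            if k == key:
                return v, reads
        return None, reads


class SortedKeyStore:
    """Rows kept sorted by row key, so a lookup is a binary search."""

    def __init__(self, records):
        rows = sorted(records)
        self.keys = [k for k, _ in rows]
        self.values = [v for _, v in rows]

    def get(self, key):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None

    def scan(self, start, stop):
        """Return rows with start <= key < stop (a simple range query)."""
        lo = bisect.bisect_left(self.keys, start)
        hi = bisect.bisect_left(self.keys, stop)
        return list(zip(self.keys[lo:hi], self.values[lo:hi]))


# Hypothetical data set of 1,000 rows.
records = [(f"user{i:04d}", i * 10) for i in range(1000)]
flat = FlatFile(records)
store = SortedKeyStore(records)

value, reads = flat.get("user0999")   # reads every record to find the last one
indexed_value = store.get("user0999")  # about log2(1000), roughly 10, comparisons
```

The sorted store supports only key lookups and key-range scans, which mirrors the "simple queries" limitation described above; anything more complex is left to a map-reduce job.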

Hive is another component of the Hadoop-based architecture that provides a data-warehouse-like store for analysis. Built on top of Hadoop, Hive provides a table-based abstraction over HDFS, which makes it easy to load structured data.
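The idea of a table-based abstraction over plain files can be sketched in Python as follows. This is only an analogy for what Hive does, not Hive's implementation or API: the schema, the tab-delimited file contents, the table name `visits`, and the function `read_table` are all assumptions made for illustration.

```python
# Hypothetical schema: column names and types applied when the file is read.
SCHEMA = [("user", str), ("page", str), ("seconds", int)]

# Stand-in for the contents of a tab-delimited file in the file system.
raw_file = "alice\t/home\t12\nbob\t/docs\t40\nalice\t/docs\t8\n"


def read_table(text, schema):
    """Yield one dict per line, applying the schema's names and types."""
    for line in text.splitlines():
        fields = line.split("\t")
        yield {name: cast(value)
               for (name, cast), value in zip(schema, fields)}


rows = list(read_table(raw_file, SCHEMA))

# Roughly the work behind a query such as:
#   SELECT user, SUM(seconds) FROM visits GROUP BY user
totals = {}
for row in rows:
    totals[row["user"]] = totals.get(row["user"], 0) + row["seconds"]
```

The file itself stays an ordinary delimited text file; the schema is a layer applied on top of it, which is why loading structured data into such a store is straightforward.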

NoSQL databases are important components of the SMAQ stack. They have built-in map-reduce features that allow computation to be parallelized over distributed data nodes. Hadoop-based systems are most often used for batch-oriented data collection, whereas NoSQL stores are better suited to providing fast query responses to live applications.

Chapters 5 and 6 discuss several of these NoSQL data stores and data modeling approaches at length.
