Advanced Analytics—Technology and Tools: MapReduce and Hadoop - Data Science and Big Data Analytics

and a relational database, this section presents considerable detail about the

implementation and use of HBase.

The HBase design is based on Google's 2006 paper on Bigtable. This paper

described Bigtable as a “distributed storage system for managing structured data.”

Google used Bigtable to store Google product-specific data for sites such as Google

Earth, which provides satellite images of the world. Bigtable was also used to

store web crawler results, data for personalized search optimization, and website

clickstream data. Bigtable was built on top of the Google File System. MapReduce

was also utilized to process data into or out of a Bigtable. For example, the raw

clickstream data was stored in a Bigtable. Periodically, a scheduled MapReduce job

would process and summarize the newly added clickstream data

and append the results to a second Bigtable [27].
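
The periodic summarization described above can be sketched as a map step followed by a reduce step. The following is a minimal in-memory illustration in Python, not Google's implementation and not the Bigtable API. The record fields (url, ts), the last_processed_ts watermark, and the plain lists and dictionaries standing in for the two tables are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical stand-in for the raw clickstream table: one record per click.
raw_clicks = [
    {"url": "/home", "ts": 100},
    {"url": "/cart", "ts": 105},
    {"url": "/home", "ts": 110},
    {"url": "/home", "ts": 205},
]

# Hypothetical stand-in for the second (summary) table.
summary_table = []


def map_click(record):
    """Map step: emit a (key, 1) pair for each click record."""
    return (record["url"], 1)


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def run_summary_job(clicks, summary, last_processed_ts):
    """Summarize only clicks newer than the watermark and append the result.

    Returns the new watermark so the next scheduled run can skip old data.
    """
    new_clicks = [c for c in clicks if c["ts"] > last_processed_ts]
    if not new_clicks:
        return last_processed_ts
    totals = reduce_counts(map_click(c) for c in new_clicks)
    new_watermark = max(c["ts"] for c in new_clicks)
    summary.append({"through_ts": new_watermark, "clicks_per_url": totals})
    return new_watermark


# One scheduled run over everything added since timestamp 0.
watermark = run_summary_job(raw_clicks, summary_table, last_processed_ts=0)
```

Carrying the watermark from one run to the next is one simple way to ensure that each scheduled job touches only newly added records; a production job would distribute the map and reduce steps across many nodes.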

The development of HBase began in 2006. HBase was included as part of a Hadoop

distribution at the end of 2007. In May 2010, HBase became an Apache

Top-Level Project. Later in 2010, Facebook began to use HBase for its user messaging

infrastructure, which accommodated 350 million users sending 15 billion messages

per month [28].

HBase Architecture and Data Model

HBase is a data store that is intended to be distributed across a cluster of nodes.

Like Hadoop and many of its related Apache projects, HBase is built upon HDFS

and achieves its real-time access speeds by sharing the workload over a large

number of nodes in a distributed cluster. An HBase table consists of rows and

columns. However, an HBase table also has a third dimension, version, to maintain

the different values of a row and column intersection over time.

To illustrate this third dimension, consider an online customer for whom several

shipping addresses are stored. The row would be identified by the customer

number. One column would provide the shipping

address. The value of the shipping address would be added at the intersection of

the customer number and the shipping address column, along with a timestamp

corresponding to when the customer last used this shipping address.

During a customer's checkout process from an online retailer, a website might use

such a table to retrieve and display the customer's previous shipping addresses. As

shown in Figure 10.6, the customer can then select the appropriate address, add a

new address, or delete any addresses that are no longer relevant.
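
The row, column, and version dimensions in the shipping address example can be sketched with a small in-memory structure. This is a toy model in Python, not the HBase client API. The class name VersionedTable, its methods, the row key "cust-1001", and the sample addresses are all hypothetical. Real HBase also groups columns into column families, which this sketch omits.

```python
import time


class VersionedTable:
    """Toy table keyed by (row, column) that keeps timestamped versions."""

    def __init__(self):
        # Layout: {row_key: {column: {timestamp: value}}}
        self._cells = {}

    def put(self, row, column, value, timestamp=None):
        """Store a value at a row/column intersection under a timestamp."""
        ts = timestamp if timestamp is not None else int(time.time() * 1000)
        self._cells.setdefault(row, {}).setdefault(column, {})[ts] = value

    def get_versions(self, row, column):
        """Return (timestamp, value) pairs for a cell, newest first."""
        versions = self._cells.get(row, {}).get(column, {})
        return sorted(versions.items(), reverse=True)

    def delete_version(self, row, column, timestamp):
        """Remove one version of a cell; do nothing if it is absent."""
        self._cells.get(row, {}).get(column, {}).pop(timestamp, None)


table = VersionedTable()

# Two addresses for the same customer; the timestamp records last use.
table.put("cust-1001", "shipping_address", "12 Oak St", timestamp=1)
table.put("cust-1001", "shipping_address", "98 Pine Ave", timestamp=2)

# At checkout, list previous addresses with the most recently used first.
addresses = table.get_versions("cust-1001", "shipping_address")

# The customer deletes an address that is no longer relevant.
table.delete_version("cust-1001", "shipping_address", 1)
remaining = table.get_versions("cust-1001", "shipping_address")
```

Keying each version by its timestamp lets the checkout page order addresses by recency without a separate column for the date of last use.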
