Advanced Analytics—Technology and Tools: MapReduce and Hadoop - Data Science and Big Data Analytics

HFiles. The get command is instantaneously processed and the appropriate data

returned to the client.

Over time, as the smaller HFiles accumulate, the worker node runs a major

compaction that merges the smaller HFiles into one large HFile. During the

major compaction, the deleted entries and the tombstone markers are permanently

removed from the files.
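
The merge-and-purge behavior described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not HBase's actual compaction code: each HFile is modeled as a Python list of (row_key, timestamp, value) tuples sorted by row key and then by newest timestamp first, and TOMBSTONE and major_compact are hypothetical names introduced for illustration.

```python
import heapq

# Hypothetical marker standing in for an HBase delete (tombstone) marker.
TOMBSTONE = object()


def _sort_key(entry):
    row_key, timestamp, value = entry
    # Row key ascending, newest timestamp first, tombstones before values on ties.
    return (row_key, -timestamp, value is not TOMBSTONE)


def major_compact(hfiles):
    """Merge several small sorted 'HFiles' into one list, purging deletes.

    Assumes every input list is already sorted according to _sort_key.
    """
    merged = heapq.merge(*hfiles, key=_sort_key)
    compacted = []
    deleted_at = {}  # row_key -> timestamp of the newest tombstone seen so far
    for row_key, timestamp, value in merged:
        if value is TOMBSTONE:
            deleted_at.setdefault(row_key, timestamp)
            continue  # the tombstone marker itself is not written out
        if timestamp <= deleted_at.get(row_key, -1):
            continue  # entry is masked by a tombstone, so it is dropped
        compacted.append((row_key, timestamp, value))
    return compacted


# Two small toy HFiles; the second one carries a delete for com.cnn.www.
hfile_a = [("com.cnn.www", 5, "cnn"), ("com.cnn.www", 1, "old")]
hfile_b = [("com.cnn.www", 3, TOMBSTONE), ("com.example.www", 2, "x")]
large_hfile = major_compact([hfile_a, hfile_b])
```

In this toy run, the entry written at timestamp 1 is masked by the tombstone at timestamp 3, so neither it nor the tombstone appears in the merged result, while the entry written at timestamp 5 survives.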

Use Cases for HBase

As described in Google's Bigtable paper, a common use case for a data store such

as HBase is to store the results from a web crawler. In the paper's example, the

row com.cnn.www corresponds to a website URL, www.cnn.com. A

column family, called anchor, is defined to capture the website URLs that provide

links to the row's website. A less obvious detail of the implementation is that

those anchoring website URLs are used as the column qualifiers. For example, if

sportsillustrated.cnn.com provides a link to www.cnn.com, the column

qualifier is sportsillustrated.cnn.com. Additional websites that provide

links to www.cnn.com appear as additional column qualifiers. The value stored in

the cell is simply the text on the website that provides the link. Here is how the

CNN example may look in HBase following a get operation.

hbase> get 'web_table', 'com.cnn.www', {VERSIONS => 2}

COLUMN                              CELL
anchor:sportsillustrated.cnn.com    timestamp=1380224620597, value=cnn
anchor:sportsillustrated.cnn.com    timestamp=1380224000001, value=cnn.com
anchor:edition.cnn.com              timestamp=1380224620597, value=cnn
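
To make the structure behind this output more concrete, the following minimal sketch models a table as nested dictionaries: row key, then "family:qualifier", then timestamp, then value. ToyTable and its put and get methods are hypothetical names for illustration only and are not the HBase client API; the row, columns, timestamps, and values are taken from the output above.

```python
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for a versioned, column-oriented table."""

    def __init__(self):
        # row key -> "family:qualifier" -> {timestamp: value}
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value, timestamp):
        self._rows[row][column][timestamp] = value

    def get(self, row, versions=1):
        """Return (column, timestamp, value) tuples, newest versions first."""
        results = []
        columns = self._rows.get(row, {})
        for column in sorted(columns):
            cells = columns[column]
            for ts in sorted(cells, reverse=True)[:versions]:
                results.append((column, ts, cells[ts]))
        return results


web_table = ToyTable()
row = "com.cnn.www"
# The linking site's URL is the column qualifier; the link text is the value.
web_table.put(row, "anchor:sportsillustrated.cnn.com", "cnn.com", 1380224000001)
web_table.put(row, "anchor:sportsillustrated.cnn.com", "cnn", 1380224620597)
web_table.put(row, "anchor:edition.cnn.com", "cnn", 1380224620597)

cells = web_table.get(row, versions=2)
```

Asking for two versions returns both timestamps stored under anchor:sportsillustrated.cnn.com, while the default of one version would return only the newest cell for each column. This sketch sorts columns alphabetically, which is a choice of the toy model.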

Additional results are returned for each corresponding website that provides a

link to www.cnn.com. Finally, the row key is com.cnn.www rather than

www.cnn.com for a reason. By reversing the URLs, the various suffixes

(.com, .gov, or .net) that correspond to the Internet's top-level domains are

stored in order. Also, the next part of the domain name (cnn) is stored in order.

So, all of the cnn.com websites could be retrieved by a scan with the STARTROW of

com.cnn and the appropriate STOPROW.
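
The effect of reversing the hostname can be sketched in a few lines. This is an illustrative model only: reverse_host, stop_row_for_prefix, and scan are hypothetical helpers that mimic a range scan over sorted row keys (start row inclusive, stop row exclusive), and the host list is made up for the example. The choice of com.cnn/ as the stop row is an assumption: it is the smallest string that sorts after every key beginning with com.cnn. in ASCII order.

```python
import bisect


def reverse_host(hostname):
    """Turn 'www.cnn.com' into the row key 'com.cnn.www'."""
    return ".".join(reversed(hostname.split(".")))


def stop_row_for_prefix(prefix):
    """Smallest string that sorts after every key starting with prefix.

    Assumes the last character of prefix is not the maximum code point.
    """
    return prefix[:-1] + chr(ord(prefix[-1]) + 1)


def scan(sorted_keys, start_row, stop_row):
    """Mimic a range scan: start_row is inclusive, stop_row is exclusive."""
    lo = bisect.bisect_left(sorted_keys, start_row)
    hi = bisect.bisect_left(sorted_keys, stop_row)
    return sorted_keys[lo:hi]


# Made-up crawl targets; reversing them groups hosts of one domain together.
hosts = [
    "www.cnn.com",
    "edition.cnn.com",
    "sportsillustrated.cnn.com",
    "www.cnbc.com",
    "www.nasa.gov",
    "www.example.net",
]
row_keys = sorted(reverse_host(h) for h in hosts)

start_row = "com.cnn"
stop_row = stop_row_for_prefix("com.cnn.")  # "com.cnn/"
cnn_rows = scan(row_keys, start_row, stop_row)
```

Because the keys are sorted, the three cnn.com hosts sit next to each other and fall inside one contiguous range, whereas with unreversed hostnames they would be scattered among keys starting with edition, sportsillustrated, and www.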
