Advanced Analytics—Technology and Tools: MapReduce and Hadoop - Data Science and Big Data Analytics

an HBase environment and illustrating their use. However, in a

production environment, the HBase Java API could be used to program

the desired operations and the conditions in which to execute the

operations.

• Column family and column qualifier names: It is important to keep

column family and column qualifier names as short as possible.

Although short names tend to go against conventional wisdom

about using meaningful, descriptive names, the column family name

and the column qualifier are stored as part of the key of every

key/value pair. Thus, each additional byte in a name is repeated for

every stored cell and can quickly add up. Also, by default, HDFS stores

three copies of each block across the Hadoop cluster, which triples the

storage requirement.
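
The effect of name length can be estimated with simple arithmetic. The following is a minimal sketch, not a measurement: the cell count and the example names are hypothetical, and the estimate ignores compression, data block encoding, and the other parts of each key (row key, timestamp).

```python
def key_overhead_bytes(family, qualifier, cells, replication=3):
    """Estimate the bytes spent storing family and qualifier names.

    Assumes the names are repeated once per stored cell and that each
    HDFS block is kept `replication` times (three by default).
    """
    per_cell = len(family.encode("utf-8")) + len(qualifier.encode("utf-8"))
    return per_cell * cells * replication


cells = 1_000_000_000  # hypothetical table with one billion cells

# Hypothetical descriptive names versus abbreviated names.
long_names = key_overhead_bytes("customer_profile", "email_address", cells)
short_names = key_overhead_bytes("cp", "em", cells)

print(f"descriptive names: {long_names / 1e9:.0f} GB")
print(f"short names:       {short_names / 1e9:.0f} GB")
print(f"difference:        {(long_names - short_names) / 1e9:.0f} GB")
```

Under these assumptions, the names alone account for tens of gigabytes, which is why abbreviated names are usually preferred despite being less readable.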

• Defining rows: The definition of the row is one of the most important

aspects of the HBase table design. In general, the row key is the main

mechanism for reading from and writing to an HBase table. The row needs

to be constructed in such a way that the requested columns can be easily

and quickly retrieved.

• Avoid creating sequential rows: A natural tendency is to create rows

sequentially. For example, if the row key is to have the customer

identification number, and the customer identification numbers are

created sequentially, HBase may run into a situation in which all the new

users and their data are written to just one region, so the workload is

not distributed across the cluster as intended [35]. One way to resolve

this problem is to randomly assign a prefix to the sequential

number.
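
The prefix idea can be sketched as follows. Note that this sketch swaps in a deterministic variant: instead of a truly random prefix, it derives the prefix from a hash of the customer identification number, so the same key can be recomputed at read time. The function name, the bucket count, and the key layout are assumptions made for illustration and are not part of HBase.

```python
import hashlib
from collections import Counter

NUM_BUCKETS = 8  # hypothetical number of prefix buckets


def salted_row_key(customer_id: int, buckets: int = NUM_BUCKETS) -> str:
    """Build a row key with a hash-derived prefix in front of the ID."""
    digest = hashlib.sha256(str(customer_id).encode("utf-8")).digest()
    bucket = digest[0] % buckets  # first digest byte selects the bucket
    return f"{bucket:02d}-{customer_id:010d}"


# Sequential customer IDs no longer sort next to each other.
keys = [salted_row_key(cid) for cid in range(1000, 2000)]

# Count how many of the 1,000 keys landed under each prefix.
spread = Counter(key[:2] for key in keys)
print(sorted(spread.items()))
```

The trade-off is that a range scan over consecutive customer identification numbers must now visit every prefix bucket, so this approach suits workloads dominated by single-row reads and writes.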

• Versioning control: HBase table options that can be defined during

table creation or altered later control how long a version of a cell's

contents will exist. A TimeToLive (TTL) option sets the age after which

older versions are deleted. There are also options for the minimum

and maximum number of versions to maintain.
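
How these options interact can be illustrated with a simplified model. The sketch below is not HBase code; it is a toy function, written under the assumption that at most the maximum number of versions is kept, that versions older than the TTL are dropped, and that the minimum number of versions is retained even when those versions have expired. The function name and the timestamps are hypothetical.

```python
def retained_versions(timestamps, now, ttl, min_versions, max_versions):
    """Return the version timestamps a cell would keep, newest first."""
    # Never keep more than max_versions, preferring the newest ones.
    newest_first = sorted(timestamps, reverse=True)[:max_versions]
    kept = []
    for index, ts in enumerate(newest_first):
        expired = now - ts > ttl
        # Expired versions survive only to satisfy min_versions.
        if not expired or index < min_versions:
            kept.append(ts)
    return kept


versions = [100, 200, 300, 400, 500]  # hypothetical write times in seconds

# Shortly after the last write: the version cap and the TTL both apply.
print(retained_versions(versions, now=600, ttl=250, min_versions=1, max_versions=3))

# Much later: everything has expired, but one version is still retained.
print(retained_versions(versions, now=2000, ttl=250, min_versions=1, max_versions=3))
```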

• ZooKeeper: HBase uses Apache ZooKeeper to coordinate and manage the

various regions running on the distributed cluster. In general, ZooKeeper

is “a centralized service for maintaining configuration information,

naming, providing distributed synchronization, and providing group

services. All of these kinds of services are used in some form or another by
