an HBase environment and illustrating their use. However, in a
production environment, the HBase Java API could be used to program
the desired operations and the conditions under which to execute
them.
Column family and column qualifier names: It is important to keep
the names of column families and column qualifiers as short as
possible. Although short names run against the conventional wisdom
of using meaningful, descriptive names, the column family name and
the column qualifier are stored as part of the key of every key/value
pair. Thus, every additional byte in a name is repeated for each cell
of each row and can quickly add up. Also, by default, HDFS keeps three
replicas of each block across the Hadoop cluster, which triples the
storage requirement.
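The arithmetic behind this advice can be sketched in a few lines of Java. This is a rough estimate only, not an HBase API: the class name, the example family and qualifier names, the row count, and the columns-per-row figure are all hypothetical, and the estimate assumes every column uses names of the same length and ignores compression and block encoding.

```java
import java.nio.charset.StandardCharsets;

final class KeyNameOverhead {
    // Raw bytes spent on family + qualifier names across all cells,
    // multiplied by the HDFS replication factor (3 by default).
    public static long nameBytes(String family, String qualifier,
                                 long rows, int columnsPerRow, int replication) {
        long perCell = family.getBytes(StandardCharsets.UTF_8).length
                     + qualifier.getBytes(StandardCharsets.UTF_8).length;
        return perCell * columnsPerRow * rows * replication;
    }

    public static void main(String[] args) {
        // Hypothetical table: 100 million rows, 10 columns per row.
        long verbose = nameBytes("customer_details", "email_address", 100_000_000L, 10, 3);
        long terse = nameBytes("d", "em", 100_000_000L, 10, 3);
        System.out.println("verbose names: " + verbose + " bytes");
        System.out.println("terse names:   " + terse + " bytes");
    }
}
```

Under these assumed figures, the descriptive names account for 87,000,000,000 bytes of raw key data against 9,000,000,000 bytes for the short names, before any compression is applied.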
Defining rows: The definition of the row key is one of the most
important aspects of HBase table design. In general, the row key is the
main mechanism for performing read/write operations on an HBase table.
It needs to be constructed in such a way that the requested columns can
be easily and quickly retrieved.
Avoid creating sequential rows: A natural tendency is to create rows
sequentially. For example, if the row key contains the customer
identification number, and customer identification numbers are
assigned sequentially, HBase may run into a situation in which all the
new users and their data are written to just one region, so the workload
is not distributed across the cluster as intended [35]. One approach to
resolving this problem is to randomly assign a prefix to the sequential
number.
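A minimal sketch of the prefixing idea follows. Note that it swaps in a variation on the approach described above: instead of a purely random prefix, the prefix is derived from a hash of the identifier, so that a reader can recompute it when looking up a given customer. The class name, the bucket count, and the two-digit prefix format are hypothetical choices made for illustration only.

```java
import java.util.Locale;

final class SaltedRowKey {
    // Prepend a bucket number derived from the id's hash code.
    // The same id always yields the same prefix, so point lookups
    // can rebuild the full row key from the id alone.
    public static String salt(String customerId, int buckets) {
        int bucket = Math.floorMod(customerId.hashCode(), buckets);
        return String.format(Locale.ROOT, "%02d-%s", bucket, customerId);
    }

    public static void main(String[] args) {
        // Sequential ids end up under different prefixes.
        for (String id : new String[] {"1000", "1001", "1002", "1003"}) {
            System.out.println(salt(id, 4));
        }
    }
}
```

The trade-off is that a scan over a contiguous range of customer numbers now has to visit every bucket, so the number of buckets is usually kept small relative to the number of regions.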
Versioning control: HBase table options, which can be defined during
table creation or altered later, control how long a version of a cell's
contents will exist. A TimeToLive (TTL) option sets the age after which
older versions are deleted. There are also options for the minimum
and maximum number of versions to maintain.
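How these three options interact can be illustrated with a simplified model in plain Java. This is not HBase code, and the class and method names are hypothetical. It assumes the commonly described behavior: at most the maximum number of versions is retained, versions older than the TTL are dropped, and the minimum number of newest versions is kept even if they have expired. In HBase itself, the physical removal of versions is typically deferred to compaction, which this sketch does not model.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class VersionRetention {
    // Returns the version timestamps that survive, newest first.
    public static List<Long> surviving(List<Long> timestamps, long now,
                                       long ttlMillis, int minVersions, int maxVersions) {
        List<Long> newestFirst = new ArrayList<>(timestamps);
        newestFirst.sort(Comparator.reverseOrder());
        List<Long> kept = new ArrayList<>();
        for (long ts : newestFirst) {
            if (kept.size() >= maxVersions) {
                break; // never retain more than the maximum
            }
            boolean expired = now - ts > ttlMillis;
            // Expired versions survive only while the minimum is not yet met.
            if (!expired || kept.size() < minVersions) {
                kept.add(ts);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Long> versions = List.of(100L, 200L, 300L, 400L, 500L);
        System.out.println(surviving(versions, 600L, 250L, 1, 3));
    }
}
```

With the sample values in `main`, the versions written at 300 and earlier are older than the TTL and are dropped, while a much shorter TTL would still leave the single newest version in place because of the minimum.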
ZooKeeper: HBase uses Apache ZooKeeper to coordinate and manage the
various regions running on the distributed cluster. In general, ZooKeeper
is “a centralized service for maintaining configuration information,
naming, providing distributed synchronization, and providing group
services. All of these kinds of services are used in some form or another by