Defining the Schema - HBase Essentials


Once we have answers, certain practices are followed to ensure optimal table design.

Some of the design practices are as follows:

• Data for a given column family goes into a single store on HDFS. This store might consist of multiple HFiles, which are eventually merged into a single HFile through compaction.

• Columns in a column family are stored together on disk, so columns with different access patterns should be kept in different column families.

• If we design tables with fewer columns and many rows (a tall table), we might achieve O(1) operations, but we also compromise on atomicity.

• Each access pattern should be satisfied by a single API call. Needing multiple calls is a sign of poor design.

We need to design the table schema not only to store data in a column-family layout but also to suit the read/write pattern for the table, that is, how the application is going to access the data from an HBase table. Similarly, rowkeys should be designed around the access patterns, because each region holds a range of rows based on the rowkeys and the HFiles store the rows sorted on disk. Hence, the rowkey is crucial to the performance of I/O interactions with HBase.
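The effect of sorted rowkeys on access can be sketched without a cluster. The example below is a minimal illustration, not HBase client code: a `TreeMap` stands in for the sorted row storage, and the composite key format `<accountId>#<sequence>`, the class name, and the helper names are all assumptions made for this sketch. Because related rows share a key prefix, they sit next to each other and can be read as one contiguous range.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.NavigableMap;
import java.util.TreeMap;

// Illustration only: a TreeMap emulates rows kept sorted by rowkey.
class RowKeyScanSketch {

    // Hypothetical composite rowkey: "<accountId>#<zero-padded sequence>".
    // Zero padding keeps numeric order and lexicographic order aligned.
    public static String rowKey(String accountId, int seq) {
        return String.format(Locale.ROOT, "%s#%08d", accountId, seq);
    }

    // Emulates a range scan over every row whose key starts with "<accountId>#".
    // The stop key uses '$', the character right after '#' in ASCII, so the
    // half-open range [start, stop) covers exactly that prefix.
    public static List<String> scanAccount(NavigableMap<String, String> table,
                                           String accountId) {
        String start = accountId + "#";
        String stop = accountId + "$";
        return new ArrayList<>(table.subMap(start, true, stop, false).values());
    }

    public static void main(String[] args) {
        NavigableMap<String, String> table = new TreeMap<>();
        // Rows are written in arbitrary order; the map keeps them sorted by key.
        table.put(rowKey("acct2", 1), "acct2-first");
        table.put(rowKey("acct1", 2), "acct1-second");
        table.put(rowKey("acct10", 1), "acct10-first");
        table.put(rowKey("acct1", 1), "acct1-first");

        // One range read returns all rows of acct1, in sequence order.
        System.out.println(scanAccount(table, "acct1"));
    }
}
```

With a key that leads with the account identifier, "all rows for one account" becomes a single range read. A key that led with the sequence number instead would scatter those rows across the key space.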

HBase doesn't support cross-row transactions, so client code should be kept simple and avoid any transactional logic that spans rows.

HBase derives its design from Google's Bigtable, where a one-row-per-account layout might easily hold multiple terabytes in a single row, whether the design is sound or poor. However, the same information can also be stored in a tall table (lots of rows with fewer columns), which also provides performance benefits. This performance benefit comes at the cost of atomicity. The physical storage for both table designs is essentially the same.
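The wide and tall layouts can be compared with a small in-memory model. This is a simplified sketch under stated assumptions: nested sorted maps stand in for cells ordered by rowkey and then by column qualifier, and the account/transaction data, the `#` separator, the `amount` qualifier, and all names are hypothetical. Real HBase cells also carry a column family, timestamp, and type, which are left out here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: table = rowkey -> (qualifier -> value), both levels sorted.
class WideVsTallSketch {

    // Wide layout: one row per account, one column per transaction.
    public static void putWide(Map<String, Map<String, String>> table,
                               String account, String txId, String amount) {
        table.computeIfAbsent(account, k -> new TreeMap<>()).put(txId, amount);
    }

    // Tall layout: one row per transaction, with a single fixed column.
    public static void putTall(Map<String, Map<String, String>> table,
                               String account, String txId, String amount) {
        table.computeIfAbsent(account + "#" + txId, k -> new TreeMap<>())
             .put("amount", amount);
    }

    // Walks the cells in stored order (by row, then by qualifier).
    public static List<String> storedValues(Map<String, Map<String, String>> table) {
        List<String> out = new ArrayList<>();
        for (Map<String, String> columns : table.values()) {
            out.addAll(columns.values());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> wide = new TreeMap<>();
        Map<String, Map<String, String>> tall = new TreeMap<>();
        String[][] txs = {
            {"acct2", "tx0001", "30"},
            {"acct1", "tx0002", "20"},
            {"acct1", "tx0001", "10"},
        };
        for (String[] tx : txs) {
            putWide(wide, tx[0], tx[1], tx[2]);
            putTall(tall, tx[0], tx[1], tx[2]);
        }
        System.out.println(storedValues(wide)); // [10, 20, 30]
        System.out.println(storedValues(tall)); // [10, 20, 30]
        System.out.println(wide.size() + " rows vs " + tall.size() + " rows"); // 2 rows vs 3 rows
    }
}
```

For this data, both layouts keep the cells in the same order, which mirrors the point that the physical storage is essentially the same. The difference is in row boundaries: the wide table holds both acct1 transactions in one row, while the tall table spreads them over two rows, and HBase guarantees atomic mutations only within a single row.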
