Performance Optimization - HBase Design Patterns - page 102


• Using MapReduce to insert data in parallel (this approach also uses the Java API), as shown in the following diagram:

[Diagram: three MapReduce tasks run in parallel and send put() calls to three HBase RegionServers.]

• Using MapReduce to generate HBase store files (HFiles) in bulk and in parallel, and then importing them into HBase directly. (This approach does not use the insert API, does not require writing code, and is very efficient.)

[Diagram: three MapReduce tasks run in parallel; each one first creates an HFile on HDFS (step 1, create()), and each HFile is then imported into an HBase RegionServer (step 2, import()).]
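The two steps of the bulk approach (create the files, then import them) can be sketched with a small toy model. This is only an illustration under stated assumptions: `BulkLoadSketch`, `StoreFile`, `Region`, `create`, and `importFiles` are hypothetical names, not HBase classes or methods, and the model runs in a single process, whereas in a real job each region's file would typically be produced by a separate task.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of the create-then-import flow. None of these types are HBase classes. */
final class BulkLoadSketch {

    /** An immutable, key-sorted file, standing in for an HFile on HDFS. */
    static final class StoreFile {
        final List<String[]> sortedCells; // each entry is {rowKey, value}
        StoreFile(List<String[]> sortedCells) {
            this.sortedCells = Collections.unmodifiableList(sortedCells);
        }
    }

    /** A region serves row keys from startKey up to the next region's startKey. */
    static final class Region {
        final String startKey;
        final List<StoreFile> files = new ArrayList<>();
        Region(String startKey) { this.startKey = startKey; }
    }

    /** Builds regions from their start keys; "" must be included so every key has a region. */
    static TreeMap<String, Region> regionsWithStartKeys(String... startKeys) {
        TreeMap<String, Region> regions = new TreeMap<>();
        for (String k : startKeys) regions.put(k, new Region(k));
        return regions;
    }

    /** Step 1, create(): group rows by region and write one sorted file per region. */
    static Map<String, StoreFile> create(List<String[]> rows, TreeMap<String, Region> regions) {
        Map<String, TreeMap<String, String>> perRegion = new TreeMap<>();
        for (String[] row : rows) {
            String start = regions.floorKey(row[0]); // region whose range covers this key
            perRegion.computeIfAbsent(start, k -> new TreeMap<>()).put(row[0], row[1]);
        }
        Map<String, StoreFile> out = new TreeMap<>();
        for (Map.Entry<String, TreeMap<String, String>> e : perRegion.entrySet()) {
            List<String[]> cells = new ArrayList<>();
            for (Map.Entry<String, String> c : e.getValue().entrySet()) {
                cells.add(new String[] {c.getKey(), c.getValue()}); // TreeMap iterates in key order
            }
            out.put(e.getKey(), new StoreFile(cells));
        }
        return out;
    }

    /** Step 2, import(): each region adopts its finished file as-is, with no per-row work. */
    static void importFiles(Map<String, StoreFile> files, TreeMap<String, Region> regions) {
        for (Map.Entry<String, StoreFile> e : files.entrySet()) {
            regions.get(e.getKey()).files.add(e.getValue());
        }
    }

    /** Looks a row up in the files of the region that covers its key; null if absent. */
    static String get(String rowKey, TreeMap<String, Region> regions) {
        Region region = regions.floorEntry(rowKey).getValue();
        for (StoreFile f : region.files) {
            for (String[] cell : f.sortedCells) {
                if (cell[0].equals(rowKey)) return cell[1];
            }
        }
        return null;
    }

    public static void main(String[] args) {
        TreeMap<String, Region> regions = regionsWithStartKeys("", "h", "p");
        List<String[]> rows = Arrays.asList(
            new String[] {"zebra", "1"}, new String[] {"apple", "2"},
            new String[] {"mango", "3"}, new String[] {"hotel", "4"});
        importFiles(create(rows, regions), regions);
        System.out.println(get("mango", regions));
    }
}
```

The point of the sketch is that all sorting and grouping happens in `create`, before the regions are involved; `importFiles` only hands over finished files.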

Comparing the three methods by speed, from slowest to fastest, gives the following order:

Java client < MapReduce insert < HBase file import

Both the Java client and MapReduce use the HBase APIs to insert data. MapReduce runs on multiple machines and can exploit parallelism, but both methods still go through the HBase write path. Importing HBase files directly skips the usual write path: the files already hold the data in the format that HBase understands. That is why importing them is much faster than using MapReduce or the Java client.
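To make the term "write path" concrete, here is a minimal toy model of what each inserted row passes through (a write-ahead log append, a sorted in-memory insert, and an eventual flush to a file) next to a direct file import. It is a sketch under assumptions, not HBase code: `WritePathSketch` and its members are hypothetical names, and the flush threshold is a simple row count chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of one region's write path. Not HBase code; all names are illustrative. */
final class WritePathSketch {
    final List<String> wal = new ArrayList<>();               // append-only log kept for durability
    final TreeMap<String, String> memStore = new TreeMap<>(); // in-memory buffer, sorted by row key
    final List<List<String>> storeFiles = new ArrayList<>();  // sorted files on disk
    final int flushThreshold;                                 // assumption: counted in rows

    WritePathSketch(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    /** Every put pays for a log append and a sorted insert, and may trigger a flush. */
    void put(String rowKey, String value) {
        wal.add(rowKey + "=" + value);
        memStore.put(rowKey, value);
        if (memStore.size() >= flushThreshold) {
            flush();
        }
    }

    /** Writes the buffered rows out as one key-sorted file and empties the buffer. */
    void flush() {
        List<String> file = new ArrayList<>();
        for (Map.Entry<String, String> e : memStore.entrySet()) {
            file.add(e.getKey() + "=" + e.getValue());
        }
        storeFiles.add(file);
        memStore.clear();
    }

    /** Direct import: adopt a file that is already sorted, touching neither log nor buffer. */
    void importFile(List<String> sortedFile) {
        storeFiles.add(new ArrayList<>(sortedFile));
    }

    public static void main(String[] args) {
        WritePathSketch region = new WritePathSketch(2);
        region.put("b", "1");
        region.put("a", "2");
        region.put("c", "3");
        region.importFile(Arrays.asList("x=9", "y=8"));
        System.out.println("log entries: " + region.wal.size()
            + ", store files: " + region.storeFiles.size());
    }
}
```

In this model the per-row cost sits entirely in `put`, while `importFile` costs the same regardless of how the rows were produced, which mirrors the reasoning above.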
