As you can see, the inserts go to different regions, so on an HBase cluster with many region servers the load will be spread across the cluster. This is because we have presplit the table into regions. Here are some questions to test your understanding: run the same ImportTsv command again and see how many records are in the table. Do you get duplicates? Try to find the answer and explain why it is correct, then check your answers against the topic's GitHub repository.
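For reference, here is a rough sketch of the commands involved; the table name, column family, split points, column mapping, and HDFS path are illustrative assumptions, not the exact values used earlier:

    # In the HBase shell: create a presplit table so the bulk import
    # is spread across several regions (names and split points are assumed)
    create 'sensor_data', 'cf', SPLITS => ['row250', 'row500', 'row750']

    # From the command line: bulk import a tab-separated file already in HDFS
    hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
      -Dimporttsv.columns=HBASE_ROW_KEY,cf:value \
      sensor_data /user/hbase/input/data.tsv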
Bulk import scenarios
Here are a few bulk import scenarios:
Scenario: The data is already in HDFS and needs to be imported into HBase.
Methods:
• If the ImportTsv tool can work for you, use it, as it will save the time of writing custom MapReduce code.
• Sometimes, you might have to write a custom MapReduce job to import the data (for example, for complex time series data, data mapping, and so on).
Notes:
• It is probably a good idea to presplit the table before a bulk import. This spreads the insert requests across the cluster and results in a higher insert rate.
• If you are writing a custom MapReduce job, consider using a high-level MapReduce platform such as Pig or Hive; they are much more concise to write than the equivalent Java code.

Scenario: The data is in another database (RDBMS/NoSQL) and you need to import it into HBase.
Methods:
• Use a utility such as Sqoop to bring the data into HDFS, and then use the tools outlined in the first scenario (a sample Sqoop command follows these scenarios).
Notes:
• Avoid writing MapReduce code that directly queries the database. Most databases cannot handle many simultaneous connections. It is best to bring the data into Hadoop (HDFS) first and then use MapReduce.
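As a rough sketch of the second scenario, a Sqoop import into HDFS might look like the following; the JDBC URL, credentials, source table, and target directory are all illustrative assumptions:

    # Pull the 'orders' table out of a MySQL database into HDFS as
    # tab-separated files (connection details are assumed for illustration)
    sqoop import \
      --connect jdbc:mysql://dbhost/sales \
      --username reporter -P \
      --table orders \
      --fields-terminated-by '\t' \
      --target-dir /user/hbase/input/orders

Once the data is sitting in HDFS, it can be loaded into HBase with ImportTsv or a custom MapReduce job, exactly as in the first scenario.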