As you can see, the inserts go to different regions, so on an HBase cluster with many region servers the load will be spread across the cluster. This is because we have presplit the table into regions. Here are some questions to test your understanding: run the same ImportTsv command again and see how many records are in the table. Do you get duplicates? Try to find the answer and explain why it is correct, then check your answers against the topic's GitHub repository.
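For reference, here is a rough sketch of the commands involved; the table name, column family, split points, column mapping, and HDFS path are illustrative assumptions, not the exact values used earlier:

    # In the HBase shell: create a presplit table so the bulk import
    # is spread across several regions (names and split points are assumed)
    create 'sensor_data', 'cf', SPLITS => ['row250', 'row500', 'row750']

    # From the command line: bulk import a tab-separated file already in HDFS
    hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
      -Dimporttsv.columns=HBASE_ROW_KEY,cf:value \
      sensor_data /user/hbase/input/data.tsv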
Bulk import scenarios
Here are a few bulk import scenarios:
Scenario: The data is already in HDFS and needs to be imported into HBase.
Methods:
• If the ImportTsv tool can work for you, use it, as it will save the time of writing custom MapReduce code.
• Sometimes, you might have to write a custom MapReduce job to import the data (for example, for complex time series data, data mapping, and so on).
Notes:
• It is probably a good idea to presplit the table before a bulk import. This spreads the insert requests across the cluster and results in a higher insert rate.
• If you are writing a custom MapReduce job, consider using a high-level MapReduce platform such as Pig or Hive; they are much more concise to write than the equivalent Java code.

Scenario: The data is in another database (RDBMS/NoSQL) and you need to import it into HBase.
Methods:
• Use a utility such as Sqoop to bring the data into HDFS, and then use the tools outlined in the first scenario (a sample Sqoop command follows these scenarios).
Notes:
• Avoid writing MapReduce code that directly queries the database. Most databases cannot handle many simultaneous connections. It is best to bring the data into Hadoop (HDFS) first and then use MapReduce.
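As a rough sketch of the second scenario, a Sqoop import into HDFS might look like the following; the JDBC URL, credentials, source table, and target directory are all illustrative assumptions:

    # Pull the 'orders' table out of a MySQL database into HDFS as
    # tab-separated files (connection details are assumed for illustration)
    sqoop import \
      --connect jdbc:mysql://dbhost/sales \
      --username reporter -P \
      --table orders \
      --fields-terminated-by '\t' \
      --target-dir /user/hbase/input/orders

Once the data is sitting in HDFS, it can be loaded into HBase with ImportTsv or a custom MapReduce job, exactly as in the first scenario.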