HBase as a data sink
When HBase acts as a data sink, the TableOutputFormat class sets up a table as the output of the MapReduce job:
Job job = new Job(conf, "Writing data to the " + table);
job.setOutputFormatClass(TableOutputFormat.class);
job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, table);
The preceding lines also use an implicit write buffer set up by the TableOutputFormat class. Each call to context.write() internally invokes table.put() with the given Put instance. The TableOutputFormat class also takes care of calling flushCommits() when the job is complete.
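To make the write path concrete, the following is a minimal sketch of a map-only job that emits Puts through this mechanism; the class, column family, and column names are illustrative assumptions, not part of the original text, and running it requires an HBase cluster on the classpath:

```java
import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative mapper: parses "rowkey,value" lines and writes each row to
// HBase. Every context.write() hands the Put to TableOutputFormat, which
// buffers it and flushes the buffer when the task finishes.
public class TextToHBaseMapper
    extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split(",");       // e.g. "row1,hello"
    Put put = new Put(Bytes.toBytes(fields[0]));         // rowkey
    put.add(Bytes.toBytes("cf1"),                        // assumed family
            Bytes.toBytes("col1"),                       // assumed qualifier
            Bytes.toBytes(fields[1]));                   // cell value
    context.write(new ImmutableBytesWritable(put.getRow()), put);
  }
}
```

Because TableOutputFormat owns the table connection, the mapper never opens an HTable itself; it only emits Put objects and lets the framework handle buffering and commits.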
In typical MapReduce usage with HBase, a reducer is usually not needed, as the data is already sorted and has unique keys to be stored in the HBase tables. If a reducer is required for certain use cases, it should extend the TableReducer class, which again sets the input key and value types as:
static class HBaseTestReduce extends TableReducer<KEYIN, VALUEIN, KEYOUT>
Also, set it in the job configuration as:
TableMapReduceUtil.initTableReducerJob("customers", HBaseTestReduce.
class, job);
Here, the writes go to the region that is responsible for the rowkey that is being
written by the reduce task.
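A reducer of this shape can be sketched as follows; the aggregation logic, column family, and qualifier are illustrative assumptions for the hypothetical HBaseTestReduce class above, and the code assumes the HBase client libraries and a running cluster:

```java
import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

// Illustrative TableReducer: sums the counts for each key and writes one
// Put per key. The rowkey of the Put determines which region server
// receives the write.
public class HBaseTestReduce
    extends TableReducer<Text, IntWritable, ImmutableBytesWritable> {

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values,
      Context context) throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get();
    }
    Put put = new Put(Bytes.toBytes(key.toString()));    // rowkey = input key
    put.add(Bytes.toBytes("cf1"),                        // assumed family
            Bytes.toBytes("total"),                      // assumed qualifier
            Bytes.toBytes(sum));
    context.write(new ImmutableBytesWritable(put.getRow()), put);
  }
}
```

The third type parameter of TableReducer fixes the output key type, while the output value is always a mutation such as Put, which is why initTableReducerJob only needs the table name, the reducer class, and the job.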
HBase as a data source and sink
This use case mixes both of the previous ones, that is, HBase serves as a data source as well as a data sink. Let's look at a complete code example that uses HBase as both a source and a sink. This example reads the records from the Customer table for the column family cf1 and copies them to another table, CustomerTableCopy:
package com.ch4;
import java.io.IOException;
import java.util.HashMap;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;