tasks directly, it improves I/O and overall performance. With ordinary gzip-based compression, a gzip file is not splittable, so its split portions cannot be decompressed independently and the entire file must be processed by a single mapper. If a single mapper is used, the parallelization effort is undermined. With bzip2 this can be avoided and split portions can be sent to different mappers, but bzip2 decompression is very CPU intensive, so the gains in I/O are lost in CPU time. LZO is a good middle ground: with an index, LZO files are splittable, and both the compressed sizes and the decompression speeds are reasonable. Learn more about splittable LZO online at https://github.com/kevinweil/hadoop-lzo .
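To use LZO, the codec classes must be registered with Hadoop. A minimal configuration sketch follows, based on the setup described in the hadoop-lzo project; the exact codec list should match the codecs actually installed on your cluster:

```xml
<!-- core-site.xml: register the LZO codec classes alongside the defaults -->
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
```

Note that an LZO file becomes splittable only after an index is built for it, for example with the indexer tool shipped with hadoop-lzo.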
File Block Size
HDFS, the underlying distributed filesystem in Hadoop, allows the storage of very large files. The default block size in HDFS is 64 MB. If your cluster is small and the data size is large, the default block size spawns a large number of map tasks. For example, 120 GB of input would lead to 1,920 map tasks. This can be derived by a simple calculation as follows:
(120 * 1024)/64
Thus, increasing the block size seems logical in small clusters. However, it should not be increased to the point that some nodes in the cluster sit idle with no blocks to process.
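As a sketch, the block size can be raised in the HDFS configuration; the property name below is the classic (pre-Hadoop 2.x) one, matching the MapReduce-era properties used elsewhere in this section, and the value is in bytes:

```xml
<!-- hdfs-site.xml: raise the block size from 64 MB to 128 MB -->
<property>
  <name>dfs.block.size</name>
  <value>134217728</value>
</property>
```

With a 128 MB block size, the same 120 GB of input would yield (120 * 1024)/128 = 960 map tasks, half as many as before.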
Parallel Copying
Map outputs are copied over to reducers. When the output of a map task is large, the copying of values can be done in parallel by multiple threads. Increasing the number of threads increases CPU usage but reduces latency. The default number of such threads is 5. You can increase the number by setting the following property:
mapred.reduce.parallel.copies
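For example, a configuration fragment raising the copy parallelism might look like the following; the value 10 is illustrative, and the right setting depends on cluster size and network capacity:

```xml
<!-- mapred-site.xml: double the number of parallel copy threads per reducer -->
<property>
  <name>mapred.reduce.parallel.copies</name>
  <value>10</value>
</property>
```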
HBASE COPROCESSORS
HBase coprocessors are inspired by the coprocessor feature of Google Bigtable. Simple operations such as counting and aggregation can be pushed to the server to enhance performance, and coprocessors are the mechanism that achieves this.
Three interfaces in HBase, Coprocessor, RegionObserver, and Endpoint, implement the coprocessor framework in a flexible manner. The idea behind Coprocessor and RegionObserver is that you can insert user code by overriding upcall methods from these two related interfaces. The coprocessor framework handles the details of invoking the upcalls. More than one Coprocessor or RegionObserver can be loaded to extend functionality. They are chained to execute sequentially, ordered on the basis of their assigned priorities.
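The priority-ordered chaining can be illustrated with a small framework-free simulation. This is not the HBase API; the class and method names below are hypothetical stand-ins that show only the dispatch pattern: observers register with a priority, and the framework invokes their upcalls sequentially in priority order, each seeing the result of the previous one.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative simulation of coprocessor chaining; names are hypothetical,
// not the HBase RegionObserver API.
public class ObserverChain {
    interface RegionObserverLike {
        // Upcall invoked before a value is written; may transform the value.
        String prePut(String value);
    }

    private static class Registered {
        final int priority; // lower value = higher priority, runs earlier
        final RegionObserverLike observer;
        Registered(int priority, RegionObserverLike observer) {
            this.priority = priority;
            this.observer = observer;
        }
    }

    private final List<Registered> chain = new ArrayList<>();

    void register(int priority, RegionObserverLike obs) {
        chain.add(new Registered(priority, obs));
        // Keep the chain sorted so upcalls run in priority order.
        chain.sort(Comparator.comparingInt(r -> r.priority));
    }

    // The "framework" invokes each observer's upcall sequentially.
    String invokePrePut(String value) {
        for (Registered r : chain) {
            value = r.observer.prePut(value);
        }
        return value;
    }

    public static void main(String[] args) {
        ObserverChain chain = new ObserverChain();
        chain.register(2, v -> v + "|audited"); // runs second
        chain.register(1, String::trim);        // priority 1 runs first
        System.out.println(chain.invokePrePut("  row1 ")); // prints row1|audited
    }
}
```

Registration order does not matter; only the assigned priority determines the position in the chain, which mirrors how HBase orders loaded coprocessors.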
Through an Endpoint on the server side and the dynamic RPC provided by the client library, you can define your own extensions to the HBase RPC transactions exchanged between clients and the region servers.