tasks directly, it improves I/O and overall performance. With ordinary gzip-based compression, a gzip file is not splittable, so its split portions cannot be decompressed independently and the entire file must be processed by a single mapper. If a single mapper is used, the parallelization effort is undermined. With bzip2 this can be avoided and split portions can be sent to different mappers, but bzip2 decompression is very CPU intensive, so the gains in I/O are lost in CPU time. LZO is a good middle ground: with an index, LZO files are splittable, and both the compressed sizes and the decompression speeds are reasonable. Learn more about splittable LZO online at https://github.com/kevinweil/hadoop-lzo .
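To use LZO, the codec classes must be registered with Hadoop. A minimal configuration sketch follows, based on the setup described in the hadoop-lzo project; the exact codec list should match the codecs actually installed on your cluster:

```xml
<!-- core-site.xml: register the LZO codec classes alongside the defaults -->
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
```

Note that an LZO file becomes splittable only after an index is built for it, for example with the indexer tool shipped with hadoop-lzo.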
File Block Size
HDFS, the underlying distributed filesystem in Hadoop, allows the storage of very large files. The default block size in HDFS is 64 MB. If your cluster is small and the data size is large, the default block size spawns a large number of map tasks. For example, 120 GB of input would lead to 1,920 map tasks. This can be derived by a simple calculation as follows:
(120 * 1024)/64
Thus, increasing the block size seems logical in small clusters. However, it should not be increased to the point that some nodes in the cluster sit idle with no blocks to process.
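As a sketch, the block size can be raised in the HDFS configuration; the property name below is the classic (pre-Hadoop 2.x) one, matching the MapReduce-era properties used elsewhere in this section, and the value is in bytes:

```xml
<!-- hdfs-site.xml: raise the block size from 64 MB to 128 MB -->
<property>
  <name>dfs.block.size</name>
  <value>134217728</value>
</property>
```

With a 128 MB block size, the same 120 GB of input would yield (120 * 1024)/128 = 960 map tasks, half as many as before.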
Parallel Copying
Map outputs are copied over to reducers. When the output of a map task is large, the copying of values can be done in parallel by multiple threads. Increasing the number of threads increases CPU usage but reduces latency. The default number of such threads is 5. You can increase the number by setting the following property:
mapred.reduce.parallel.copies
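For example, a configuration fragment raising the copy parallelism might look like the following; the value 10 is illustrative, and the right setting depends on cluster size and network capacity:

```xml
<!-- mapred-site.xml: double the number of parallel copy threads per reducer -->
<property>
  <name>mapred.reduce.parallel.copies</name>
  <value>10</value>
</property>
```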
HBASE COPROCESSORS
HBase coprocessors are inspired by the coprocessor feature of Google Bigtable. Simple operations such as counting and aggregation can be pushed to the server to enhance performance, and coprocessors are the mechanism that achieves this.
Three interfaces in HBase, Coprocessor, RegionObserver, and Endpoint, implement the coprocessor framework in a flexible manner. The idea behind Coprocessor and RegionObserver is that you can insert user code by overriding upcall methods from these two related interfaces. The coprocessor framework handles the details of invoking the upcalls. More than one Coprocessor or RegionObserver can be loaded to extend functionality. They are chained to execute sequentially, ordered on the basis of their assigned priorities.
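The priority-ordered chaining can be illustrated with a small framework-free simulation. This is not the HBase API; the class and method names below are hypothetical stand-ins that show only the dispatch pattern: observers register with a priority, and the framework invokes their upcalls sequentially in priority order, each seeing the result of the previous one.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative simulation of coprocessor chaining; names are hypothetical,
// not the HBase RegionObserver API.
public class ObserverChain {
    interface RegionObserverLike {
        // Upcall invoked before a value is written; may transform the value.
        String prePut(String value);
    }

    private static class Registered {
        final int priority; // lower value = higher priority, runs earlier
        final RegionObserverLike observer;
        Registered(int priority, RegionObserverLike observer) {
            this.priority = priority;
            this.observer = observer;
        }
    }

    private final List<Registered> chain = new ArrayList<>();

    void register(int priority, RegionObserverLike obs) {
        chain.add(new Registered(priority, obs));
        // Keep the chain sorted so upcalls run in priority order.
        chain.sort(Comparator.comparingInt(r -> r.priority));
    }

    // The "framework" invokes each observer's upcall sequentially.
    String invokePrePut(String value) {
        for (Registered r : chain) {
            value = r.observer.prePut(value);
        }
        return value;
    }

    public static void main(String[] args) {
        ObserverChain chain = new ObserverChain();
        chain.register(2, v -> v + "|audited"); // runs second
        chain.register(1, String::trim);        // priority 1 runs first
        System.out.println(chain.invokePrePut("  row1 ")); // prints row1|audited
    }
}
```

Registration order does not matter; only the assigned priority determines the position in the chain, which mirrors how HBase orders loaded coprocessors.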
Through an Endpoint on the server side and the dynamic RPC provided by the client library, you can define your own extensions to the HBase RPC transactions exchanged between clients and the region servers.