LEVERAGING BLOOM FILTERS
Bloom Filters were introduced in Chapter 13. Please review the definition if you aren't sure what
they are.
A get row call in HBase currently does a parallel N-way get of that row from all StoreFiles in a
region. This implies N read requests from disk. Bloom Filters provide a lightweight in-memory
structure to reduce those N disk reads to only the files likely to contain that row.
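The idea can be sketched in a few lines of Python. This is a minimal illustration, not HBase's implementation: the BloomFilter class, the files_to_read and build_store_file helpers, the filter size, and the row keys are all hypothetical names and values chosen for the example. Each store file carries its own filter, and a get consults the filters first so that it reads only the files that might hold the row.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


def build_store_file(row_keys):
    """Hypothetical stand-in for a store file: (filter, rows it holds)."""
    bloom = BloomFilter()
    for key in row_keys:
        bloom.add(key)
    return bloom, set(row_keys)


def files_to_read(row_key, store_files):
    """Return the names of store files whose filter does not rule out row_key."""
    return [name for name, (bloom, _rows) in store_files.items()
            if bloom.might_contain(row_key)]


store_files = {
    "sf1": build_store_file(["row-001", "row-002"]),
    "sf2": build_store_file(["row-100", "row-101"]),
    "sf3": build_store_file(["row-002", "row-200"]),
}
candidates = files_to_read("row-002", store_files)
print(candidates)
```

A file that holds the row is always among the candidates. A file that does not hold it is usually skipped, but can occasionally slip through as a false positive, which costs one unnecessary read and never a wrong answer.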
Because the reads run in parallel, the performance gain on an individual get is minimal. Read
performance is also dominated by disk read latency. If you replaced the parallel get with a serial
get, you would see the impact of Bloom Filters on read latency.
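A rough latency model makes the difference concrete. The sketch below assumes a deliberately simplified picture in which a parallel get waits only for its slowest read and a serial get pays for every read in turn; it ignores caching, queueing, and disk contention. The function names and the per-file latencies are hypothetical.

```python
def parallel_latency(latencies_ms):
    # All reads are issued at once, so the get waits for the slowest one.
    return max(latencies_ms, default=0.0)


def serial_latency(latencies_ms):
    # Reads are issued one after another, so their latencies add up.
    return sum(latencies_ms)


# Hypothetical per-file disk read latencies, in milliseconds.
all_files = [9.0, 10.0, 11.0, 10.0]
# Hypothetical outcome: the Bloom Filters rule out two of the four files.
after_filter = [10.0, 11.0]

print(parallel_latency(all_files), parallel_latency(after_filter))
print(serial_latency(all_files), serial_latency(after_filter))
```

Under these assumptions the parallel get is no faster with the filter, because the slowest remaining read still sets the pace, while the serial get takes roughly half as long. The filter still saves disk reads in both cases.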
Bloom Filters are not free. As in-memory structures they consume memory, and they can be more
heavyweight than your data. This is one big reason why they aren't enabled by default.
SUMMARY
This chapter presented a few perspectives on tuning the performance of parallel MapReduce-based
processes. The MapReduce algorithm enables the processing of large amounts of data using
commodity hardware. Scaling MapReduce algorithms requires some clever configuration, and a
well-chosen configuration of MapReduce tasks can improve performance.
The chapter presented a few generic performance-tuning tips but used Hadoop and the associated
set of tools for illustration.