shuffling costs, increases the network overhead, and reduces the effectiveness of data partitioning. For example, log processing is a very common usage scenario for the Hadoop framework. In this scenario, data are accumulated in batches from event logs such as clickstreams, phone call records, application logs, or sequences of transactions. Each batch of data is ingested into Hadoop and stored in one or more HDFS files at regular intervals. Two of the most common operations in the log analysis of such applications are (1) joining the log data with some reference data and (2) sessionization, i.e., computing user sessions. The performance of both operations can be significantly improved if they exploit the benefits of data colocation.

CoHadoop [129] is a lightweight extension to Hadoop which is designed to enable colocating related files at the file system level while at the same time retaining good load balancing and fault tolerance properties. It introduces a new file property to identify related data files and modifies the data placement policy of Hadoop so that copies of those related files are colocated on the same set of servers. These changes are designed in a way that retains the benefits of Hadoop, including load balancing and fault tolerance. In principle, CoHadoop provides a generic mechanism that allows applications to control data placement at the file-system level. In particular, a new file-level property called a locator is introduced, and Hadoop's data placement policy is modified so that it makes use of this locator property. Each locator is represented by a unique value (ID); each file in HDFS is assigned to at most one locator, and many files can be assigned to the same locator. Files with the same locator are placed on the same set of datanodes, whereas files with no locator are placed via Hadoop's default strategy. It should be noted that this colocation process involves all data blocks, including replicas.
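To make the mechanism concrete, the following sketch illustrates how an application might tag two related files with a common locator at creation time. The configuration key cohadoop.file.locator and the surrounding code are illustrative assumptions only; the actual CoHadoop prototype exposes locators through its own modified HDFS client rather than this exact API.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical client-side sketch: files created under the same locator ID
// are placed on the same set of datanodes by the modified placement policy.
public class ColocatedIngest {
    // Locator 42 groups a log partition with its reference data (example value).
    static final long LOCATOR = 42L;

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Assumed extension point: passing the locator as a creation-time
        // property; vanilla HDFS has no such parameter.
        conf.setLong("cohadoop.file.locator", LOCATOR);
        FileSystem fs = FileSystem.get(conf);

        try (FSDataOutputStream log = fs.create(new Path("/logs/batch-0001/part-0"));
             FSDataOutputStream ref = fs.create(new Path("/ref/users/part-0"))) {
            log.writeBytes("userA,click,/home\n");
            ref.writeBytes("userA,US,premium\n");
        }
        // All blocks (including replicas) of both files now reside on the same
        // set of datanodes, so a later map-side join avoids shuffling.
    }
}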
Figure 9.9 shows an example of colocating two files, A and B, via a common locator. All of A's two HDFS blocks and B's three blocks are stored on the same set of datanodes. To manage the locator information and keep track of colocated files, CoHadoop introduces a new data structure, the locator table, which stores a mapping from each locator to the list of files that share it.
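Conceptually, the locator table is little more than a map from locator IDs to file lists. A minimal in-memory sketch (class and method names here are illustrative, not CoHadoop's) could look as follows:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of a locator table: locator ID -> files sharing that locator.
// In CoHadoop this state is maintained by the namenode; names are illustrative.
public class LocatorTable {
    private final Map<Long, List<String>> table = new HashMap<>();

    // Record that a newly created file was assigned the given locator.
    public void assign(long locator, String filePath) {
        table.computeIfAbsent(locator, k -> new ArrayList<>()).add(filePath);
    }

    // Files already sharing this locator; the placement policy can then pick
    // the datanodes holding an existing file as the target set for new blocks.
    public List<String> filesFor(long locator) {
        return table.getOrDefault(locator, new ArrayList<>());
    }
}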
In practice, the CoHadoop extension enables a wide variety of applications to exploit data colocation simply by specifying related files: colocating log files with reference files for joins, colocating partitions for grouping and aggregation, colocating index files with their data files, and colocating columns of a table.
Pipelining and Streaming Operations
The original implementation of the MapReduce framework has been designed in such a way that the entire output of each map and reduce task is materialized into a local file before it can be consumed by the next stage. This materialization step allows for the implementation of a simple and elegant checkpoint/restart fault tolerance mechanism. The MapReduce Online approach [108, 109] has been proposed as a modified architecture of the MapReduce framework in which intermediate data is pipelined between operators while preserving the programming interfaces and fault tolerance guarantees of the framework.
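The difference can be illustrated with a toy producer/consumer sketch, a simplification rather than MapReduce Online's actual implementation: map output records are pushed through a bounded in-memory queue and consumed by the reduce side as they are produced, so the two stages overlap in time instead of being separated by a materialization barrier.

import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Toy illustration of pipelined map -> reduce data flow.
public class PipelinedWordLength {
    private static final String EOF = "\u0000EOF"; // end-of-stream sentinel

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> pipe = new ArrayBlockingQueue<>(16);

        // "Map" task: emits each record as soon as it is produced instead of
        // materializing its entire output into a local file first.
        Thread mapper = new Thread(() -> {
            for (String word : List.of("colocate", "pipeline", "stream")) {
                try { pipe.put(word); } catch (InterruptedException ignored) { }
            }
            try { pipe.put(EOF); } catch (InterruptedException ignored) { }
        });

        // "Reduce" task: starts consuming before the map task has finished,
        // so the two stages overlap in time.
        Thread reducer = new Thread(() -> {
            int totalLength = 0;
            try {
                for (String w = pipe.take(); !w.equals(EOF); w = pipe.take()) {
                    totalLength += w.length();
                }
            } catch (InterruptedException ignored) { }
            System.out.println("total length = " + totalLength);
        });

        mapper.start();
        reducer.start();
        mapper.join();
        reducer.join();
    }
}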