2.3.5 Effective Data Placement
In the basic implementation of the Hadoop project, the objective of the data place-
ment policy is to achieve good load balance by distributing the data evenly across
the data servers, independently of the intended use of the data. This simple data
placement policy works well with most Hadoop applications that access just a sin-
gle file. However, applications that process data from multiple
files can gain a significant performance boost from customized placement strategies.
In these applications, the absence of data colocation increases the data-shuffling
costs, increases the network overhead, and reduces the effectiveness of data parti-
tioning. For example, log processing is a very common usage scenario for the Hadoop
framework. In this scenario, data are accumulated in batches from event logs such
as clickstreams, phone call records, application logs, or sequences of transactions.
Each batch of data is ingested into Hadoop and stored in one or more HDFS files
at regular intervals. Two of the most common operations in log analysis of these
applications are (1) joining the log data with some reference data and (2) sessioniza-
tion, that is, computing user sessions. The performance of such operations can be
significantly improved if they utilize the benefits of data colocation. CoHadoop [51]
is a lightweight extension to Hadoop, which is designed to enable colocating related
files at the file system level while at the same time retaining the good load balancing
and fault tolerance properties. It introduces a new file property to identify related
data files and modifies Hadoop's data placement policy to colocate copies of
those related files on the same set of servers. These changes are designed to retain
the benefits of Hadoop, including load balancing and fault tolerance. In principle,
CoHadoop provides a generic mechanism that allows applications to control data
placement at the file-system level. In particular, a new file-level property called a
locator is introduced, and Hadoop's data placement policy is modified so that
it makes use of this locator property. Each locator is represented by a unique value
(ID); each file in HDFS is assigned to at most one locator, and many files can be
assigned to the same locator. Files with the same locator are placed on the same set
of datanodes, whereas files with no locator are placed via Hadoop's default strategy.
It should be noted that this colocation process involves all data blocks, including
replicas. Figure 2.9 shows an example of colocating two files, A and B, via a common
locator. All of A's two HDFS blocks and B's three blocks are stored on the same set
of datanodes. To manage the locator information and keep track of colocated files,
CoHadoop introduces a new data structure, the locator table, which stores a map-
ping of locators to the list of files that share this locator. In practice, the CoHadoop
extension enables a wide variety of applications to exploit data colocation by simply
specifying related files, such as colocating log files with reference files for joins,
colocating partitions for grouping and aggregation, colocating index files with their
data files, and colocating columns of a table.
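To make the locator mechanism concrete, the following is a minimal sketch in plain Java of a locator table and a colocating placement decision. It is not CoHadoop's actual code or the HDFS block-placement API; all class and method names (LocatorPlacementSketch, assignLocator, chooseTargets) are illustrative assumptions. The idea it captures is the one described above: files that share a locator reuse one fixed set of datanodes for all of their blocks and replicas, while files without a locator fall back to a default random choice.

import java.util.*;

public class LocatorPlacementSketch {

    // Locator table: locator ID -> list of files assigned to that locator.
    private final Map<Integer, List<String>> locatorTable = new HashMap<>();
    // Locator ID -> the set of datanodes chosen for that locator.
    private final Map<Integer, List<String>> locatorNodes = new HashMap<>();
    private final List<String> allDataNodes;
    private final Random random = new Random();

    public LocatorPlacementSketch(List<String> dataNodes) {
        this.allDataNodes = dataNodes;
    }

    // Assign a file to at most one locator; many files may share a locator.
    public void assignLocator(String file, int locator) {
        locatorTable.computeIfAbsent(locator, k -> new ArrayList<>()).add(file);
    }

    // Choose target datanodes for a block of the given file. Files with the
    // same locator reuse the same node set (so all blocks, replicas included,
    // are colocated); files with no locator use a default random placement.
    public List<String> chooseTargets(String file, Integer locator, int replication) {
        if (locator == null) {
            return pickRandomNodes(replication);            // default HDFS-like policy
        }
        return locatorNodes.computeIfAbsent(locator,
                k -> pickRandomNodes(replication));          // first file fixes the node set
    }

    private List<String> pickRandomNodes(int replication) {
        List<String> shuffled = new ArrayList<>(allDataNodes);
        Collections.shuffle(shuffled, random);
        return new ArrayList<>(shuffled.subList(0, Math.min(replication, shuffled.size())));
    }

    public static void main(String[] args) {
        LocatorPlacementSketch policy = new LocatorPlacementSketch(
                Arrays.asList("dn1", "dn2", "dn3", "dn4", "dn5"));

        // Colocate files A and B via a common locator, as in Figure 2.9.
        policy.assignLocator("A", 1);
        policy.assignLocator("B", 1);

        // Every block of A and B lands on the same set of datanodes.
        System.out.println("A block 1 -> " + policy.chooseTargets("A", 1, 3));
        System.out.println("B block 1 -> " + policy.chooseTargets("B", 1, 3));

        // A file with no locator is placed by the default strategy.
        System.out.println("C block 1 -> " + policy.chooseTargets("C", null, 3));
    }
}

In this sketch the node set for a locator is fixed by whichever file is written first, mirroring the behavior described above: later files with the same locator simply inherit that set, so a join or sessionization job can read all related blocks from the same datanodes.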
2.3.6 Pipelining and Streaming Operations
The original implementation of the MapReduce framework was designed in a way
that requires the entire output of each map and reduce task to be materialized into a local