2.3.5 Effective Data Placement
In the basic implementation of the Hadoop project, the objective of the data place-
ment policy is to achieve good load balance by distributing the data evenly across
the data servers, independently of the intended use of the data. This simple data
placement policy works well with most Hadoop applications that access just a sin-
gle file. However, applications that process data from multiple
files can gain a significant performance boost from customized placement strategies.
In these applications, the absence of data colocation increases the data-shuffling
costs, increases the network overhead, and reduces the effectiveness of data parti-
tioning. For example, log processing is a very common usage scenario for the Hadoop
framework. In this scenario, data are accumulated in batches from event logs such
as clickstreams, phone call records, application logs, or sequences of transactions.
Each batch of data is ingested into Hadoop and stored in one or more HDFS files
at regular intervals. Two of the most common operations in log analysis of these
applications are (1) joining the log data with some reference data and (2) sessioniza-
tion, that is, computing user sessions. The performance of such operations can be
significantly improved if they utilize the benefits of data colocation. CoHadoop [51]
is a lightweight extension to Hadoop, which is designed to enable colocating related
files at the file system level while at the same time retaining the good load balancing
and fault tolerance properties. It introduces a new file property to identify related
data files and modifies Hadoop's data placement policy to colocate copies of
those related files on the same set of servers. These changes are designed to retain
the benefits of Hadoop, including load balancing and fault tolerance. In principle,
CoHadoop provides a generic mechanism that allows applications to control data
placement at the file-system level. In particular, a new file-level property called a
locator is introduced, and Hadoop's data placement policy is modified so that
it makes use of this locator property. Each locator is represented by a unique value
(ID); each file in HDFS is assigned to at most one locator, and many files can be
assigned to the same locator. Files with the same locator are placed on the same set
of datanodes, whereas files with no locator are placed via Hadoop's default strategy.
It should be noted that this colocation process involves all data blocks, including
replicas. Figure 2.9 shows an example of colocating two files, A and B, via a common
locator. All of A's two HDFS blocks and B's three blocks are stored on the same set
of datanodes. To manage the locator information and keep track of colocated files,
CoHadoop introduces a new data structure, the locator table, which stores a map-
ping of locators to the list of files that share this locator. In practice, the CoHadoop
extension enables a wide variety of applications to exploit data colocation by simply
specifying related files, such as colocating log files with reference files for joins,
colocating partitions for grouping and aggregation, colocating index files with their
data files, and colocating columns of a table.
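To make the locator mechanism concrete, the following is a minimal sketch in plain Java of a locator table and a colocating placement decision. It is not CoHadoop's actual code or the HDFS block-placement API; all class and method names (LocatorPlacementSketch, assignLocator, chooseTargets) are illustrative assumptions. The idea it captures is the one described above: files that share a locator reuse one fixed set of datanodes for all of their blocks and replicas, while files without a locator fall back to a default random choice.

import java.util.*;

public class LocatorPlacementSketch {

    // Locator table: locator ID -> list of files assigned to that locator.
    private final Map<Integer, List<String>> locatorTable = new HashMap<>();
    // Locator ID -> the set of datanodes chosen for that locator.
    private final Map<Integer, List<String>> locatorNodes = new HashMap<>();
    private final List<String> allDataNodes;
    private final Random random = new Random();

    public LocatorPlacementSketch(List<String> dataNodes) {
        this.allDataNodes = dataNodes;
    }

    // Assign a file to at most one locator; many files may share a locator.
    public void assignLocator(String file, int locator) {
        locatorTable.computeIfAbsent(locator, k -> new ArrayList<>()).add(file);
    }

    // Choose target datanodes for a block of the given file. Files with the
    // same locator reuse the same node set (so all blocks, replicas included,
    // are colocated); files with no locator use a default random placement.
    public List<String> chooseTargets(String file, Integer locator, int replication) {
        if (locator == null) {
            return pickRandomNodes(replication);            // default HDFS-like policy
        }
        return locatorNodes.computeIfAbsent(locator,
                k -> pickRandomNodes(replication));          // first file fixes the node set
    }

    private List<String> pickRandomNodes(int replication) {
        List<String> shuffled = new ArrayList<>(allDataNodes);
        Collections.shuffle(shuffled, random);
        return new ArrayList<>(shuffled.subList(0, Math.min(replication, shuffled.size())));
    }

    public static void main(String[] args) {
        LocatorPlacementSketch policy = new LocatorPlacementSketch(
                Arrays.asList("dn1", "dn2", "dn3", "dn4", "dn5"));

        // Colocate files A and B via a common locator, as in Figure 2.9.
        policy.assignLocator("A", 1);
        policy.assignLocator("B", 1);

        // Every block of A and B lands on the same set of datanodes.
        System.out.println("A block 1 -> " + policy.chooseTargets("A", 1, 3));
        System.out.println("B block 1 -> " + policy.chooseTargets("B", 1, 3));

        // A file with no locator is placed by the default strategy.
        System.out.println("C block 1 -> " + policy.chooseTargets("C", null, 3));
    }
}

In this sketch the node set for a locator is fixed by whichever file is written first, mirroring the behavior described above: later files with the same locator simply inherit that set, so a join or sessionization job can read all related blocks from the same datanodes.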
2.3.6 Pipelining and Streaming Operations
The original implementation of the MapReduce framework was designed in a way
that requires the entire output of each map and reduce task to be materialized into a local