shuffling costs, increases the network overhead, and reduces the effectiveness of data partitioning. For example, log processing is a very common usage scenario for the Hadoop framework. In this scenario, data are accumulated in batches from event logs such as clickstreams, phone call records, application logs, or sequences of transactions. Each batch of data is ingested into Hadoop and stored in one or more HDFS files at regular intervals. Two of the most common operations in the log analysis of such applications are (1) joining the log data with some reference data and (2) sessionization, i.e., computing user sessions. The performance of both operations can be significantly improved if they exploit the benefits of data colocation.

CoHadoop [129] is a lightweight extension to Hadoop which is designed to enable colocating related files at the file system level while at the same time retaining good load balancing and fault tolerance properties. It introduces a new file property to identify related data files and modifies the data placement policy of Hadoop so that copies of those related files are colocated on the same set of servers. These changes are designed in a way that retains the benefits of Hadoop, including load balancing and fault tolerance. In principle, CoHadoop provides a generic mechanism that allows applications to control data placement at the file-system level. In particular, a new file-level property called a locator is introduced, and Hadoop's data placement policy is modified so that it makes use of this locator property. Each locator is represented by a unique value (ID); each file in HDFS is assigned to at most one locator, and many files can be assigned to the same locator. Files with the same locator are placed on the same set of datanodes, whereas files with no locator are placed via Hadoop's default strategy. It should be noted that this colocation process involves all data blocks, including replicas.
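To make the mechanism concrete, the following sketch illustrates how an application might tag two related files with a common locator at creation time. The configuration key cohadoop.file.locator and the surrounding code are illustrative assumptions only; the actual CoHadoop prototype exposes locators through its own modified HDFS client rather than this exact API.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical client-side sketch: files created under the same locator ID
// are placed on the same set of datanodes by the modified placement policy.
public class ColocatedIngest {
    // Locator 42 groups a log partition with its reference data (example value).
    static final long LOCATOR = 42L;

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Assumed extension point: passing the locator as a creation-time
        // property; vanilla HDFS has no such parameter.
        conf.setLong("cohadoop.file.locator", LOCATOR);
        FileSystem fs = FileSystem.get(conf);

        try (FSDataOutputStream log = fs.create(new Path("/logs/batch-0001/part-0"));
             FSDataOutputStream ref = fs.create(new Path("/ref/users/part-0"))) {
            log.writeBytes("userA,click,/home\n");
            ref.writeBytes("userA,US,premium\n");
        }
        // All blocks (including replicas) of both files now reside on the same
        // set of datanodes, so a later map-side join avoids shuffling.
    }
}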
Figure 9.9 shows an example of colocating two files, A and B, via a common locator. All of A's two HDFS blocks and B's three blocks are stored on the same set of datanodes. To manage the locator information and keep track of colocated files, CoHadoop introduces a new data structure, the locator table, which stores a mapping from each locator to the list of files that share it.
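Conceptually, the locator table is little more than a map from locator IDs to file lists. A minimal in-memory sketch (class and method names here are illustrative, not CoHadoop's) could look as follows:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of a locator table: locator ID -> files sharing that locator.
// In CoHadoop this state is maintained by the namenode; names are illustrative.
public class LocatorTable {
    private final Map<Long, List<String>> table = new HashMap<>();

    // Record that a newly created file was assigned the given locator.
    public void assign(long locator, String filePath) {
        table.computeIfAbsent(locator, k -> new ArrayList<>()).add(filePath);
    }

    // Files already sharing this locator; the placement policy can then pick
    // the datanodes holding an existing file as the target set for new blocks.
    public List<String> filesFor(long locator) {
        return table.getOrDefault(locator, new ArrayList<>());
    }
}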
In practice, the CoHadoop extension enables a wide variety of applications to exploit data colocation simply by specifying related files: colocating log files with reference files for joins, colocating partitions for grouping and aggregation, colocating index files with their data files, and colocating columns of a table.
Pipelining and Streaming Operations
The original implementation of the MapReduce framework has been designed in such a way that the entire output of each map and reduce task is materialized into a local file before it can be consumed by the next stage. This materialization step allows for the implementation of a simple and elegant checkpoint/restart fault tolerance mechanism. The MapReduce Online approach [108, 109] has been proposed as a modified architecture of the MapReduce framework in which intermediate data is pipelined between operators while preserving the programming interfaces and fault tolerance guarantees of the framework.
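The difference can be illustrated with a toy producer/consumer sketch, a simplification rather than MapReduce Online's actual implementation: map output records are pushed through a bounded in-memory queue and consumed by the reduce side as they are produced, so the two stages overlap in time instead of being separated by a materialization barrier.

import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Toy illustration of pipelined map -> reduce data flow.
public class PipelinedWordLength {
    private static final String EOF = "\u0000EOF"; // end-of-stream sentinel

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> pipe = new ArrayBlockingQueue<>(16);

        // "Map" task: emits each record as soon as it is produced instead of
        // materializing its entire output into a local file first.
        Thread mapper = new Thread(() -> {
            for (String word : List.of("colocate", "pipeline", "stream")) {
                try { pipe.put(word); } catch (InterruptedException ignored) { }
            }
            try { pipe.put(EOF); } catch (InterruptedException ignored) { }
        });

        // "Reduce" task: starts consuming before the map task has finished,
        // so the two stages overlap in time.
        Thread reducer = new Thread(() -> {
            int totalLength = 0;
            try {
                for (String w = pipe.take(); !w.equals(EOF); w = pipe.take()) {
                    totalLength += w.length();
                }
            } catch (InterruptedException ignored) { }
            System.out.println("total length = " + totalLength);
        });

        mapper.start();
        reducer.start();
        mapper.join();
        reducer.join();
    }
}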