Distributed Programming for the Cloud - Large Scale and Big Data: Processing and Management - page 80

FIGURE 2.17 Sample Jaql script. (From K. S. Beyer et al., PVLDB, 4(12), 1272-1283, 2011.)

systems (e.g., Hadoop's HDFS), database systems (e.g., DB2, Netezza, HBase), or from streamed sources like the Web. Unlike federated databases, however, most of the accessed data is stored within the same cluster and the I/O API describes data partitioning, which enables parallelism with data affinity during evaluation. Jaql derives much of this flexibility from Hadoop's I/O API. It reads and writes many common file formats (e.g., delimited files, JSON text, Hadoop sequence files). Custom adapters are easily written to map a data set to or from Jaql's data model. The input can even simply be values constructed in the script itself. The Jaql interpreter evaluates the script locally on the computer that compiled the script, but spawns interpreters on remote nodes using MapReduce. The Jaql compiler automatically detects parallelization opportunities in a Jaql script and translates it to a set of MapReduce jobs.

2.5 SAMPLE MapReduce-BASED APPLICATIONS

MapReduce-based systems are increasingly being used for large-scale data analysis. There are several reasons for this, such as [77]:

• The interface of MapReduce is simple yet expressive. Although MapReduce involves only two functions, map and reduce, a number of data analysis tasks, including traditional SQL queries, data mining, machine learning, and graph processing, can be expressed as a set of MapReduce jobs.

• MapReduce is flexible. It is designed to be independent of storage systems and is able to analyze various kinds of data, both structured and unstructured.

• MapReduce is scalable. A MapReduce installation can run over thousands of nodes on a shared-nothing cluster while still providing fine-grained fault tolerance, whereby only the tasks on failed nodes need to be restarted.
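To make the first point concrete, the following is a minimal sketch of how an SQL-style aggregation (roughly SELECT dept, SUM(salary) ... GROUP BY dept) might be expressed with only a map and a reduce function. It is an illustration under stated assumptions, not the API of Hadoop or of any system cited here: the names map_fn, reduce_fn, run_mapreduce, and the sample rows are hypothetical, and the driver runs in a single process, whereas a real framework would distribute the map and reduce tasks across nodes.

```python
from collections import defaultdict
from itertools import chain


def map_fn(record):
    """Emit (key, value) pairs: here, (department, salary) for one row."""
    yield record["dept"], record["salary"]


def reduce_fn(key, values):
    """Aggregate all values that share a key: here, the salary total."""
    return key, sum(values)


def run_mapreduce(records, map_fn, reduce_fn):
    """Toy single-process driver (hypothetical): map, group by key, reduce."""
    groups = defaultdict(list)
    # Map phase followed by the "shuffle": collect values under their key.
    for key, value in chain.from_iterable(map_fn(r) for r in records):
        groups[key].append(value)
    # Reduce phase: one call per distinct key, in sorted key order.
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))


# Hypothetical sample input, made up for this illustration.
rows = [
    {"dept": "eng", "salary": 100},
    {"dept": "ops", "salary": 70},
    {"dept": "eng", "salary": 120},
]
totals = run_mapreduce(rows, map_fn, reduce_fn)
print(totals)  # {'eng': 220, 'ops': 70}
```

Because each map call depends only on its own record and each reduce call only on the values for one key, the tasks are independent of one another, which is what allows a framework to rerun only the tasks that were on a failed node.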

These main advantages have triggered several research efforts aimed at applying the MapReduce framework to challenging data-processing problems on large-scale data sets in different domains. For example, [53] proposed an SQL-like query language for large-scale analysis of XML data on a MapReduce platform, called MRQL (the MapReduce Query Language). The evaluation system of MRQL leverages relational query optimization techniques and compiles
