Distributed Programming for the Cloud - Large Scale and Big Data: Processing and Management

returns results as key-value pairs. The Catalog component maintains metadata about

the databases, their location, replica locations, and data-partitioning properties. The

Data Loader component is responsible for globally repartitioning data on a given

partition key upon loading and breaking apart single-node data into multiple smaller

partitions or chunks. The SMS planner extends the HiveQL translator [123] (Section

1.4.3) and transforms SQL into MapReduce jobs that connect to tables stored as files

in HDFS. Abouzeid et al. [4] demonstrated HadoopDB running the

following two application types:

1. A semantic web application that provides biological data analysis of protein

sequences.

2. A classical business data warehouse.
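
The Data Loader's two steps described above (global repartitioning on a partition key, then splitting each node's data into smaller chunks) can be illustrated with a minimal Python sketch. This is not HadoopDB code: the function names, the CRC32 hash, and the fixed chunk size are assumptions chosen only for illustration.

```python
from collections import defaultdict
from zlib import crc32


def partition_id(key, num_partitions):
    """Map a partition-key value to a partition using a stable hash (assumed: CRC32)."""
    return crc32(str(key).encode("utf-8")) % num_partitions


def repartition(rows, key_field, num_nodes):
    """Global step: route every row to a node by hashing its partition key."""
    nodes = defaultdict(list)
    for row in rows:
        nodes[partition_id(row[key_field], num_nodes)].append(row)
    return dict(nodes)


def chunk(rows, chunk_size):
    """Local step: break one node's rows into smaller chunks of at most chunk_size rows."""
    return [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]


# Hypothetical sample data: ten rows keyed by customer_id.
rows = [{"customer_id": i, "amount": 10 * i} for i in range(10)]
by_node = repartition(rows, "customer_id", num_nodes=3)
chunks = {node: chunk(node_rows, chunk_size=2) for node, node_rows in by_node.items()}
```

Because the hash is deterministic, rows that share a partition key always land on the same node, which is what lets later joins or aggregations on that key run locally.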

2.4.9 Jaql

Jaql* is a query language designed for JavaScript Object Notation (JSON),† a

data format that has become popular because of its simplicity and modeling

flexibility. JSON is a simple yet flexible way to represent data that ranges from flat,

relational data to semistructured, XML data. Jaql is primarily used to analyze

large-scale semistructured data. It is a functional, declarative query language that rewrites

high-level queries, when appropriate, into low-level queries consisting of MapReduce

jobs that are evaluated using Apache Hadoop. Core features include user

extensibility and parallelism. Jaql consists of a scripting language and compiler as

well as a runtime component [18]. It is able to process data with no schema or only

with a partial schema. However, Jaql can also exploit rigid schema information when

it is available, for both type checking and improved performance.

Jaql uses a very simple data model (JDM): a JDM value is an atom, an array, or a record.

Most common atomic types are supported, including strings, numbers, nulls, and

dates. Arrays and records are compound types that can be arbitrarily nested. In more

detail, an array is an ordered collection of values and can be used to model data

structures such as vectors, lists, sets, or bags. A record is an unordered collection of

name-value pairs and can model structs, dictionaries, and maps. Despite its

simplicity, JDM is very flexible. It allows Jaql to operate with a variety of different data

representations for both input and output, including delimited text files, JSON files,

binary files, Hadoop's sequence files, relational databases, key-value stores, or XML

documents. Functions are first-class values in Jaql. They can be assigned to a

variable and are higher-order in that they can be passed as parameters or used as a return

value. Functions are the key ingredient for reusability as any Jaql expression can be

encapsulated in a function, and a function can be parameterized in powerful ways.
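
To make the data model and the role of functions concrete, the following Python sketch mirrors the ideas above rather than Jaql syntax. It assumes a mapping of arrays to Python lists, records to dicts, and nulls to None; the helper names (is_jdm_value, project, transform) and the sample record are hypothetical, and booleans are accepted as atoms as an additional assumption.

```python
from datetime import date

# Atoms: strings, numbers, nulls (None here), and dates.
# Arrays map to Python lists (ordered); records map to dicts (name-value pairs).
employee = {
    "name": "Ada",
    "hired": date(2010, 5, 1),
    "manager": None,
    "skills": ["sql", "json"],                     # array nested in a record
    "address": {"city": "Zurich", "zip": "8001"},  # record nested in a record
}


def is_jdm_value(value):
    """Check recursively that a value is an atom, an array, or a record."""
    if value is None or isinstance(value, (str, int, float, bool, date)):
        return True
    if isinstance(value, list):
        return all(is_jdm_value(item) for item in value)
    if isinstance(value, dict):
        return all(isinstance(name, str) and is_jdm_value(v)
                   for name, v in value.items())
    return False


def project(field):
    """Return a function (a first-class value) that extracts one field of a record."""
    return lambda record: record.get(field)


def transform(values, fn):
    """Higher-order helper: apply a function passed as a parameter to each element."""
    return [fn(v) for v in values]


get_name = project("name")  # a function assigned to a variable
names = transform([employee, {"name": "Lin"}], get_name)
```

Here project returns a function and transform accepts one as a parameter, which are the two higher-order uses the text describes.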

Figure 2.16 depicts an example of a Jaql script that consists of a sequence of

operators. The read operator loads raw data, in this case from the Hadoop Distributed File

System (HDFS), and converts it into Jaql values. These values are processed by the

countFields subflow, which extracts field names and computes their frequencies.
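
The script in Figure 2.16 is not reproduced here. As a rough illustration of what a subflow such as countFields computes, the following Python sketch extracts field names from records and counts their frequencies in a map step and a reduce step. The function names and sample records are assumptions, and the code runs on a single machine rather than on Hadoop.

```python
from collections import Counter
from functools import reduce


def map_fields(record):
    """Map step: emit a (field name, 1) pair for each field of one record."""
    return [(name, 1) for name in record]


def reduce_counts(totals, pair):
    """Reduce step: add one emitted pair to the running per-field totals."""
    name, count = pair
    totals[name] += count
    return totals


def count_fields(records):
    """Extract field names from records and compute how often each one occurs."""
    pairs = [pair for record in records for pair in map_fields(record)]
    return dict(reduce(reduce_counts, pairs, Counter()))


# Hypothetical semistructured input: records need not share the same fields.
records = [
    {"id": 1, "name": "a"},
    {"id": 2, "tags": ["x"]},
    {"id": 3, "name": "c", "tags": []},
]
field_counts = count_fields(records)
```

Since the records carry no fixed schema, the field names are discovered from the data itself, which matches the schema-optional processing described earlier.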

* http://code.google.com/p/jaql/.

† http://www.json.org/.
