returns results as key-value pairs. The Catalog component maintains metadata about
the databases, their location, replica locations, and data-partitioning properties. The
Data Loader component is responsible for globally repartitioning data on a given
partition key upon loading and breaking apart single-node data into multiple smaller
partitions or chunks. The SMS planner extends the HiveQL translator [123] (Section
1.4.3) and transforms SQL into MapReduce jobs that connect to tables stored as files
in HDFS. Abouzeid et al. [4] have demonstrated HadoopDB in action running the
following two different application types:
1. A semantic web application that provides biological data analysis of protein
sequences.
2. A classical business data warehouse.
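The Data Loader's two steps described above (global repartitioning on a partition key, then splitting each node's data into smaller chunks) can be illustrated with a minimal Python sketch. This is not HadoopDB's code; the function names `global_partition` and `local_chunks`, the CRC32 hash, and the sample records are all assumptions made for illustration.

```python
import zlib
from collections import defaultdict


def global_partition(records, key, num_nodes):
    """Assign every record to a node by hashing its partition key (hypothetical helper)."""
    nodes = defaultdict(list)
    for rec in records:
        # crc32 is stable across runs, unlike Python's built-in hash() for strings.
        node = zlib.crc32(str(rec[key]).encode("utf-8")) % num_nodes
        nodes[node].append(rec)
    return dict(nodes)


def local_chunks(node_records, chunk_size):
    """Break one node's records into smaller fixed-size chunks (hypothetical helper)."""
    return [node_records[i:i + chunk_size]
            for i in range(0, len(node_records), chunk_size)]


# Illustrative data: ten records partitioned on "user_id" across three nodes.
records = [{"user_id": i, "amount": 10 * i} for i in range(10)]
by_node = global_partition(records, "user_id", num_nodes=3)
chunked = {node: local_chunks(recs, chunk_size=2) for node, recs in by_node.items()}
```

Hashing on the partition key means records with equal keys always land on the same node, which is what lets later joins or aggregations on that key run locally.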
2.4.9 Jaql
Jaql* is a query language that is designed for JavaScript Object Notation (JSON), a
data format that has become popular because of its simplicity and modeling flexi-
bility. JSON is a simple, yet flexible way to represent data that ranges from flat,
relational data to semistructured, XML data. Jaql is primarily used to analyze large-
scale semistructured data. It is a functional, declarative query language that rewrites
high-level queries, when appropriate, into low-level queries consisting of MapReduce
jobs that are evaluated using Apache Hadoop. Core features include user
extensibility and parallelism. Jaql consists of a scripting language and compiler as
well as a runtime component [18]. It is able to process data with no schema or only
with a partial schema. However, Jaql can also exploit rigid schema information when
it is available, for both type checking and improved performance.
Jaql uses a very simple data model, the Jaql data model (JDM): a JDM value is an atom, an array, or a record.
Most common atomic types are supported, including strings, numbers, nulls, and
dates. Arrays and records are compound types that can be arbitrarily nested. In more
detail, an array is an ordered collection of values and can be used to model data
structures such as vectors, lists, sets, or bags. A record is an unordered collection of
name-value pairs and can model structs, dictionaries, and maps. Despite its simplic-
ity, JDM is very flexible. It allows Jaql to operate with a variety of different data
representations for both input and output, including delimited text files, JSON files,
binary files, Hadoop's sequence files, relational databases, key-value stores, or XML
documents. Functions are first-class values in Jaql. They can be assigned to a variable
and are higher-order in that they can be passed as parameters or used as a return
value. Functions are the key ingredient for reusability as any Jaql expression can be
encapsulated in a function, and a function can be parameterized in powerful ways.
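To make the data model and the role of functions concrete, here is a minimal Python sketch, not Jaql syntax. It models atoms as Python scalars and dates, arrays as lists, and records as dicts with string names; the helper names `is_jdm_value`, `make_field_getter`, and `apply_to_all` are hypothetical, and the chosen atom types are an assumption rather than Jaql's full type list.

```python
from datetime import date, datetime

# Assumed stand-ins for atomic types: strings, numbers, booleans, nulls, dates.
ATOM_TYPES = (str, int, float, bool, type(None), date, datetime)


def is_jdm_value(value):
    """Return True if value is an atom, an array of values, or a record of values."""
    if isinstance(value, ATOM_TYPES):
        return True
    if isinstance(value, list):  # array: ordered collection of values
        return all(is_jdm_value(v) for v in value)
    if isinstance(value, dict):  # record: name-value pairs with string names
        return all(isinstance(k, str) and is_jdm_value(v) for k, v in value.items())
    return False


def make_field_getter(name):
    """A function used as a return value: builds an accessor for one field."""
    return lambda record: record.get(name)


def apply_to_all(fn, array):
    """A function passed as a parameter: applies fn to each element of an array."""
    return [fn(v) for v in array]


# Arrays and records nest arbitrarily.
doc = {"id": 1, "tags": ["a", "b"], "owner": {"name": "x", "joined": date(2010, 1, 1)}}
get_id = make_field_getter("id")  # a function assigned to a variable
ids = apply_to_all(get_id, [doc, {"id": 2}])
```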
Figure 2.16 depicts an example of a Jaql script that consists of a sequence of operators.
The read operator loads raw data, in this case from the Hadoop Distributed File
System (HDFS), and converts it into Jaql values. These values are processed by the
countFields subflow, which extracts field names and computes their frequencies.
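The read-then-count flow described here can be approximated in Python as a rough sketch. It does not reproduce the Jaql script of Figure 2.16: the input is an in-memory list of JSON lines rather than a file in HDFS, only top-level field names are counted, and the names `read_json_lines` and `count_fields` are hypothetical.

```python
import json
from collections import Counter


def read_json_lines(lines):
    """Stand-in for the read step: turn raw JSON text lines into record values."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)


def count_fields(records):
    """Extract the top-level field names of each record and count their frequencies."""
    counts = Counter()
    for rec in records:
        counts.update(rec.keys())
    return counts


# Illustrative raw input; a real job would read this from distributed storage.
raw = ['{"a": 1, "b": 2}', '{"a": 3}', '', '{"b": 4, "c": 5}']
field_counts = count_fields(read_json_lines(raw))
```

With this sample input, the fields `a` and `b` each appear in two records and `c` in one.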
* http://code.google.com/p/jaql/.
http://www.json.org/.