multiple single-node database systems (PostgreSQL) using Hadoop as the task
coordinator and network communication layer. Queries are expressed in SQL, but
their execution is parallelized across nodes using the MapReduce framework;
as much of the single-node query work as possible is pushed inside the
corresponding node databases. Thus, HadoopDB tries to achieve fault tolerance and
the ability to operate in heterogeneous environments by inheriting the scheduling
and job tracking implementation from Hadoop. At the same time, it tries to achieve the
performance of parallel databases by doing most of the query processing inside
the database engine. Figure 9.15 illustrates the architecture of HadoopDB, which
consists of two layers: (1) a data storage layer, the Hadoop Distributed File
System (HDFS) [26], and (2) a data processing layer, the MapReduce framework.
In this architecture, HDFS is a block-structured file system managed by a central
NameNode. Individual files are broken into blocks of a fixed size and distributed
across multiple DataNodes in the cluster. The NameNode maintains metadata about
the size and location of blocks and their replicas. The MapReduce Framework
follows a simple master-slave architecture. The master is a single JobTracker and
the slaves or worker nodes are TaskTrackers. The JobTracker handles the runtime
scheduling of MapReduce jobs and maintains information on each TaskTracker's
load and available resources. The Database Connector is the interface between
independent database systems residing on nodes in the cluster and TaskTrackers.
The Connector connects to the database, executes the SQL query and returns results
as key-value pairs. The Catalog component maintains metadata about the databases,
their location, replica locations and data partitioning properties. The Data Loader
component is responsible for globally repartitioning data on a given partition key
upon loading and breaking apart single node data into multiple smaller partitions
or chunks. The SMS planner extends the HiveQL translator [222] (Sect. 9.4) and
transforms SQL into MapReduce jobs that connect to tables stored as files in HDFS.
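
The role of the Database Connector described above can be illustrated with a minimal sketch. It is not HadoopDB's actual code: the function name run_connector and the sales table are hypothetical, and Python's built-in sqlite3 module stands in for the node-local PostgreSQL instance so that the example is self-contained. It only shows the idea of executing a SQL query inside the node database and handing the rows back as key-value pairs.

```python
import sqlite3


def run_connector(conn, sql, key_column=0):
    """Run `sql` on a node-local database and yield (key, value) pairs.

    Hypothetical helper: the key is the column at `key_column`,
    the value is a tuple of the remaining columns.
    """
    cursor = conn.execute(sql)
    for row in cursor:
        key = row[key_column]
        value = tuple(v for i, v in enumerate(row) if i != key_column)
        yield key, value


# Assumption: an in-memory SQLite database plays the role of one node's database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10), ("west", 5), ("east", 7)],
)

# The aggregation runs inside the database engine; only the
# per-region totals leave the node as key-value pairs.
pairs = list(
    run_connector(
        conn,
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region",
    )
)
```

In this sketch the grouping and summation are pushed into the database, so the surrounding Map task would only have to forward the resulting pairs.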
Abouzeid et al. [59] have demonstrated HadoopDB running two different types of
applications:
1. A semantic web application that provides biological data analysis of protein
sequences.
2. A classical business data warehouse.
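
The two steps attributed to the Data Loader earlier in this section (global repartitioning on a partition key, then breaking each node's data into smaller chunks) can be sketched as follows. This is an illustration under stated assumptions rather than HadoopDB's implementation: the function names, the CRC32-based hash, the number of nodes and the chunk size are all hypothetical choices made for the example.

```python
import zlib
from collections import defaultdict


def global_partition(records, partition_key, num_nodes):
    """Assign each record to a node by hashing its partition key.

    Assumption: CRC32 modulo the node count is used as the hash;
    any deterministic hash would serve the same purpose.
    """
    nodes = defaultdict(list)
    for record in records:
        digest = zlib.crc32(str(record[partition_key]).encode("utf-8"))
        nodes[digest % num_nodes].append(record)
    return dict(nodes)


def local_chunks(node_records, max_chunk_size):
    """Split one node's records into chunks of at most max_chunk_size rows."""
    return [
        node_records[i:i + max_chunk_size]
        for i in range(0, len(node_records), max_chunk_size)
    ]


# Hypothetical input: ten records spread over four customers.
records = [{"customer": f"c{i % 4}", "amount": i} for i in range(10)]

# Step 1: global repartitioning across three nodes on the "customer" key.
placement = global_partition(records, "customer", num_nodes=3)

# Step 2: each node splits its share into smaller chunks.
chunked = {
    node: local_chunks(rows, max_chunk_size=2)
    for node, rows in placement.items()
}
```

Because the node is chosen from the key alone, all records sharing a partition key value land on the same node, which is what allows work on that key to be done inside a single node database.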
Jaql
Jaql [32] is a query language designed for JavaScript Object Notation
(JSON, http://www.json.org/), a data format that has become popular because of its simplicity and
modeling flexibility. JSON is a simple yet flexible way to represent data that
ranges from flat, relational data to semi-structured, XML data. Jaql is primarily