Big Data Processing Systems - Cloud Data Management

Database Reference

In-Depth Information

multiple single node database systems (PostgreSQL) using Hadoop as the task

coordinator and network communication layer. Queries are expressed in SQL but

their execution are parallelized across nodes using the MapReduce framework,

however, as much of the single node query work as possible is pushed inside of the

corresponding node databases. Thus, HadoopDB tries to achieve fault tolerance and

the ability to operate in heterogeneous environments by inheriting the scheduling

and job tracking implementation from Hadoop. Parallely, it tries to achieve the

performance of parallel databases by doing most of the query processing inside

the database engine. Figure 9.15 illustrates the architecture of HadoopDB which

consists of two layers: (1) A data storage layer or the Hadoop Distributed File

System (HDFS) [ 26 ]. (2) A data processing layer or the MapReduce Framework.

In this architecture, HDFS is a block-structured file system managed by a central

NameNode . Individual files are broken into blocks of a fixed size and distributed

across multiple DataNodes in the cluster. The NameNode maintains metadata about

the size and location of blocks and their replicas. The MapReduce Framework

follows a simple master-slave architecture. The master is a single JobTracker and

the slaves or worker nodes are Ta s k Tra cke rs .The JobTracker handles the runtime

scheduling of MapReduce jobs and maintains information on each TaskTracker's

load and available resources. The Database Connector is the interface between

independent database systems residing on nodes in the cluster and TaskTrackers.

The Connector connects to the database, executes the SQL query and returns results

as key-value pairs. The Catalog component maintains metadata about the databases,

their location, replica locations and data partitioning properties. The Data Loader

component is responsible for globally repartitioning data on a given partition key

upon loading and breaking apart single node data into multiple smaller partitions

or chunks. The SMS planner extends the HiveQL translator [ 222 ] (Sect. 9.4 ) and

transforms SQL into MapReduce jobs that connect to tables stored as files in HDFS.

Abouzeid et al. [ 59 ] have demonstrated HadoopDB in action running the following

two different application types:

1. A semantic web application that provides biological data analysis of protein

sequences.

2. A classical business data warehouse.

Jaql

Jaql [ 32 ] is a query language which is designed for Javascript Object Notation

(JSON), 4 a data format that has become popular because of its simplicity and

modeling flexibility. JSON is a simple, yet flexible way to represent data that

ranges from flat, relational data to semi-structured, XML data. Jaql is primarily

4 http://www.json.org/ .

Search WWH ::

Custom Search

Home