13.2 High-Level Languages for Hadoop
Using Hadoop is not easy for end users unfamiliar with MapReduce, who
must write MapReduce code even for simple tasks such as counting or
averaging. A solution to this problem is to use high-level languages, which
allow programmers to work at a higher level of abstraction than in Java or
the other lower-level languages supported by Hadoop. The most widely used
of these languages are Hive and Pig Latin. Both are translated into
MapReduce jobs, resulting in programs that are much smaller than their
Java equivalents. Moreover, these languages can be extended, for example,
by writing user-defined functions in Java. The reverse is also possible:
programs written in these high-level languages can be embedded in other
languages.
13.2.1 Hive
Hive, developed at Facebook, brings the concepts of tables, columns,
partitions, and SQL to the Hadoop architecture, keeping the extensibility
and flexibility of Hadoop. Hive organizes data into tables and partitions. As
in relational systems, partitions can be defined, for example, according to
time intervals, allowing Hive to prune data while processing a query. In addition, Hive
provides an SQL dialect called Hive Query Language (HiveQL) for querying
data stored in a Hadoop cluster. HiveQL is not only a query language but also
a data definition and manipulation language. The data definition language
is used to create, alter, and delete databases, tables, views, functions, and
indexes. The data manipulation language is used to insert, update, and delete
at the table level; these operations are not supported at the row level.
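As an illustration, the HiveQL statements below sketch these definition and manipulation operations; the database, table, and attribute names are hypothetical:

```sql
-- Hypothetical example: a table of orders partitioned by month.
CREATE DATABASE IF NOT EXISTS Sales;

CREATE TABLE Sales.Orders (
   OrderID INT, CustomerID STRING, Amount DOUBLE )
PARTITIONED BY (OrderMonth STRING);

-- Insertion works at the table or partition level, not at the row
-- level: a query result populates a whole partition at once.
INSERT OVERWRITE TABLE Sales.Orders PARTITION (OrderMonth = '2012-01')
SELECT OrderID, CustomerID, Amount
FROM StagingOrders
WHERE Month = '2012-01';

-- A query restricted to one partition lets Hive prune all the others.
SELECT COUNT(*)
FROM Sales.Orders
WHERE OrderMonth = '2012-01';
```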
The Hive data model includes primitive data types such as BOOLEAN and
INT, and collection data types such as STRUCT, MAP, and ARRAY. Collection
data types allow, for example, many-to-many relationships to be represented
without foreign key relationships between tables. On the other hand, they
introduce data duplication and do not enforce referential integrity. As an
example, we show below a simplified representation of the table Employees from
the Northwind database in Fig. 2.4, where the attributes composing a full
address are stored in a STRUCT and the Territories attribute is an ARRAY
that contains the set of territory names to which the employee is related.
Hive has no control over how data are stored and supports different file and
record formats. The table schema is applied while the data are read from
storage, implementing what is known as schema on read. The example below
includes the file format definition (TEXTFILE in this case) and the delimiter
characters needed to parse each record:
CREATE TABLE Employees (
   EmployeeID INT, Name STRING,
   Address STRUCT<Street:STRING, City:STRING,
      Region:STRING, PostalCode:STRING, Country:STRING>,
   Territories ARRAY<STRING> )
ROW FORMAT DELIMITED
   FIELDS TERMINATED BY ','
   COLLECTION ITEMS TERMINATED BY '|'
STORED AS TEXTFILE;
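Once such a table is defined, HiveQL can query the collection attributes directly; the sketch below assumes the Address STRUCT and Territories ARRAY described above:

```sql
-- STRUCT fields are accessed with dot notation, ARRAY elements by index.
SELECT Name, Address.City, Territories[0]
FROM Employees;

-- explode() turns each element of the Territories array into its own row,
-- producing one (employee, territory) pair per array element.
SELECT Name, Territory
FROM Employees LATERAL VIEW explode(Territories) t AS Territory;
```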