13.2 High-Level Languages for Hadoop
Using Hadoop is not easy for end users unfamiliar with MapReduce, who
must write MapReduce code even for simple tasks such as counting or
averaging. A solution to this problem is to use high-level languages, which
allow programmers to work at a higher level of abstraction than in Java or
the other lower-level languages supported by Hadoop. The most widely used
of these languages are Hive and Pig Latin. Both are translated into
MapReduce jobs, resulting in programs that are much smaller than their
Java equivalents. Moreover, these languages can be extended, for example,
by writing user-defined functions in Java. The reverse is also possible:
programs written in these high-level languages can be embedded in other
languages.
13.2.1 Hive
Hive, developed at Facebook, brings the concepts of tables, columns,
partitions, and SQL to the Hadoop architecture, keeping the extensibility
and flexibility of Hadoop. Hive organizes data into tables and partitions. As
in relational systems, partitions can be defined, for example, according to
time intervals, allowing Hive to prune data while processing a query. In addition, Hive
provides an SQL dialect called Hive Query Language (HiveQL) for querying
data stored in a Hadoop cluster. HiveQL is not only a query language but also
a data definition and manipulation language. The data definition language
is used to create, alter, and delete databases, tables, views, functions, and
indexes. The data manipulation language is used to insert, update, and delete
at the table level; these operations are not supported at the row level.
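As an illustration, the HiveQL statements below sketch these definition and manipulation operations; the database, table, and attribute names are hypothetical:

```sql
-- Hypothetical example: a table of orders partitioned by month.
CREATE DATABASE IF NOT EXISTS Sales;

CREATE TABLE Sales.Orders (
   OrderID INT, CustomerID STRING, Amount DOUBLE )
PARTITIONED BY (OrderMonth STRING);

-- Insertion works at the table or partition level, not at the row
-- level: a query result populates a whole partition at once.
INSERT OVERWRITE TABLE Sales.Orders PARTITION (OrderMonth = '2012-01')
SELECT OrderID, CustomerID, Amount
FROM StagingOrders
WHERE Month = '2012-01';

-- A query restricted to one partition lets Hive prune all the others.
SELECT COUNT(*)
FROM Sales.Orders
WHERE OrderMonth = '2012-01';
```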
The Hive data model includes primitive data types such as BOOLEAN and
INT, and collection data types such as STRUCT, MAP, and ARRAY. Collection
data types allow, for example, many-to-many relationships to be represented
without foreign key relationships between tables. On the other hand, they
introduce data duplication and do not enforce referential integrity. As an
example, we show below a simplified representation of the table Employees from
the Northwind database in Fig. 2.4, where the attributes composing a full
address are stored in a STRUCT and the Territories attribute is an ARRAY
that contains the set of territory names to which the employee is related.
Hive has no control over how data are stored and supports different file and
record formats. The table schema is applied while the data are read from
storage, implementing what is known as schema on read. The example below
includes the file format definition (TEXTFILE in this case) and the delimiter
characters needed to parse each record:
CREATE TABLE Employees (
   EmployeeID INT, Name STRING,
   Address STRUCT<Street:STRING, City:STRING,
      Region:STRING, PostalCode:STRING, Country:STRING>,
   Territories ARRAY<STRING> )
ROW FORMAT DELIMITED
   FIELDS TERMINATED BY ','
   COLLECTION ITEMS TERMINATED BY '|'
STORED AS TEXTFILE;
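Once such a table is defined, HiveQL can query the collection attributes directly; the sketch below assumes the Address STRUCT and Territories ARRAY described above:

```sql
-- STRUCT fields are accessed with dot notation, ARRAY elements by index.
SELECT Name, Address.City, Territories[0]
FROM Employees;

-- explode() turns each element of the Territories array into its own row,
-- producing one (employee, territory) pair per array element.
SELECT Name, Territory
FROM Employees LATERAL VIEW explode(Territories) t AS Territory;
```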