New Data Warehouse Technologies - Data Warehouse Systems: Design and Implementation

Database Reference

In-Depth Information

Address STRUCT < Street:STRING, City:STRING,

Region:STRING, PostalCode:STRING, Country:STRING > ,

Territories ARRAY < STRING > )

ROW FORMAT

DELIMITED FIELDS TERMINATED BY ' , '

COLLECTION ITEMS TERMINATED BY '

|

'

LINES TERMINATED BY '

\

n '

STORED AS TEXTFILE;

HiveQL allows to implement the typical relational operations. The query

below performs a projection over the Name and City attributes:

SELECT Name, Address.City

FROM Employees;

A selection operation that obtains employees related to more than four

territories is expressed as follows:

SELECT *

FROM Employees

WHERE Size(Territories) > 4;

HiveQL supports different join operations, such as INNER JOIN , OUTER

JOIN ,and LEFT SEMI JOIN , among others. Below, we join tables Employees

and Orders :

SELECT *

FROM Employees E JOIN Orders O ON E.EmployeeID = O.EmployeeID

HiveQL also supports other SQL-like clauses, for example, GROUP BY ,

HAVING ,and ORDER BY .

Hive also supports computations that go beyond SQL-like languages,

for example, generating machine learning models. For this, Hive provides

language constructs that allow users to plug in their own transformation

scripts in an SQL statement. This is done through the MAP , REDUCE ,

TRANSFORM , DISTRIBUTE BY , SORT BY ,and CLUSTER BY keywords

in the SQL extensions. As an example, we show how we can write a Hive

program to count the occurrences of products in an input file, like in the

example of Fig. 13.1 . This is a variant of the typical word count example:

CREATE TABLE Products (Content STRING);

FROM

(MAP Products.Content

USING ' tokenizerScript ' AS ProductID, Count

FROM Products

CLUSTER BY ProductID) mapOut

REDUCE mapOut.ProductID, mapOut.Count

USING ' countScript ' AS ProductID, Count;

The scripts tokenizerScript and countScript can be implemented in any

language, like Python or Java. The former script produces a tuple for each

new product in the input; the latter script counts the number of occurrences

Data Warehouse Systems: Design and Implementation

Search WWH ::

Custom Search

Home