of each product. The CLUSTER BY clause tells Hive to distribute the Map
output (mapOut) to the reducers by hashing on ProductID.
13.2.2 Pig Latin
Pig is a high-level data flow language for querying data stored on HDFS. It
was developed at Yahoo! Research and then moved to the Apache Software
Foundation. There are three different ways to run Pig: (a) as a script, just by
passing the name of the script file to the Pig command; (b) using the grunt
command line; and (c) calling Pig from Java in its embedded form. A Pig
Latin program is a collection of statements, which can either be an operation
or a command. For example, the LOAD operation with a file name as an
argument loads data from a file. A command could be an HDFS command
used directly within Pig Latin, such as the ls command to list all files in the
current directory. The execution of a statement does not necessarily result in
a job running on the Hadoop cluster.
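For illustration, a minimal sketch of the first two ways of running Pig is shown below; the script and file names are hypothetical. The LOAD statement is an operation and does not launch a Hadoop job by itself, whereas DUMP forces evaluation and does; ls is an HDFS command executed directly from the grunt shell.
$ pig myscript.pig
$ pig
grunt> ls
grunt> Employees = LOAD 'Employees';
grunt> DUMP Employees;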
Pig does not require schema information, which makes it suitable for
unstructured data. If a schema of the data is available, Pig will make use of it,
both for error checking and optimization. However, if no schema is available,
Pig will still process the data, making the best guesses it can. Pig data
types are of two kinds. Scalar types are the usual data types, such as INTEGER,
LONG, FLOAT, and CHARARRAY. On the other hand, three kinds of complex
types are supported in Pig, namely, TUPLE, BAG, and MAP, where the latter
is a set of key-value pairs. For example, depending on schema availability, we
can load employee data in several ways as follows:
Employees = LOAD 'Employees' AS (Name:chararray, City:chararray, Age:int);
Employees = LOAD 'Employees' AS (Name, City, Age);
Employees = LOAD 'Employees';
corresponding, respectively, to an explicit schema with data types, an explicit
schema without data types, and no schema at all.
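To illustrate the complex types, the following sketch loads orders where each order carries a BAG of TUPLEs with its line items and a MAP of additional properties; the file and field names are hypothetical:
Orders = LOAD 'Orders' AS (OrderID:int,
    Details:bag{T:tuple(ProductID:int, Quantity:int)},
    Properties:map[]);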
As an example, we show how relational algebra operations can be
implemented in Pig, using the Northwind database of Fig. 2.4. We start
with the projection. Suppose the Employees table has been exported into
the text file Employees.txt:
EmployeeLoad = LOAD '/user/northwind/Employees.txt' AS
    (EmployeeID, LastName, FirstName, Title, ..., PhotoPath);
EmployeeData = FOREACH EmployeeLoad GENERATE
    EmployeeID, LastName, FirstName;
DUMP EmployeeData;
STORE EmployeeData INTO '/home/results/projected';
Most of the steps are self-explanatory. The FOREACH ... GENERATE statement
projects the first three attributes of the file Employees.txt, which is loaded
into the relation EmployeeLoad.
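If the data had been loaded without a schema, the same projection could be written with positional references, as in the following sketch, where $0, $1, and $2 denote the first three fields of each tuple:
EmployeeLoad = LOAD '/user/northwind/Employees.txt';
EmployeeData = FOREACH EmployeeLoad GENERATE $0, $1, $2;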