of each product. The CLUSTER BY clause tells Hive to distribute the Map
output (mapOut) to the reducers by hashing on ProductID.
13.2.2 Pig Latin
Pig is a high-level data flow language for querying data stored on HDFS. It
was developed at Yahoo! Research and then moved to the Apache Software
Foundation. There are three different ways to run Pig: (a) as a script, just by
passing the name of the script file to the Pig command; (b) using the grunt
command line; and (c) calling Pig from Java in its embedded form. A Pig
Latin program is a collection of statements, which can either be an operation
or a command. For example, the LOAD operation with a file name as an
argument loads data from a file. A command could be an HDFS command
used directly within Pig Latin, such as the ls command to list all files in the
current directory. The execution of a statement does not necessarily result in
a job running on the Hadoop cluster.
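For illustration, a minimal sketch of the first two ways of running Pig is shown below; the script and file names are hypothetical. The LOAD statement is an operation and does not launch a Hadoop job by itself, whereas DUMP forces evaluation and does; ls is an HDFS command executed directly from the grunt shell.
$ pig myscript.pig
$ pig
grunt> ls
grunt> Employees = LOAD 'Employees';
grunt> DUMP Employees;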
Pig does not require schema information, which makes it suitable for
unstructured data. If a schema of the data is available, Pig will make use of it,
both for error checking and optimization. However, if no schema is available,
Pig will still process the data, making the best guesses it can. Pig data
types are of two kinds. Scalar types are the usual data types, such as INTEGER,
LONG, FLOAT, and CHARARRAY. On the other hand, three kinds of complex
types are supported in Pig, namely, TUPLE, BAG, and MAP, where the latter
is a set of key-value pairs. For example, depending on schema availability, we
can load employee data in several ways as follows:
Employees = LOAD 'Employees' AS (Name:chararray, City:chararray, Age:int);
Employees = LOAD 'Employees' AS (Name, City, Age);
Employees = LOAD 'Employees';
corresponding, respectively, to an explicit schema with data types, an explicit
schema without data types, and no schema at all.
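To illustrate the complex types, the following sketch loads orders where each order carries a BAG of TUPLEs with its line items and a MAP of additional properties; the file and field names are hypothetical:
Orders = LOAD 'Orders' AS (OrderID:int,
    Details:bag{T:tuple(ProductID:int, Quantity:int)},
    Properties:map[]);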
As an example, we show how relational algebra operations can be
implemented in Pig, using the Northwind database of Fig. 2.4. We start
with the projection. Suppose the Employees table has been exported into
the text file Employees.txt:
EmployeeLoad = LOAD '/user/northwind/Employees.txt' AS
    (EmployeeID, LastName, FirstName, Title, ..., PhotoPath);
EmployeeData = FOREACH EmployeeLoad GENERATE
    EmployeeID, LastName, FirstName;
DUMP EmployeeData;
STORE EmployeeData INTO '/home/results/projected';
Most of the steps are self-explanatory. The FOREACH ... GENERATE statement
projects the first three attributes of the file Employees.txt, which is loaded
into the relation EmployeeLoad.
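If the data had been loaded without a schema, the same projection could be written with positional references, as in the following sketch, where $0, $1, and $2 denote the first three fields of each tuple:
EmployeeLoad = LOAD '/user/northwind/Employees.txt';
EmployeeData = FOREACH EmployeeLoad GENERATE $0, $1, $2;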