Database Reference
In-Depth Information
A selection would be coded as:
EmployeeThree = FILTER EmployeeLoad BY EmployeeID == ' 3 ' ;
DUMP
EmployeeThree;
Aggregation is also supported in Pig through the GROUP BY operation.
Assume we have the EmployeeLoad relation already loaded, and we want to
compute aggregates from these data. For this, we have to group the rows into
buckets. Over these grouped data, we can then run the aggregate functions.
For example, we can group employee data by FirstName :
byFirstName = GROUP EmployeeLoad BY FirstName;
The result of this operation is a new relation with two columns: one named
group and the other one with the name of the original relation. The former
contains the schema of the group, in our case a column of CHARARRAY type
containing all first names in the original table. This column can be directly
accessed as group.FirstName . The second column has the name of the original
relation and contains a bag of all the rows in such relation that match the
corresponding group, that is, the rows corresponding to employees with the
same first name.
The results can be then processed using the classic aggregate functions,
for example, COUNT ,andthe FOREACH operator, which performs a loop
over each bag, as follows:
FirstNameCount = FOREACH byFirstName GENERATE
GROUP AS FirstName
COUNT(EmployeeLoad),
We conclude with an example of a join operation. We want to join the files
storing orders and employees. The join must be performed on two attributes,
the ID of the employee and the postal code, in order to obtain the employees
that handled orders shipped to the place where they live. Finally, a projection
is performed:
Employees = LOAD ' /user/northwind/Employees.txt ' AS
(EmployeeID, LastName, ... , PhotoPath);
Orders = LOAD ' ./northwind/Orders.txt ' AS
(OrderID, CustomerID, EmployeeID, ... , ShipCountry);
Joined = JOIN Employees BY (EmployeeID, PostalCode),
Orders BY (EmployeeID, ShipPostalCode);
Projected = FOR EACH Joined GENERATE
Employees::EmployeeID, Employees::PostalCode,
Orders::CustomerID;
DUMP
Projected;
The first two statements load the two files into two variables, Employees and
Orders .The JOIN BY statement performs the join, similarly to SQL, and
GENERATE performs the projection.
The left and right outer joins are performed in a similar way, adding the
keywords LEFT OUTER and RIGHT OUTER , respectively, after the JOIN BY
Search WWH ::




Custom Search