Database Reference
In-Depth Information
A selection would be coded as:
EmployeeThree = FILTER EmployeeLoad BY EmployeeID ==
'
3
'
;
DUMP
EmployeeThree;
Aggregation is also supported in Pig through the
GROUP BY
operation.
Assume we have the
EmployeeLoad
relation already loaded, and we want to
compute aggregates from these data. For this, we have to group the rows into
buckets. Over these grouped data, we can then run the aggregate functions.
For example, we can group employee data by
FirstName
:
byFirstName = GROUP EmployeeLoad BY FirstName;
The result of this operation is a new relation with two columns: one named
group
and the other one with the name of the original relation. The former
contains the schema of the group, in our case a column of
CHARARRAY
type
containing all first names in the original table. This column can be directly
accessed as
group.FirstName
. The second column has the name of the original
relation and contains a bag of all the rows in such relation that match the
corresponding group, that is, the rows corresponding to employees with the
same first name.
The results can be then processed using the classic aggregate functions,
for example,
COUNT
,andthe
FOREACH
operator, which performs a loop
over each bag, as follows:
FirstNameCount = FOREACH byFirstName GENERATE
GROUP AS FirstName
COUNT(EmployeeLoad),
We conclude with an example of a join operation. We want to join the files
storing orders and employees. The join must be performed on two attributes,
the ID of the employee and the postal code, in order to obtain the employees
that handled orders shipped to the place where they live. Finally, a projection
is performed:
Employees = LOAD
'
/user/northwind/Employees.txt
'
AS
(EmployeeID, LastName, ... , PhotoPath);
Orders = LOAD
'
./northwind/Orders.txt
'
AS
(OrderID, CustomerID, EmployeeID, ... , ShipCountry);
Joined = JOIN Employees BY (EmployeeID, PostalCode),
Orders BY (EmployeeID, ShipPostalCode);
Projected = FOR EACH Joined GENERATE
Employees::EmployeeID, Employees::PostalCode,
Orders::CustomerID;
DUMP
Projected;
The first two statements load the two files into two variables,
Employees
and
Orders
.The
JOIN BY
statement performs the join, similarly to SQL, and
GENERATE
performs the projection.
The left and right outer joins are performed in a similar way, adding the
keywords
LEFT OUTER
and
RIGHT OUTER
, respectively, after the
JOIN BY
Search WWH ::
Custom Search