To use the DataFu UDFs, download the datafu.jar file from www.wiley.com/go/microsoftbigdatasolutions and place it in the same directory as the piggybank.jar file. You can now reference the jar file in your script. Define an alias for the Quantile function and provide the quantile values you want to calculate:
REGISTER 'C:\hdp\hadoop\pig-0.11.0.1.3.0.0-0380\datafu-0.0.10.jar';
DEFINE Quantile datafu.pig.stats.Quantile('.10','.90');
Load and group the data:
SpeedData = LOAD '/user/test/traffic.txt' USING PigStorage()
    AS (dtstamp:chararray, sensorid:int, speed:double);
SpeedDataGrouped = GROUP SpeedData ALL;
Pass the sorted data to the Quantile function and dump the results to the command-line console (see Figure 9.11). Using these quantile values, you can then write a script to filter out the outliers. The following script computes the quantiles:
QuantSpeeds = FOREACH SpeedDataGrouped {
    SpeedSorted = ORDER SpeedData BY speed;
    -- Quantile expects sorted input, so pass the sorted bag
    GENERATE Quantile(SpeedSorted.speed);
};
DUMP QuantSpeeds;
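The outlier filter mentioned above could be sketched as follows. This is only an illustration, not the book's script. It assumes that SpeedData, SpeedDataGrouped, and the Quantile alias are defined as shown earlier, and that values outside the 10th to 90th percentile range count as outliers. The names SpeedBounds, SpeedFiltered, p10, and p90 are hypothetical and chosen only for this example.

```pig
-- Sketch only: relies on SpeedData, SpeedDataGrouped, and the Quantile
-- alias (defined with '.10' and '.90') from the statements above.
-- SpeedBounds, SpeedFiltered, p10, and p90 are illustrative names.
SpeedBounds = FOREACH SpeedDataGrouped {
    SpeedSorted = ORDER SpeedData BY speed;
    -- Flatten the tuple returned by Quantile into two named fields
    GENERATE FLATTEN(Quantile(SpeedSorted.speed)) AS (p10:double, p90:double);
};

-- SpeedBounds holds a single row (the data was grouped with ALL),
-- so its fields can be referenced as scalars in the filter.
SpeedFiltered = FILTER SpeedData BY
    speed >= (double)SpeedBounds.p10 AND speed <= (double)SpeedBounds.p90;

DUMP SpeedFiltered;
```

Naming the flattened fields explicitly avoids depending on the field names that the UDF generates, which may differ between DataFu versions.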