A more complex UDAF
The previous example is unusual in that a partial aggregation can be represented using the
same type (IntWritable) as the final result. This is not generally the case for more
complex aggregate functions, as can be seen by considering a UDAF for calculating the
mean (average) of a collection of double values. Partial means cannot, on their own, be
combined into a final mean value, because the number of values behind each partial mean
is lost (see Combiner Functions). Instead, we can represent the partial aggregation as a
pair of numbers: the cumulative sum of the double values processed so far, and the number
of values.
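To make the point concrete, here is a minimal sketch in plain Java (no Hive classes) that compares averaging two partial means with combining two (sum, count) pairs. The class name PartialMeanDemo, its methods, and the sample values are hypothetical and chosen only for illustration.

```java
public class PartialMeanDemo {

    // Sum of all values in the array.
    static double sum(double[] values) {
        double total = 0;
        for (double v : values) {
            total += v;
        }
        return total;
    }

    // Mean of a non-empty array.
    static double mean(double[] values) {
        return sum(values) / values.length;
    }

    // Naive approach: average the two partial means, ignoring how many
    // values each one summarizes.
    static double naiveCombine(double[] a, double[] b) {
        return (mean(a) + mean(b)) / 2;
    }

    // Sum-and-count approach: add the partial sums and the partial counts,
    // then divide once at the end.
    static double combineWithCounts(double[] a, double[] b) {
        double totalSum = sum(a) + sum(b);
        long totalCount = a.length + b.length;
        return totalSum / totalCount;
    }

    public static void main(String[] args) {
        double[] first = {1.0, 2.0, 3.0}; // partial mean 2.0 over 3 values
        double[] second = {10.0};         // partial mean 10.0 over 1 value

        System.out.println(naiveCombine(first, second));      // prints 6.0
        System.out.println(combineWithCounts(first, second)); // prints 4.0
    }
}
```

The true mean of 1, 2, 3, and 10 is 4.0, which only the sum-and-count version recovers; the naive version gives the single value in the second group as much weight as the three values in the first.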
This idea is implemented in the UDAF shown in Example 17-4. Notice that the partial
aggregation is implemented as a “struct” nested static class, called PartialResult,
which Hive is intelligent enough to serialize and deserialize, since we are using field types
that Hive can handle (Java primitives in this case).
In this example, the merge() method is different from iterate() because it combines
the partial sums and partial counts by pairwise addition. In addition, the return type
of terminatePartial() is PartialResult (which, of course, is never seen by the user
calling the function), whereas the return type of terminate() is DoubleWritable, the
final result seen by the user.
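The division of labor between these methods can be sketched without any Hive dependencies. The following is an illustrative stand-in, not the book's listing: the class name MeanSketch is hypothetical, it uses java.lang.Double where the Hive version uses DoubleWritable, and the handling of null inputs (skip nulls, return null when nothing was aggregated) is an assumption.

```java
public class MeanSketch {

    // Partial aggregation: running sum and count of the values seen so far.
    public static class PartialResult {
        double sum;
        long count;
    }

    private PartialResult partial;

    // Reset the evaluator so it can start a new aggregation.
    public void init() {
        partial = null;
    }

    // Fold one raw input value into the partial aggregation.
    // Assumption: null inputs are skipped.
    public boolean iterate(Double value) {
        if (value == null) {
            return true;
        }
        if (partial == null) {
            partial = new PartialResult();
        }
        partial.sum += value;
        partial.count++;
        return true;
    }

    // Hand back the intermediate state; its type differs from the final result.
    public PartialResult terminatePartial() {
        return partial;
    }

    // Fold another evaluator's partial result into this one by pairwise
    // addition of sums and counts.
    public boolean merge(PartialResult other) {
        if (other == null) {
            return true;
        }
        if (partial == null) {
            partial = new PartialResult();
        }
        partial.sum += other.sum;
        partial.count += other.count;
        return true;
    }

    // Produce the final mean. Assumption: null when no values were seen.
    public Double terminate() {
        if (partial == null) {
            return null;
        }
        return partial.sum / partial.count;
    }

    public static void main(String[] args) {
        MeanSketch a = new MeanSketch();
        a.init();
        a.iterate(1.0);
        a.iterate(2.0);
        a.iterate(3.0);

        MeanSketch b = new MeanSketch();
        b.init();
        b.iterate(10.0);

        // A third evaluator merges the two partial results.
        MeanSketch combined = new MeanSketch();
        combined.init();
        combined.merge(a.terminatePartial());
        combined.merge(b.terminatePartial());

        System.out.println(combined.terminate()); // prints 4.0
    }
}
```

Here iterate() consumes one raw value at a time, while merge() consumes an already aggregated (sum, count) pair, which is why the two methods cannot share an implementation the way they can when the partial and final types coincide.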
Example 17-4. A UDAF for calculating the mean of a collection of doubles
package com.hadoopbook.hive;

import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;
import org.apache.hadoop.hive.serde2.io.DoubleWritable;

public class Mean extends UDAF {

  public static class MeanDoubleUDAFEvaluator implements UDAFEvaluator {

    public static class PartialResult {
      double sum;
      long count;
    }

    private PartialResult partial;

    public void init() {
      partial = null;
    }

    public boolean iterate(DoubleWritable value) {
      if (value == null) {