A more complex UDAF
The previous example is unusual in that a partial aggregation can be represented using the same type (IntWritable) as the final result. This is not generally the case for more complex aggregate functions, as can be seen by considering a UDAF for calculating the mean (average) of a collection of double values. Partial means alone cannot be combined into a final mean value, because the number of values behind each partial mean is lost (see Combiner Functions). Instead, we can represent the partial aggregation as a pair of numbers: the cumulative sum of the double values processed so far, and the number of values.
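To make the point concrete, the following standalone sketch compares the two approaches on a small, made-up input. It is plain Java with no Hive dependency, and the names PartialMeanDemo and Partial are hypothetical, chosen only for this illustration. It is not the UDAF itself.

```java
public class PartialMeanDemo {

    /** Hypothetical partial aggregate: the sum and count of the values seen so far. */
    public static final class Partial {
        public final double sum;
        public final long count;

        public Partial(double sum, long count) {
            this.sum = sum;
            this.count = count;
        }

        /** Builds a partial aggregate from one chunk of input values. */
        public static Partial of(double... values) {
            double s = 0;
            for (double v : values) {
                s += v;
            }
            return new Partial(s, values.length);
        }

        /** Combines two partial aggregates by pairwise addition. */
        public Partial merge(Partial other) {
            return new Partial(sum + other.sum, count + other.count);
        }

        /** Mean of the values covered by this partial aggregate. */
        public double mean() {
            return sum / count;
        }
    }

    public static void main(String[] args) {
        // Two chunks of unequal size (illustrative data only).
        Partial a = Partial.of(1.0, 2.0, 3.0); // mean 2.0
        Partial b = Partial.of(10.0);          // mean 10.0

        // Averaging the partial means ignores how many values each one covers.
        double meanOfMeans = (a.mean() + b.mean()) / 2;

        // Merging the (sum, count) pairs recovers the mean of all four values.
        double mergedMean = a.merge(b).mean();

        System.out.println("mean of means: " + meanOfMeans); // 6.0
        System.out.println("merged mean:   " + mergedMean);  // 4.0
    }
}
```

The true mean of 1, 2, 3, and 10 is 4.0, which only the merged (sum, count) pair reproduces. The two results would agree only if every chunk held the same number of values.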
This idea is implemented in the UDAF shown in Example 17-4. Notice that the partial aggregation is implemented as a “struct” nested static class, called PartialResult, which Hive is intelligent enough to serialize and deserialize, since we are using field types that Hive can handle (Java primitives in this case).
In this example, the merge() method is different from iterate() because it combines the partial sums and partial counts by pairwise addition. In addition, the return type of terminatePartial() is PartialResult (which, of course, is never seen by the user calling the function), whereas the return type of terminate() is DoubleWritable, the final result seen by the user.
Example 17-4. A UDAF for calculating the mean of a collection of doubles
package com.hadoopbook.hive;

import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;
import org.apache.hadoop.hive.serde2.io.DoubleWritable;

public class Mean extends UDAF {

  public static class MeanDoubleUDAFEvaluator implements UDAFEvaluator {
    public static class PartialResult {
      double sum;
      long count;
    }

    private PartialResult partial;

    public void init() {
      partial = null;
    }

    public boolean iterate(DoubleWritable value) {
      if (value == null) {