Hive - Hadoop: The Definitive Guide


An evaluator must implement five methods, described in turn here (the flow is illustrated in Figure 17-3):

init()

The init() method initializes the evaluator and resets its internal state. In MaximumIntUDAFEvaluator, we set the IntWritable object holding the final result to null. We use null to indicate that no values have been aggregated yet, which has the desirable effect of making the maximum value of an empty set NULL.

iterate()

The iterate() method is called every time there is a new value to be aggregated. The evaluator should update its internal state with the result of performing the aggregation. The arguments that iterate() takes correspond to those in the Hive function from which it was called. In this example, there is only one argument. The value is first checked to see whether it is null, and if it is, it is ignored. Otherwise, the result instance variable is set either to value's integer value (if this is the first value that has been seen) or to the larger of the current result and value (if one or more values have already been seen). We return true to indicate that the input value was valid.

terminatePartial()

The terminatePartial() method is called when Hive wants a result for the partial aggregation. The method must return an object that encapsulates the state of the aggregation. In this case, an IntWritable suffices because it encapsulates either the maximum value seen or null if no values have been processed.

merge()

The merge() method is called when Hive decides to combine one partial aggregation with another. The method takes a single object, whose type must correspond to the return type of the terminatePartial() method. In this example, the merge() method can simply delegate to the iterate() method because the partial aggregation is represented in the same way as a value being aggregated. This is not generally the case (we'll see a more general example later); in general, the method should implement the logic to combine the evaluator's state with the state of the partial aggregation.
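To make the last point concrete, here is a minimal sketch of an aggregate whose merge() cannot delegate to iterate(). It is an illustration only, not the book's listing: the class names MeanSketch and PartialResult are hypothetical, a mean is chosen simply because its partial state (a sum and a count) differs from its input values, and plain Java types stand in for Hive's Writable types so the block compiles without Hive or Hadoop on the classpath.

```java
// Hypothetical, dependency-free sketch of a mean evaluator.
// Plain Java types stand in for Hive's Writable types (an assumption
// made so the example is self-contained).
class MeanSketch {

    // The partial aggregation state: not the same shape as an input value.
    static class PartialResult {
        double sum;
        long count;
    }

    private PartialResult partial;

    void init() {
        partial = null; // null means no values have been aggregated yet
    }

    boolean iterate(Double value) {
        if (value == null) {
            return true; // NULL inputs are ignored
        }
        if (partial == null) {
            partial = new PartialResult();
        }
        partial.sum += value;
        partial.count++;
        return true;
    }

    PartialResult terminatePartial() {
        return partial;
    }

    // merge() takes the type returned by terminatePartial() and combines
    // the two states field by field; it cannot reuse iterate() here.
    boolean merge(PartialResult other) {
        if (other == null) {
            return true; // the other side saw no values
        }
        if (partial == null) {
            partial = new PartialResult();
        }
        partial.sum += other.sum;
        partial.count += other.count;
        return true;
    }

    Double terminate() {
        if (partial == null) {
            return null; // mean of an empty set is NULL
        }
        return partial.sum / partial.count;
    }
}
```

Averaging the partial means directly would weight each partial aggregation equally regardless of how many values it saw, which is why the sketch carries the count alongside the sum.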

terminate()

The terminate() method is called when the final result of the aggregation is needed. The evaluator should return its state as a value. In this case, we return the result instance variable.
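The five methods and the order in which they are called can be sketched as follows. This is a simplified stand-in, not Hive's actual API: the names MaxEvaluatorSketch and MaxFlowDemo are hypothetical, a boxed Integer replaces IntWritable, and the two helper methods only imitate how partial aggregations might be produced and then combined.

```java
// Hypothetical, dependency-free stand-in for MaximumIntUDAFEvaluator.
// Integer replaces IntWritable (an assumption for self-containment).
class MaxEvaluatorSketch {
    private Integer result;

    void init() {
        result = null; // no values aggregated yet
    }

    boolean iterate(Integer value) {
        if (value == null) {
            return true; // NULL inputs are ignored
        }
        if (result == null) {
            result = value; // first value seen
        } else {
            result = Math.max(result, value);
        }
        return true;
    }

    Integer terminatePartial() {
        return result;
    }

    // The partial state has the same form as an input value,
    // so merging can delegate to iterate().
    boolean merge(Integer other) {
        return iterate(other);
    }

    Integer terminate() {
        return result;
    }
}

class MaxFlowDemo {
    // Imitates one partial aggregation: init, iterate, terminatePartial.
    static Integer aggregateSplit(Integer... values) {
        MaxEvaluatorSketch evaluator = new MaxEvaluatorSketch();
        evaluator.init();
        for (Integer v : values) {
            evaluator.iterate(v);
        }
        return evaluator.terminatePartial();
    }

    // Imitates the final aggregation: init, merge, terminate.
    static Integer combine(Integer... partials) {
        MaxEvaluatorSketch evaluator = new MaxEvaluatorSketch();
        evaluator.init();
        for (Integer p : partials) {
            evaluator.merge(p);
        }
        return evaluator.terminate();
    }

    public static void main(String[] args) {
        Integer p1 = aggregateSplit(3, null, 7);
        Integer p2 = aggregateSplit(5);
        Integer p3 = aggregateSplit(); // empty input yields a null partial
        System.out.println(combine(p1, p2, p3)); // prints 7
    }
}
```

Because a null partial result is ignored by merge() in the same way a null value is ignored by iterate(), an empty partial aggregation does not disturb the final maximum.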
