Database Reference
In-Depth Information
User-Defined Functions
Sometimes the query you want to write can't be expressed easily (or at all) using the built-
in functions that Hive provides. By allowing you to write a
user-defined function
(UDF),
Hive makes it easy to plug in your own processing code and invoke it from a Hive query.
UDFs have to be written in Java, the language that Hive itself is written in. For other lan-
guages, consider using a
SELECT TRANSFORM
query, which allows you to stream data
through a user-defined script (
MapReduce Scripts
).
There are three types of UDF in Hive: (regular) UDFs, user-defined aggregate functions
(UDAFs), and user-defined table-generating functions (UDTFs). They differ in the number
of rows that they accept as input and produce as output:
▪ A UDF operates on a single row and produces a single row as its output. Most
functions, such as mathematical functions and string functions, are of this type.
▪ A UDAF works on multiple input rows and creates a single output row. Aggregate
functions include such functions as
COUNT
and
MAX
.
▪ A UDTF operates on a single row and produces multiple rows — a table — as out-
put.
Table-generating functions are less well known than the other two types, so let's look at an
example. Consider a table with a single column,
x
, which contains arrays of strings. It's in-
structive to take a slight detour to see how the table is defined and populated:
CREATE TABLE arrays (x ARRAY<STRING>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002';
Notice that the
ROW FORMAT
clause specifies that the entries in the array are delimited by
Ctrl-B characters. The example file that we are going to load has the following contents,
where
^B
is a representation of the Ctrl-B character to make it suitable for printing:
a^Bb
c^Bd^Be
After running a
LOAD DATA
command, the following query confirms that the data was
loaded correctly:
hive>
SELECT * FROM arrays;
["a","b"]
["c","d","e"]