Hive UDFs
Hive lets you create user-defined functions (UDFs), with which you can extend and customize its functionality. To demonstrate the process, I create a simple date-conversion function.
Suppose that the date columns in the CSV data that was collected have the wrong format for Hive; specifically, the dates follow the format dd/MM/yyyy, whereas they need to be in the format yyyy-MM-dd. Using Java date methods, I can create a simple Java-based UDF to change the date format. Once compiled and packaged into a library (a JAR file), this function can be embedded in Hive QL statements.
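A minimal sketch of what such a UDF might look like follows; it assumes the classic org.apache.hadoop.hive.ql.exec.UDF base class with a single evaluate method, and the package and class names simply anticipate the build.sbt settings shown later in this section, so treat the body as an illustration rather than the exact code used here:

package nz.co.semtechsolutions;

import java.text.ParseException;
import java.text.SimpleDateFormat;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Simple Hive UDF that converts a date string from dd/MM/yyyy to yyyy-MM-dd
public class DateConv extends UDF {

    public Text evaluate(final Text input) {
        if (input == null) {
            return null;
        }
        SimpleDateFormat inFormat  = new SimpleDateFormat("dd/MM/yyyy");
        SimpleDateFormat outFormat = new SimpleDateFormat("yyyy-MM-dd");
        try {
            // Parse the incoming date and re-format it in the style Hive expects
            return new Text(outFormat.format(inFormat.parse(input.toString())));
        } catch (ParseException e) {
            // Return null for any value that does not match the expected format
            return null;
        }
    }
}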
To create a Hive UDF, I must install the Scala sbt interactive build tool on the server to compile the Java UDF
package. I create the example UDF on the server hc1nn, so I need to install the sbt program on that server by using the
Linux root account. I download an rpm package for sbt from the scala-sbt.org website to the /tmp directory on hc1nn,
and I install it from there. The following commands move to the /tmp directory and download the sbt.rpm package by using wget:
[root@hc1nn ~]# cd /tmp
[root@hc1nn ~]# wget http://repo.scala-sbt.org/scalasbt/sbt-native-packages/org/scala-sbt/sbt/0.13.1/sbt.rpm
[root@hc1nn ~]# rpm -ivh sbt.rpm
The final command, rpm, installs the sbt.rpm package with the options i for install, v for verbose output, and h for printing hash marks to show progress. I also install the Java OpenJDK 1.6 development package, as the root Linux user, to support this compilation (because I want access to tools like jar and jps). I use the OpenJDK because I can install it via the yum command, so I don't have to go through a registration process to get it.
[root@hc2nn ~]# yum install java-1.6.0-openjdk-devel
I compile the new Hive UDF as the Linux hadoop user, so I use su to switch to that account:
[root@hc2nn ~]# su - hadoop
Next, I need a directory structure to hold the UDF code, so I create the directories hive/udf under the hadoop account's home directory and then move into the new udf directory:
[hadoop@hc2nn ~]$ mkdir -p hive/udf
[hadoop@hc2nn ~]$ cd hive/udf
In this directory, I have created a file called build.sbt, which the sbt tool uses to compile the UDF. It describes details such as the project name, its version, the organization it belongs to, and the version of Scala installed. Here are the contents of the file, displayed by using the Linux cat command; I have added line numbers to aid understanding:
[hadoop@hc2nn udf]$ cat build.sbt
01 name := "DateConv"
02
03 version := "0.1"
04
05 organization := "nz.co.semtechsolutions"
06
07 scalaVersion := "2.10.4"
08
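With the build.sbt file and the UDF source in place, the next step would typically be to compile the code and package it into a JAR by using sbt from the same directory; for example:

[hadoop@hc2nn udf]$ sbt clean package

sbt writes the packaged JAR file under a target subdirectory of the project, and it is this JAR that is later added to Hive so that the new function can be called from Hive QL statements.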
 