Hive UDFs
Hive lets you create user-defined functions (UDFs), with which you can extend and customize its functionality. To demonstrate the process, I create a simple date-conversion function.
Suppose that the date columns in the CSV data that was collected have the wrong format for Hive; specifically, the dates follow the format dd/MM/yyyy, whereas they need to be in the format yyyy-MM-dd. Using Java date methods, I can create a simple Java-based UDF to change the date format. Once compiled and packaged into a library (a JAR file), this function can be embedded in Hive QL statements.
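A minimal sketch of what such a UDF might look like follows; it assumes the classic org.apache.hadoop.hive.ql.exec.UDF base class with a single evaluate method, and the package and class names simply anticipate the build.sbt settings shown later in this section, so treat the body as an illustration rather than the exact code used here:

package nz.co.semtechsolutions;

import java.text.ParseException;
import java.text.SimpleDateFormat;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Simple Hive UDF that converts a date string from dd/MM/yyyy to yyyy-MM-dd
public class DateConv extends UDF {

    public Text evaluate(final Text input) {
        if (input == null) {
            return null;
        }
        SimpleDateFormat inFormat  = new SimpleDateFormat("dd/MM/yyyy");
        SimpleDateFormat outFormat = new SimpleDateFormat("yyyy-MM-dd");
        try {
            // Parse the incoming date and re-format it in the style Hive expects
            return new Text(outFormat.format(inFormat.parse(input.toString())));
        } catch (ParseException e) {
            // Return null for any value that does not match the expected format
            return null;
        }
    }
}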
To create a Hive UDF, I must install the Scala sbt interactive build tool on the server to compile the Java UDF
package. I create the example UDF on the server hc1nn, so I need to install the sbt program on that server by using the
Linux root account. I download an rpm package for sbt from the scala-sbt.org website to the /tmp directory on hc1nn,
and I install it from there. The following commands move to the /tmp directory and download the sbt.rpm package by using wget:
[root@hc1nn ~]# cd /tmp
[root@hc1nn ~]# wget http://repo.scala-sbt.org/scalasbt/sbt-native-packages/org/scala-sbt/sbt/0.13.1/sbt.rpm
[root@hc1nn ~]# rpm -ivh sbt.rpm
The final command, rpm, installs the sbt.rpm package with the options i for install, v for verbose output, and h for printing hash marks to show progress. I also install the Java OpenJDK 1.6 development package, as the root Linux user, to support this compilation (because I want access to tools like jar and jps). I use the OpenJDK because I can install it via the yum command, so I don't have to go through a registration process to get it.
[root@hc2nn ~]# yum install java-1.6.0-openjdk-devel
I compile the new Hive UDF as the Linux hadoop user, so I use su to switch to that account:
[root@hc2nn ~]# su - hadoop
Next, I need a directory structure to hold the UDF code, so I create the directories hive/udf under the hadoop account's home directory and then move into the new udf directory:
[hadoop@hc2nn ~]$ mkdir -p hive/udf
[hadoop@hc2nn ~]$ cd hive/udf
In this directory, I have created a file called build.sbt, which the sbt tool uses to compile the UDF. It describes details such as the project name, its version, the organization it belongs to, and the version of Scala installed. Here are the contents of the file, displayed by using the Linux cat command; I have added line numbers to aid understanding:
[hadoop@hc2nn udf]$ cat build.sbt
01 name := "DateConv"
02
03 version := "0.1"
04
05 organization := "nz.co.semtechsolutions"
06
07 scalaVersion := "2.10.4"
08
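With the build.sbt file and the UDF source in place, the next step would typically be to compile the code and package it into a JAR by using sbt from the same directory; for example:

[hadoop@hc2nn udf]$ sbt clean package

sbt writes the packaged JAR file under a target subdirectory of the project, and it is this JAR that is later added to Hive so that the new function can be called from Hive QL statements.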
 