Add a reference to the file and the TRANSFORM statement to call the script:

add file C:\SampleData\get_maxValue.py;
SELECT TRANSFORM(s.recdate, s.sensor, s.v1, s.v2, s.v3, s.v4)
USING 'python get_maxValue.py'
AS (recdate, sensor, maxvalue)
FROM speeds s;

The data output should look similar to Figure 9.23.
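The get_maxValue.py script itself is not listed here, but a minimal sketch of what such a streaming script might look like follows. The column order and the max-of-four-values logic are assumptions inferred from the TRANSFORM column list; Hive streams each input row to the script as tab-separated fields on stdin and reads tab-separated output rows from stdout:

```python
import sys

def process_line(line):
    # Hive sends one row per line, fields separated by tabs,
    # in the order listed in the TRANSFORM(...) clause.
    recdate, sensor, v1, v2, v3, v4 = line.strip().split('\t')
    # Emit the largest of the four sensor reading columns.
    max_value = max(float(v1), float(v2), float(v3), float(v4))
    return '\t'.join([recdate, sensor, str(max_value)])

if __name__ == '__main__':
    for line in sys.stdin:
        print(process_line(line))
```

The three output fields line up with the AS (recdate, sensor, maxvalue) clause in the query.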
Creating Your Own UDFs for Hive
As mentioned previously, Hive supports several function types depending on the processing involved. The simplest is the UDF, which takes a single row in, processes it, and returns a single value. The UDAF is a little more involved because it performs an aggregation across input values, reducing the number of rows coming out. The third type you can create is the UDTF, which takes a single row in and expands it into multiple output rows, like a table.
If you followed along in the earlier section on building custom UDFs for Pig, you will find that building UDFs for Hive is a similar experience. First, you create a project in your favorite Java development environment. Then, you add references to the hive-exec.jar and hive-serde.jar files, which are located in the lib subfolder of the hive folder. After you add these references, you add an import statement for the org.apache.hadoop.hive.ql.exec.UDF class and extend it with a custom class:
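A minimal sketch of such a custom class might look like the following. The class name and the upper-casing logic are illustrative, not from the text; what matters is extending UDF and providing an evaluate() method, which Hive discovers and calls by convention:

```java
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// A hypothetical UDF that returns its string argument in upper case.
// Hive matches the evaluate() signature to the column types at query time.
public final class ToUpperUDF extends UDF {
    public Text evaluate(final Text input) {
        // Returning null for null input keeps the UDF null-safe.
        if (input == null) {
            return null;
        }
        return new Text(input.toString().toUpperCase());
    }
}
```

After packaging the class into a JAR, you would register it in Hive with ADD JAR and CREATE TEMPORARY FUNCTION before calling it in a query.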