Data Research and Advanced Data Cleansing with Pig and Hive - Microsoft Big Data Solutions


Extending Hive with Map-reduce Scripts

There are times when you need a custom data-processing transformation that
is not easy to achieve using HiveQL but fairly easy to do with a scripting
language. This is particularly useful when the transform produces a
different number of columns or rows than the input. For example, you may
want to split an input column into several output columns using
string-parsing functions. Another example is a column containing a set of
key/value pairs that need to be split out into their own rows.
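The key/value case can be sketched in a few lines of Python. This is only an illustration: the function name explode_pairs, the ";" and "=" separators, and the sample record are assumptions, not taken from the book.

```python
def explode_pairs(record_id, packed, pair_sep=";", kv_sep="="):
    """Yield one (record_id, key, value) tuple per key/value pair.

    Assumes a hypothetical packed format such as "color=red;size=L".
    """
    for pair in packed.split(pair_sep):
        if not pair:
            continue  # skip empty fragments, e.g. from a trailing separator
        key, _, value = pair.partition(kv_sep)
        yield (record_id, key.strip(), value.strip())


# One input row becomes two output rows.
rows = list(explode_pairs("42", "color=red;size=L"))
```

Each tuple would be written out as its own row, so the script emits more rows than it reads.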

The input values sent to the script consist of tab-delimited strings, and
the output values should also come back as tab-delimited strings. Any null
values sent to the script are converted to the literal string \N to
differentiate them from empty strings.
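A script therefore has to split each line on tabs and treat \N as a null marker rather than as data. A minimal sketch of that handling follows; the helper names decode_row and encode_row are hypothetical.

```python
NULL_MARKER = "\\N"  # backslash followed by N, as described above


def decode_row(line):
    """Split one tab-delimited input line; map the null marker to None."""
    fields = line.rstrip("\r\n").split("\t")
    return [None if field == NULL_MARKER else field for field in fields]


def encode_row(values):
    """Join values into one tab-delimited output line; None becomes the null marker."""
    return "\t".join(NULL_MARKER if value is None else str(value) for value in values)
```

Keeping the marker distinct from the empty string means a row such as ["a", None, ""] survives a round trip through the script unchanged.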

Although technically you can create your script in any scripting language,
Perl and Python seem to be the most popular. The code shown in Figure 9.16
is an example Python script that takes in a column formatted as hh:mm:ss
and splits it into separate columns for hour, minute, and second.

Figure 9.16 Python script for splitting time
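The figure itself is not reproduced here. As a stand-in, the following is a minimal sketch of what such a script might look like, not the book's actual listing. It assumes the time is the first input field, that any remaining fields (such as a log level) are passed through unchanged, and that malformed or null times are emitted as three \N columns.

```python
#!/usr/bin/env python
import sys

NULL = "\\N"  # text marker used for null values


def split_time(value):
    """Return (hour, minute, second) strings for an hh:mm:ss value."""
    parts = value.split(":")
    if value == NULL or len(parts) != 3:
        # Assumption: malformed or null times become three null columns.
        return (NULL, NULL, NULL)
    return tuple(parts)


def transform_line(line):
    """Split the first field as a time; pass the remaining fields through."""
    fields = line.rstrip("\r\n").split("\t")
    return "\t".join(list(split_time(fields[0])) + fields[1:])


def main():
    # Rows arrive one per line on standard input, tab-delimited.
    for line in sys.stdin:
        print(transform_line(line))


if __name__ == "__main__":
    main()
```

With this sketch, an input line of 10:15:30 and INFO separated by a tab would come back as four tab-delimited fields: 10, 15, 30, and INFO.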

To call this script from HiveQL, you use the TRANSFORM clause. You need to
provide the TRANSFORM clause, the input data, the output columns, and the
map-reduce script file. The following code uses the previous script. It
takes an input of a time column and a log level and parses the time.
Figure 9.17 shows the output:

add file c:\sampledata\split_time.py;

SELECT TRANSFORM(l.t4, l.t2) USING 'python
