Data Research and Advanced Data Cleansing with Pig and Hive - Microsoft Big Data Solutions


            return false;
    }
    else
        return false;
}
catch(Exception e)
{
    throw new IOException("Caught exception processing input row ", e);
}

The class extends FilterFunc and implements an exec function. The function first confirms that the tuple passed in is not null and has exactly one member. It then checks that the member is an integer, returning true if the value is greater than zero and false otherwise.
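The checks described above can be sketched without the Pig libraries. This is a minimal illustration under stated assumptions, not the book's listing: the class name PositiveIntegerCheck and the helper isPositiveInteger are hypothetical, and a plain List<Object> stands in for Pig's Tuple so that the example compiles on its own. In an actual UDF, the same logic would sit inside the exec(Tuple) method of a class that extends FilterFunc.

```java
import java.io.IOException;
import java.util.List;

// Dependency-free sketch of the filter logic.
// Assumption: List<Object> stands in for Pig's Tuple; the class and method
// names here are illustrative and are not part of the Pig API.
class PositiveIntegerCheck {

    // Returns true only for a one-field row holding an Integer greater than zero.
    static boolean isPositiveInteger(List<Object> input) {
        // Reject a missing row or one that does not have exactly one field.
        if (input == null || input.size() != 1) {
            return false;
        }
        Object field = input.get(0);
        // Reject anything that is not an Integer (a null field also fails here).
        if (!(field instanceof Integer)) {
            return false;
        }
        return (Integer) field > 0;
    }

    // Mirrors the shape of a filter UDF's exec: any unexpected failure
    // is wrapped in an IOException, as in the fragment above.
    static Boolean exec(List<Object> input) throws IOException {
        try {
            return isPositiveInteger(input);
        } catch (Exception e) {
            throw new IOException("Caught exception processing input row ", e);
        }
    }
}
```

Keeping the checks in a small helper that throws no checked exceptions makes them easy to unit test before they are wired into a UDF.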

Other UDF types include aggregation, load, and store functions. The functions shown here are bare-bones implementations. In practice, you also need to consider error handling, progress reporting, and output schema typing. For more information on custom UDF creation, consult the UDF manual on the Apache Pig wiki ( http://wiki.apache.org/pig/UDFManual ).

Using Hive

Another tool available to create and run map-reduce jobs in Hadoop is Hive. One of the major advantages of Hive is that it creates a relational database layer over the data files. Using this paradigm, you can work with the data using traditional querying techniques, which is very beneficial if you have a SQL background. In addition, you do not have to worry about how the query is translated into the map-reduce job. A query engine works out the most efficient way of loading and aggregating the data.

In the following sections you will gain an understanding of how to perform

advanced data analysis with Hive. First you will look at the different types

of built-in Hive functions available. Next, you will see how to extend Hive
