Data Research and Advanced Data Cleansing with Pig and Hive - Microsoft Big Data Solutions

Database Reference

In-Depth Information

For example, the following Hive query uses the date function to get the day

from the flightdate field:

Select day(flightdate), airport_cd, delay

from flightdata where delay > 100;

Whereas UDFs work on single rows at a time and processing occurs on

the map side of the processing, UDAFs work on buckets of data and are

implemented on the reduce side of the processing. For example, you can use

the built-in UDAF count and max to get the delay counts and maximum

delays by day:

Select day(flightdate), count(*), max(delay)

from flightdata group by day(flightdate);

Another type of function used in Hive is a UDFT. This type of function takes

a single-row input and produces multiple-row outputs. These functions are

useful for taking a column containing an array that needs to be split out into

multiple rows. Hive's built-in UDFTs include the following:

• The explode function, which takes an array as input and splits it out

into multiple rows.

• The json_tuple function, which is useful for querying

JSON-formatted nodes. It takes the JSON node and splits out the child

nodes into separate rows into a virtual table for better processing

performance.

• The parse_url_tuple function, which takes a URL and extracts parts

of a URL string into multiple rows in a table structure.

The file shown in Figure 9.13 contains student data in JSON format.

Figure 9.13 JSON formatted data

To parse the values from the JSON nodes, you use the json_tuple

function in combination with the lateral view operator to create a table:

Search WWH ::

Custom Search

Home