Database Reference
In-Depth Information
For example, the following Hive query uses the date function to get the day
from the flightdate field:
Select day(flightdate), airport_cd, delay
from flightdata where delay > 100;
Whereas UDFs work on single rows at a time and processing occurs on
the map side of the processing, UDAFs work on buckets of data and are
implemented on the reduce side of the processing. For example, you can use
the built-in UDAF count and max to get the delay counts and maximum
delays by day:
Select day(flightdate), count(*), max(delay)
from flightdata group by day(flightdate);
Another type of function used in Hive is a UDFT. This type of function takes
a single-row input and produces multiple-row outputs. These functions are
useful for taking a column containing an array that needs to be split out into
multiple rows. Hive's built-in UDFTs include the following:
• The explode function, which takes an array as input and splits it out
into multiple rows.
• The json_tuple function, which is useful for querying
JSON-formatted nodes. It takes the JSON node and splits out the child
nodes into separate rows into a virtual table for better processing
performance.
• The parse_url_tuple function, which takes a URL and extracts parts
of a URL string into multiple rows in a table structure.
The file shown in Figure 9.13 contains student data in JSON format.
Figure 9.13 JSON formatted data
To parse the values from the JSON nodes, you use the json_tuple
function in combination with the lateral view operator to create a table:
 
 
Search WWH ::




Custom Search