Hive - Hadoop: The Definitive Guide

Database Reference

In-Depth Information

In this case, there is only one file, sample.txt , but in general there can be more, and Hive

will read all of them when querying the table.

The OVERWRITE keyword in the LOAD DATA statement tells Hive to delete any existing

files in the directory for the table. If it is omitted, the new files are simply added to the

table's directory (unless they have the same names, in which case they replace the old

files).

Now that the data is in Hive, we can run a query against it:

hive> SELECT year, MAX(temperature)

> FROM records

> WHERE temperature != 9999 AND quality IN (0, 1, 4, 5, 9)

> GROUP BY year;

1949 111

1950 22

This SQL query is unremarkable. It is a SELECT statement with a GROUP BY clause for

grouping rows into years, which uses the MAX aggregate function to find the maximum

temperature for each year group. The remarkable thing is that Hive transforms this query

into a job, which it executes on our behalf, then prints the results to the console. There are

some nuances, such as the SQL constructs that Hive supports and the format of the data

that we can query — and we explore some of these in this chapter — but it is the ability to

execute SQL queries against our raw data that gives Hive its power.

Search WWH ::

Custom Search

Home