In this case, there is only one file, sample.txt, but in general there can be more, and Hive will read all of them when querying the table.
The OVERWRITE keyword in the LOAD DATA statement tells Hive to delete any existing files in the directory for the table. If it is omitted, the new files are simply added to the table's directory (unless they have the same names, in which case they replace the old files).
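The file-level effect described above can be pictured with a small Python model. This is only a sketch of the behavior as the text states it, not Hive's own code: the load_data helper, the warehouse/records directory, and the file more.txt are hypothetical names made up for the illustration, and the model copies files on the local filesystem where a real load may move them instead.

```python
import shutil
import tempfile
from pathlib import Path


def load_data(source, table_dir, overwrite=False):
    """Toy model of the file-level effect described in the text (not Hive's code)."""
    table_dir.mkdir(parents=True, exist_ok=True)
    if overwrite:
        # OVERWRITE: delete whatever files the table directory already holds.
        for existing in table_dir.iterdir():
            if existing.is_file():
                existing.unlink()
    # Add the new file; a file with the same name is replaced.
    shutil.copy(source, table_dir / source.name)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        table_dir = root / "warehouse" / "records"  # hypothetical table directory
        first = root / "sample.txt"
        second = root / "more.txt"                  # hypothetical second file
        first.write_text("1950\t22\t1\n")
        second.write_text("1949\t111\t1\n")

        load_data(first, table_dir)
        load_data(second, table_dir)                 # no OVERWRITE: file is added
        appended = sorted(p.name for p in table_dir.iterdir())

        load_data(second, table_dir, overwrite=True)  # OVERWRITE: old files removed
        overwritten = sorted(p.name for p in table_dir.iterdir())
        return appended, overwritten


appended, overwritten = demo()
print(appended)
print(overwritten)
```

Without the overwrite flag the directory ends up holding both files; with it, only the newly loaded file remains.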
Now that the data is in Hive, we can run a query against it:
hive> SELECT year, MAX(temperature)
    > FROM records
    > WHERE temperature != 9999 AND quality IN (0, 1, 4, 5, 9)
    > GROUP BY year;
1949    111
1950    22
This SQL query is unremarkable. It is a SELECT statement with a GROUP BY clause for grouping rows into years, which uses the MAX aggregate function to find the maximum temperature for each year group. The remarkable thing is that Hive transforms this query into a job, which it executes on our behalf, then prints the results to the console. There are some nuances, such as the SQL constructs that Hive supports and the format of the data that we can query (we explore some of these in this chapter), but it is the ability to execute SQL queries against our raw data that gives Hive its power.
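To see what the query computes without a Hive installation, the same statement can be run against a tiny in-memory table. The sketch below uses Python's built-in sqlite3 module purely as a stand-in for Hive. The rows are made-up sample values chosen to be consistent with the output shown above, the column types are assumptions, and the ORDER BY clause is an addition that makes the row order deterministic.

```python
import sqlite3

# Made-up rows (year, temperature, quality); not the contents of the real sample file.
rows = [
    ("1950", 0, 1),
    ("1950", 22, 1),
    ("1950", -11, 1),
    ("1949", 111, 1),
    ("1949", 78, 1),
    ("1950", 9999, 1),  # missing reading, excluded by temperature != 9999
    ("1949", 200, 2),   # quality code not in the accepted list, excluded
]

conn = sqlite3.connect(":memory:")
# Column types are assumed for this illustration.
conn.execute("CREATE TABLE records (year TEXT, temperature INTEGER, quality INTEGER)")
conn.executemany("INSERT INTO records VALUES (?, ?, ?)", rows)

query = """
    SELECT year, MAX(temperature)
    FROM records
    WHERE temperature != 9999 AND quality IN (0, 1, 4, 5, 9)
    GROUP BY year
    ORDER BY year
"""
result = conn.execute(query).fetchall()
for year, max_temp in result:
    print(year, max_temp)
conn.close()
```

The filter drops the two invalid rows before grouping, so each year's maximum is taken over valid readings only. What Hive adds is that it runs this kind of query as a job over the files in the table's directory rather than over a local database.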