In this case, there is only one file, sample.txt, but in general there can be more, and Hive
will read all of them when querying the table.
The OVERWRITE keyword in the LOAD DATA statement tells Hive to delete any existing
files in the directory for the table. If it is omitted, the new files are simply added to the
table's directory (unless they have the same names, in which case they replace the old
files).
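The difference between loading with and without OVERWRITE can be sketched in a few lines of Python. This is only a model of the behaviour described above, not Hive's implementation: the load_data function, the warehouse/records directory layout, and the more.txt file are hypothetical names chosen for illustration, and the file is copied rather than moved for simplicity.

```python
import shutil
import tempfile
from pathlib import Path


def load_data(src, table_dir, overwrite=False):
    """Model LOAD DATA [OVERWRITE] as described in the text (hypothetical helper)."""
    table_dir.mkdir(parents=True, exist_ok=True)
    if overwrite:
        # OVERWRITE: delete any existing files in the table's directory.
        for existing in table_dir.iterdir():
            if existing.is_file():
                existing.unlink()
    # Add the new file; a file with the same name replaces the old one.
    shutil.copy(src, table_dir / src.name)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        table_dir = root / "warehouse" / "records"  # assumed layout
        first = root / "sample.txt"
        second = root / "more.txt"                  # hypothetical second file
        first.write_text("1950\t22\t1\n")
        second.write_text("1949\t111\t1\n")

        # Two loads without OVERWRITE: both files end up in the directory.
        load_data(first, table_dir)
        load_data(second, table_dir)
        appended = sorted(p.name for p in table_dir.iterdir())

        # A load with OVERWRITE: the directory is cleared first.
        load_data(first, table_dir, overwrite=True)
        overwritten = sorted(p.name for p in table_dir.iterdir())
        return appended, overwritten


if __name__ == "__main__":
    print(demo())
```

Because a query reads every file in the table's directory, the first state would expose rows from both files, while the second would expose only the rows in sample.txt.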
Now that the data is in Hive, we can run a query against it:
hive> SELECT year, MAX(temperature)
    > FROM records
    > WHERE temperature != 9999 AND quality IN (0, 1, 4, 5, 9)
    > GROUP BY year;
1949    111
1950    22
This SQL query is unremarkable. It is a SELECT statement with a GROUP BY clause for
grouping rows into years, which uses the MAX aggregate function to find the maximum
temperature for each year group. The remarkable thing is that Hive transforms this query
into a job, which it executes on our behalf, then prints the results to the console. There are
some nuances, such as the SQL constructs that Hive supports and the format of the data
that we can query — and we explore some of these in this chapter — but it is the ability to
execute SQL queries against our raw data that gives Hive its power.
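To make the semantics of the query concrete, here is a minimal Python sketch of the same filter, grouping, and MAX aggregation. It is not how Hive executes the query; it only shows what result is computed. The rows in RECORDS are hypothetical sample data, chosen to be consistent with the output shown above, and the function name is likewise an assumption.

```python
# Hypothetical (year, temperature, quality) rows; not the actual contents of sample.txt.
RECORDS = [
    ("1950", 0, 1),
    ("1950", 22, 1),
    ("1950", -11, 1),
    ("1950", 9999, 1),   # missing reading, excluded by temperature != 9999
    ("1949", 111, 1),
    ("1949", 78, 1),
    ("1949", 120, 2),    # quality code 2, excluded by the IN (...) list
]

MISSING = 9999
GOOD_QUALITY = {0, 1, 4, 5, 9}


def max_temperature_by_year(records):
    """Equivalent of SELECT year, MAX(temperature) ... GROUP BY year."""
    result = {}
    for year, temperature, quality in records:
        # WHERE clause: drop missing readings and suspect quality codes.
        if temperature == MISSING or quality not in GOOD_QUALITY:
            continue
        # GROUP BY year with MAX: keep the largest temperature seen per year.
        if year not in result or temperature > result[year]:
            result[year] = temperature
    return result


if __name__ == "__main__":
    for year, temperature in sorted(max_temperature_by_year(RECORDS).items()):
        print(year, temperature)
```

With these sample rows the sketch yields 111 for 1949 and 22 for 1950, matching the Hive session above.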