Hive is an appropriate choice when the following conditions exist:
• Data is already in HDFS. (Note: Non-HDFS files can be loaded into a Hive
table.)
• Developers are comfortable with SQL programming and queries.
• There is a desire to partition datasets based on time. (For example, daily
updates are added to the Hive table; see the sketch after this list.)
• Batch processing is acceptable.
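To make the time-partitioning point concrete, the following is a minimal sketch
of a date-partitioned Hive table. The table name web_logs, its columns, and the
partition column log_date are hypothetical and are not part of the customer
example developed below.
hive> create table web_logs (
          ip string,
          url string,
          http_status int)
      partitioned by (log_date string)
      row format delimited
      fields terminated by '\t';
Each daily update can then be loaded into its own partition, and queries that
filter on log_date scan only the relevant partitions rather than the full
dataset.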
The remainder of the Hive discussion covers some HiveQL basics. From the
command prompt, a user enters the interactive Hive environment by simply
entering hive:
$ hive
hive>
From this environment, a user can define new tables, query them, or summarize
their contents. To illustrate how to use HiveQL, the following example defines a
new Hive table to hold customer data, loads existing HDFS data into the Hive
table, and queries the table.
The first step is to create a table called customer to store customer details.
Because the table will be populated from an existing tab ('\t')-delimited HDFS file,
this format is specified in the table creation statement.
hive> create table customer (
          cust_id bigint,
          first_name string,
          last_name string,
          email_address string)
      row format delimited
      fields terminated by '\t';
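Although not shown in the original example, the table definition can be
verified with Hive's describe command. The output below is representative of
what Hive prints (column names and their types) rather than captured from a
live session.
hive> describe customer;
cust_id         bigint
first_name      string
last_name       string
email_address   string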
The following HiveQL query is executed to count the number of records in the
newly created table, customer. Because the table is currently empty, the query
returns a result of zero, shown on the last line of the provided output. The
query is converted and run as a MapReduce job, resulting in one map task and
one reduce task being executed.
hive> select count(*) from customer;
Total MapReduce jobs = 1
...
0
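The load step itself does not appear in this excerpt, but populating the table
from the existing HDFS file uses Hive's load data statement. The path
/user/customer/customer.txt below is a hypothetical location for the
tab-delimited file.
hive> load data inpath '/user/customer/customer.txt'
      into table customer;
Note that load data inpath moves the file into Hive's warehouse location rather
than copying it; rerunning the count query after the load would then report the
number of records in the file.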