Hive is an appropriate choice when the following conditions exist:
• Data is already in HDFS. (Note: Non-HDFS files can be loaded into a Hive
table.)
• Developers are comfortable with SQL programming and queries.
• There is a desire to partition datasets based on time. (For example, daily
updates are added to the Hive table; see the sketch after this list.)
• Batch processing is acceptable.
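To make the time-partitioning point concrete, the following is a minimal sketch
of a date-partitioned Hive table. The table name web_logs, its columns, and the
partition column log_date are hypothetical and are not part of the customer
example developed below.
hive> create table web_logs (
          ip string,
          url string,
          http_status int)
      partitioned by (log_date string)
      row format delimited
      fields terminated by '\t';
Each daily update can then be loaded into its own partition, and queries that
filter on log_date scan only the relevant partitions rather than the full
dataset.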
The remainder of the Hive discussion covers some HiveQL basics. From the
command prompt, a user enters the interactive Hive environment by simply
entering hive:
$ hive
hive>
From this environment, a user can define new tables, query them, or summarize
their contents. To illustrate how to use HiveQL, the following example defines a
new Hive table to hold customer data, loads existing HDFS data into the Hive
table, and queries the table.
The first step is to create a table called customer to store customer details.
Because the table will be populated from an existing tab ('\t')-delimited HDFS file,
this format is specified in the table creation statement.
hive> create table customer (
          cust_id bigint,
          first_name string,
          last_name string,
          email_address string)
      row format delimited
      fields terminated by '\t';
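Although not shown in the original example, the table definition can be
verified with Hive's describe command. The output below is representative of
what Hive prints (column names and their types) rather than captured from a
live session.
hive> describe customer;
cust_id         bigint
first_name      string
last_name       string
email_address   string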
The following HiveQL query is executed to count the number of records in the
newly created table, customer. Because the table is currently empty, the query
returns a result of zero, shown on the last line of the provided output. The
query is converted and run as a MapReduce job, resulting in one map task and
one reduce task being executed.
hive> select count(*) from customer;
Total MapReduce jobs = 1
...
0
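The load step itself does not appear in this excerpt, but populating the table
from the existing HDFS file uses Hive's load data statement. The path
/user/customer/customer.txt below is a hypothetical location for the
tab-delimited file.
hive> load data inpath '/user/customer/customer.txt'
      into table customer;
Note that load data inpath moves the file into Hive's warehouse location rather
than copying it; rerunning the count query after the load would then report the
number of records in the file.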