You load the text files under /user/hadoop/edgar/ on HDFS into the Hive table rawdata by using the LOAD DATA statement:
hive> LOAD DATA INPATH '/user/hadoop/edgar/' INTO TABLE rawdata;
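The LOAD DATA INPATH form moves (rather than copies) the files from /user/hadoop/edgar/ into the table's warehouse directory on HDFS. The definition of rawdata itself is not shown here; a minimal sketch, assuming a single STRING column named line (the column the word-count script below relies on), would be:
-- Hypothetical definition of the rawdata table (not shown in the original text):
-- one STRING column that holds each line of raw text.
CREATE TABLE rawdata (line STRING);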
The data in the rawdata table is converted to the word-count table via a CREATE TABLE statement with a subquery. (I have taken the liberty of adding line numbers to the script that follows to better explain it.) With the rawdata table created and populated via the LOAD DATA command, the Hive QL script that carries out the word count is as follows:
hive>
01 > CREATE TABLE wordcount AS
02 > SELECT
03 > word,
04 > count(1) AS count
05 > FROM
06 > (SELECT
07 > EXPLODE(SPLIT(line,' ')) AS word
08 > FROM
09 > rawdata
10 > ) words
11 > GROUP BY word
12 > ORDER BY word;
Lines 6 to 10 create a derived table called words that takes its data from the rawdata table. It does this by splitting each rawdata.line value on space characters and exploding the result into a column called word at line 7, so that the derived table column words.word contains one row per word.
The rest of the Hive QL then groups identical words together (line 11), counts the instances of each (line 4), and orders the output list (line 12).
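Note that EXPLODE can appear directly in the SELECT list here only because it is the sole expression in that subquery; in other contexts Hive requires a LATERAL VIEW. As a sketch of the same word count written that way (my rewording, not from the original text):
-- Equivalent word count using LATERAL VIEW instead of a subquery (a sketch,
-- assuming the same rawdata table with its single `line` STRING column).
CREATE TABLE wordcount AS
SELECT word, count(1) AS count
FROM rawdata
LATERAL VIEW EXPLODE(SPLIT(line, ' ')) words AS word
GROUP BY word
ORDER BY word;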
When you check the word-count table with a SELECT COUNT(*) statement, you find that it contains more than 36,000 rows:
hive> SELECT COUNT(*) FROM wordcount;
OK
36511
You can also narrow the results by adding a WHERE clause to the SELECT. Selecting the rows from the word count where the count is greater than 1,500 gives a short list of the most frequently occurring words in the data:
hive> SELECT
> word,
> count
> FROM
> wordcount
> WHERE
> count > 1500
> ORDER BY
> count;
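Because ORDER BY count sorts ascending, the most frequent words appear at the end of the output. If you want them first, a small variation (a sketch, not from the original session) sorts descending and limits the rows returned:
hive> SELECT
> word,
> count
> FROM
> wordcount
> ORDER BY
> count DESC
> LIMIT 10;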