You load the text files under /user/hadoop/edgar/ on HDFS into the Hive table rawdata by using the LOAD DATA statement:
hive> LOAD DATA INPATH '/user/hadoop/edgar/' INTO TABLE rawdata;
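The LOAD DATA INPATH form moves (rather than copies) the files from /user/hadoop/edgar/ into the table's warehouse directory on HDFS. The definition of rawdata itself is not shown here; a minimal sketch, assuming a single STRING column named line (the column the word-count script below relies on), would be:
-- Hypothetical definition of the rawdata table (not shown in the original text):
-- one STRING column that holds each line of raw text.
CREATE TABLE rawdata (line STRING);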
The data in the rawdata table is converted to the word-count table via a CREATE TABLE statement with a subquery. (I have taken the liberty of adding line numbers to the script that follows to better explain it.) With the rawdata table created and populated via the LOAD DATA command, the Hive QL script that carries out the word count is as follows:
hive>
01 > CREATE TABLE wordcount AS
02 > SELECT
03 > word,
04 > count(1) AS count
05 > FROM
06 > (SELECT
07 > EXPLODE(SPLIT(line,' ')) AS word
08 > FROM
09 > rawdata
10 > ) words
11 > GROUP BY word
12 > ORDER BY word;
Lines 6 to 10 create a derived table called words that takes its data from the rawdata table. It does this by splitting each rawdata.line value on space characters and exploding the result into a column called word at line 7, so that the derived table column words.word contains one row per word.
The rest of the Hive QL then groups identical words together (line 11), counts the instances of each (line 4), and orders the output list (line 12).
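Note that EXPLODE can appear directly in the SELECT list here only because it is the sole expression in that subquery; in other contexts Hive requires a LATERAL VIEW. As a sketch of the same word count written that way (my rewording, not from the original text):
-- Equivalent word count using LATERAL VIEW instead of a subquery (a sketch,
-- assuming the same rawdata table with its single `line` STRING column).
CREATE TABLE wordcount AS
SELECT word, count(1) AS count
FROM rawdata
LATERAL VIEW EXPLODE(SPLIT(line, ' ')) words AS word
GROUP BY word
ORDER BY word;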
When you check the word-count table with a SELECT COUNT(*) statement, you find that it contains more than 36,000 rows:
hive> SELECT COUNT(*) FROM wordcount;
OK
36511
You can also narrow the results by adding a WHERE clause to the SELECT. Selecting the rows from the word count where the count is greater than 1,500 gives a short list of the most frequently occurring words in the data:
hive> SELECT
> word,
> count
> FROM
> wordcount
> WHERE
> count > 1500
> ORDER BY
> count;
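Because ORDER BY count sorts ascending, the most frequent words appear at the end of the output. If you want them first, a small variation (a sketch, not from the original session) sorts descending and limits the rows returned:
hive> SELECT
> word,
> count
> FROM
> wordcount
> ORDER BY
> count DESC
> LIMIT 10;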