Analyzing Big Data with Hive - Professional NoSQL - page 239

BACK TO MOVIE RATINGS

In Chapter 6, you learned about querying NoSQL stores. In that chapter, I leveraged a freely available movie ratings data set to illustrate the query mechanisms available in NoSQL stores, especially in MongoDB. Let's revisit that data set and use Hive to manipulate it. You may benefit from reviewing the MovieLens example in Chapter 6 before you move forward.

You can download the MovieLens data set, which contains 1 million ratings, with the following command:

curl -O http://www.grouplens.org/system/files/million-ml-data.tar__0.gz

Extract the tarball and you should get the following files:

➤ README
➤ movies.dat
➤ ratings.dat
➤ users.dat
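The extraction step could be scripted along the following lines. This is a minimal sketch, not a command from the original text: the extract_archive helper and the movielens target directory are hypothetical names chosen for illustration, and it assumes the archive was saved under the file name used in the curl command. Whether the files land directly in the target directory or in a subdirectory depends on how the tarball was packed.

```shell
#!/bin/sh
# Hypothetical helper: unpack a gzipped tarball ($1) into a target
# directory ($2), creating the directory first if it does not exist.
extract_archive() {
    mkdir -p "$2"
    tar -xzf "$1" -C "$2"
}

# Assumption: the download kept the file name from the curl command.
# "movielens" is an arbitrary directory name chosen for this sketch.
if [ -f million-ml-data.tar__0.gz ]; then
    extract_archive million-ml-data.tar__0.gz movielens
fi
```

Extracting into a dedicated directory keeps the data files apart from anything else in the working directory.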

The ratings.dat file contains rating data where each line contains one rating data point. Each data point in the ratings file is structured in the following format: UserID::MovieID::Rating::Timestamp.
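To make the record layout concrete, the following sketch splits a single line on the :: delimiter and labels the four fields. The parse_rating function is a hypothetical helper, and the sample line is included purely for illustration; the field order follows the UserID::MovieID::Rating::Timestamp format described above.

```shell
#!/bin/sh
# Hypothetical helper: split one ratings line on "::" and label the fields.
# awk treats a multi-character -F value as a regular expression, so "::"
# matches exactly the two-colon delimiter.
parse_rating() {
    printf '%s\n' "$1" |
        awk -F'::' '{ printf "user=%s movie=%s rating=%s ts=%s\n", $1, $2, $3, $4 }'
}

# Illustrative sample line in UserID::MovieID::Rating::Timestamp order.
parse_rating '1::1193::5::978300760'
# prints: user=1 movie=1193 rating=5 ts=978300760
```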

The ratings, movies, and users data in the MovieLens data set is separated by ::. I had trouble getting the Hive loader to correctly parse and load the data using this delimiter, so I chose to replace :: with # throughout each file. I simply opened each file in vi and replaced all occurrences of the :: delimiter with # using the following command:

:%s/::/#/g

Once the delimiter was modified, I saved the results to new files, each with .hash_delimited appended to its old name. Therefore, I had three new files:

➤ ratings.dat.hash_delimited
➤ movies.dat.hash_delimited
➤ users.dat.hash_delimited

I used the new files as the source data. The original .dat files were left as is.
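The same delimiter replacement can be done without opening an editor. The sketch below is an alternative to the vi approach described here, not the method used in the original text: it applies the same s/::/#/g substitution with sed and writes each result to a new file with .hash_delimited appended, leaving the original untouched. The hash_delimit function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: replace every "::" with "#" in the file named by $1
# and write the result to <file>.hash_delimited. The input is not modified.
hash_delimit() {
    sed 's/::/#/g' "$1" > "$1.hash_delimited"
}

# Convert whichever of the three data files exist in the current directory.
for f in ratings.dat movies.dat users.dat; do
    if [ -f "$f" ]; then
        hash_delimit "$f"
    fi
done
```

If a field already contains a # character, the converted file becomes ambiguous, so it may be worth running grep -c '#' against each original file before converting.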

Load the data into a Hive table that follows the same schema as the downloaded ratings data file. To do that, first create a Hive table with the same schema:

hive> CREATE TABLE ratings(
    > userid INT,
    > movieid INT,
