BACK TO MOVIE RATINGS
In Chapter 6, you learned about querying NoSQL stores. In that chapter, I leveraged a freely available
movie ratings data set to illustrate the query mechanisms available in NoSQL stores, especially in
MongoDB. Let's revisit that data set and use Hive to manipulate it. You may benefit from reviewing
the MovieLens example in Chapter 6 before you move forward.
You can download the MovieLens data set that contains 1 million ratings with the following command:
curl -O http://www.grouplens.org/system/files/million-ml-data.tar__0.gz
Extract the tarball and you should get the following files:
README
movies.dat
ratings.dat
users.dat
The ratings.dat file contains rating data, one rating data point per line. Each
data point in the ratings file is structured in the following format:
UserID::MovieID::Rating::Timestamp.
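To make the record layout concrete, here is a minimal shell sketch that splits one line on the :: delimiter with awk. The sample line and the variable names (line, parsed) are assumptions for illustration only; they are not part of the original example.

```shell
# Sample line in the UserID::MovieID::Rating::Timestamp layout (illustrative only)
line='1::1193::5::978300760'

# Use :: as the awk field separator and label each of the four fields
parsed=$(printf '%s\n' "$line" | awk -F'::' '{ printf "user=%s movie=%s rating=%s timestamp=%s", $1, $2, $3, $4 }')

echo "$parsed"
```

The four fields come out in the same order as in the format string above: user, movie, rating, and timestamp.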
The ratings, movies, and users data in the MovieLens data set are separated by
:: . I had trouble getting the Hive loader to correctly parse and load the data
using this delimiter, so I chose to replace :: with # throughout each file. I simply
opened each file in vi and replaced all occurrences of the delimiter :: with #
using the following command:
:%s/::/#/g
Once the delimiter was modified, I saved the results to new files, each with
.hash_delimited appended to its old name. Therefore, I had three new files:
ratings.dat.hash_delimited
movies.dat.hash_delimited
users.dat.hash_delimited
I used the new files as the source data. The original .dat files were left as is.
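If you prefer not to edit the files interactively, the same substitution can probably be done with sed. The following is a minimal sketch, not the method used above: it works on a small sample file in a temporary directory (the sample lines and the workdir variable are assumptions for illustration) and writes the result to a new file with .hash_delimited appended, leaving the original untouched.

```shell
# Build a small sample file in a scratch directory (illustrative data only)
workdir=$(mktemp -d)
printf '1::1193::5::978300760\n1::661::3::978302109\n' > "$workdir/ratings.dat"

# Replace every occurrence of :: with # and write to a new file;
# the original ratings.dat is not modified
sed 's/::/#/g' "$workdir/ratings.dat" > "$workdir/ratings.dat.hash_delimited"

cat "$workdir/ratings.dat.hash_delimited"
```

Against the real data, the equivalent command would be along the lines of sed 's/::/#/g' ratings.dat > ratings.dat.hash_delimited, repeated for movies.dat and users.dat.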
Load the data into a Hive table that follows the same schema as in the downloaded ratings data file.
That means first create a Hive table with the same schema:
hive> CREATE TABLE ratings(
> userid INT,
> movieid INT,