...
... return { average:(sum/count) };
... };
> var average_rating_per_movie = db.ratings.mapReduce(map, reduce);
> db[average_rating_per_movie.result].find();
movielens_queries.txt
MapReduce allows you to write many types of sophisticated aggregation algorithms, some of which
were presented in this section. A few others are introduced later in the book.
By now you have had a chance to understand many ways of querying MongoDB collections. Next,
you get a chance to familiarize yourself with querying tabular databases. HBase is used to illustrate
the querying mechanism.
ACCESSING DATA FROM COLUMN-ORIENTED DATABASES LIKE HBASE
Before you get into querying an HBase data store, you need to store some data in it. As with
MongoDB, you have already had a first taste of storing and accessing data in HBase and its
underlying filesystem, which often defaults to the Hadoop Distributed FileSystem (HDFS). You are
also aware of HBase and Hadoop basics. This section builds on that basic familiarity. As a working
example, historical daily stock market data from the NYSE, from the 1970s through February 2010, is
loaded into an HBase instance. The loaded data set is then accessed using an HBase-style querying
mechanism. The historical market data is collated from original sources by Infochimps and can be
accessed at www.infochimps.com/datasets/nyse-daily-1970-2010-open-close-high-low-and-volume.
The Historical Daily Market Data
The zipped-up download of the entire data set is substantial at 199 MB but very small by HDFS and
HBase standards. The HBase and Hadoop infrastructures are capable of and often used for dealing
with petabytes of data that span multiple physical machines. I chose an easily manageable data set
for the example as I intentionally want to avoid getting distracted by the immensity of preparing and
loading up a large data set for now. This chapter is about the query methods in NoSQL stores
and the focus in this section is on column-oriented databases. Understanding data access in smaller
data sets is more manageable and the concepts apply equally well to larger amounts of data.
The data fields are partitioned logically into three different types:
The combination of exchange, stock symbol, and date, which serves as the unique ID
The open, high, low, close, and adjusted close values, which are measures of price
The daily volume
The row-key can be created using a combination of the exchange, stock symbol, and date. So
NYSE,AA,2008-02-27 could be structured as NYSEAA20080227 to be a row-key for the data. All
price-related information can be stored in a column-family named price, and volume data can be
stored in a column-family named volume.
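The row-key and column-family layout just described can be sketched in a few lines of JavaScript. This is only an illustration under stated assumptions, not the loading code used for the example: the helper names makeRowKey and toCells, the column qualifiers (open, high, low, close, adj_close, daily), and the sample values are all hypothetical. Only the row-key format and the two column-family names, price and volume, come from the text.

```javascript
// Hypothetical helper: builds a row-key such as NYSEAA20080227 from an
// exchange, a stock symbol, and a date written as YYYY-MM-DD.
function makeRowKey(exchange, symbol, isoDate) {
  // Drop the dashes so the date becomes a compact YYYYMMDD suffix.
  return exchange + symbol + isoDate.replace(/-/g, "");
}

// Hypothetical helper: maps one daily record onto family:qualifier cells.
// The family names (price, volume) follow the text; the qualifier names
// are assumptions made for this sketch.
function toCells(record) {
  return {
    rowKey: makeRowKey(record.exchange, record.symbol, record.date),
    cells: {
      "price:open": String(record.open),
      "price:high": String(record.high),
      "price:low": String(record.low),
      "price:close": String(record.close),
      "price:adj_close": String(record.adjClose),
      "volume:daily": String(record.volume)
    }
  };
}

// Made-up sample values, used only to exercise the helpers.
var sample = {
  exchange: "NYSE", symbol: "AA", date: "2008-02-27",
  open: 1.5, high: 2.5, low: 1.0, close: 2.0, adjClose: 2.0, volume: 1000
};

var row = toCells(sample);
console.log(row.rowKey); // prints NYSEAA20080227
console.log(Object.keys(row.cells).join(", "));
```

Because HBase keeps rows sorted by row-key, placing the date last in YYYYMMDD form means the rows for a given exchange and symbol are stored next to each other in chronological order.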