I start out by using MapReduce to run a few queries that involve aggregate functions like sum,
maximum, minimum, and average. The publicly available NYSE daily market data for the period
between 1970 and 2010 is used for the example. Because the data is aggregated on a daily basis,
only one data point represents a single trading day for a stock. Therefore, the data set is not large,
and certainly not large enough to be classified as big data. The example focuses on the essential
mechanics of MapReduce, so the size doesn't really matter. I use two document databases, MongoDB
and CouchDB, in this example. The concept of MapReduce is not specific to these products and
applies to a large variety of NoSQL products, including sorted, ordered column-family stores and
distributed key/value maps. I start with document databases because they require the least amount
of effort around installation and setup and are easy to test in local standalone mode. MapReduce
with Hadoop and HBase is covered later in this chapter.
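Before turning to the databases themselves, the essential map/shuffle/reduce mechanics can be sketched in plain Python. The record shape and values below are hypothetical, chosen only to mirror one-document-per-stock-per-day market data; the aggregate applied in the reduce phase is the maximum:

```python
# Hypothetical sample records: one document per stock per trading day.
records = [
    {"stock_symbol": "AA", "close": 10.5},
    {"stock_symbol": "AA", "close": 11.2},
    {"stock_symbol": "AB", "close": 7.8},
]

# Map phase: emit (key, value) pairs -- here, (symbol, closing price).
mapped = [(r["stock_symbol"], r["close"]) for r in records]

# Shuffle phase: group the emitted values by key.
grouped = {}
for key, value in mapped:
    grouped.setdefault(key, []).append(value)

# Reduce phase: collapse each key's values with an aggregate function.
max_close = {key: max(values) for key, values in grouped.items()}
print(max_close)  # {'AA': 11.2, 'AB': 7.8}
```

Swapping `max` for `min`, `sum`, or a mean computation yields the other aggregates; the surrounding pipeline stays the same, which is what makes MapReduce a general pattern rather than a product feature.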
To get started, download the zipped archive files for the daily NYSE market data from 1970 to
2010 from http://infochimps.com/datasets/daily-1970-2010-open-close-hi-low-and-volume-nyse-exchange.
Extract the zip file to a local folder. The unzipped NYSE data set contains a number of files,
among them two types: daily market data files and dividend data files. For the sake of simplicity,
I upload only the daily market data files into a database collection. This means the only files you
need from the set are those whose names start with NYSE_daily_prices_ and end in a letter. The
files that have a number appended to the end contain only header information and so can be skipped.
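A small filter along these lines picks out the daily price files and skips the number-suffixed, header-only ones. The file listing here is hypothetical; only the naming convention is taken from the data set:

```python
import re

# Hypothetical listing of the unzipped archive contents.
filenames = [
    "NYSE_daily_prices_A.csv",
    "NYSE_daily_prices_Z.csv",
    "NYSE_daily_prices_1.csv",   # number-suffixed: header only, skip
    "NYSE_dividends_A.csv",      # dividend data, skip
]

# Keep only daily price files whose suffix is a single letter.
wanted = [f for f in filenames
          if re.fullmatch(r"NYSE_daily_prices_[A-Z]\.csv", f)]
print(wanted)  # ['NYSE_daily_prices_A.csv', 'NYSE_daily_prices_Z.csv']
```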
The database and collection in MongoDB are named mydb and nyse, respectively. The database in
CouchDB is named nyse. The data is available in comma-separated values (.csv) format, so
I leverage the mongoimport utility to import this data set into a MongoDB collection. Later in this
chapter, I use a Python script to load the same .csv files into CouchDB.
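The core of such a loader is turning each .csv row into a document keyed by the header fields. A minimal sketch follows; the two-line sample and its column names are illustrative assumptions, not the full schema of the data set:

```python
import csv
import io

# Hypothetical two-line excerpt of a daily prices file; the header row
# supplies the document field names.
sample = io.StringIO(
    "exchange,stock_symbol,date,stock_price_close\n"
    "NYSE,AA,2010-02-08,12.79\n"
)

# DictReader maps each row to a dict using the header line as keys.
docs = list(csv.DictReader(sample))
print(docs[0]["stock_symbol"])  # AA

# With a running CouchDB instance and the couchdb-python package, each
# row-dict could then be saved into the nyse database, for example:
#   import couchdb
#   db = couchdb.Server()["nyse"]
#   for doc in docs:
#       db.save(dict(doc))
```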
The mongoimport utility and its output, when uploading NYSE_daily_prices_A.csv, are as
follows:
~/Applications/mongodb/bin/mongoimport --type csv --db mydb --collection nyse \
--headerline NYSE_daily_prices_A.csv
connected to: 127.0.0.1
4981480/40990992 12%
89700 29900/second
10357231/40990992 25%
185900 30983/second
15484231/40990992 37%
278000 30888/second
20647430/40990992 50%
370100 30841/second
25727124/40990992 62%
462300 30820/second
30439300/40990992 74%
546600 30366/second
35669019/40990992 87%
639600 30457/second
40652285/40990992 99%
729100 30379/second
imported 735027 objects