I start out by using MapReduce to run a few queries that involve aggregate functions like sum,
maximum, minimum, and average. The publicly available NYSE daily market data for the period
between 1970 and 2010 is used for the example. Because the data is aggregated on a daily basis,
only one data point represents a single trading day for a stock. Therefore, the data set is not large,
and certainly not large enough to be classified as big data. The example focuses on the essential
mechanics of MapReduce, so the size doesn't really matter. I use two document databases, MongoDB
and CouchDB, in this example. The concept of MapReduce is not specific to these products and
applies to a large variety of NoSQL products, including sorted, ordered column-family stores and
distributed key/value maps. I start with document databases because they require the least amount
of effort around installation and setup and are easy to test in local standalone mode. MapReduce
with Hadoop and HBase is covered later in this chapter.
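Before turning to the databases themselves, the essential map/shuffle/reduce mechanics can be sketched in plain Python. The record shape and values below are hypothetical, chosen only to mirror one-document-per-stock-per-day market data; the aggregate applied in the reduce phase is the maximum:

```python
# Hypothetical sample records: one document per stock per trading day.
records = [
    {"stock_symbol": "AA", "close": 10.5},
    {"stock_symbol": "AA", "close": 11.2},
    {"stock_symbol": "AB", "close": 7.8},
]

# Map phase: emit (key, value) pairs -- here, (symbol, closing price).
mapped = [(r["stock_symbol"], r["close"]) for r in records]

# Shuffle phase: group the emitted values by key.
grouped = {}
for key, value in mapped:
    grouped.setdefault(key, []).append(value)

# Reduce phase: collapse each key's values with an aggregate function.
max_close = {key: max(values) for key, values in grouped.items()}
print(max_close)  # {'AA': 11.2, 'AB': 7.8}
```

Swapping `max` for `min`, `sum`, or a mean computation yields the other aggregates; the surrounding pipeline stays the same, which is what makes MapReduce a general pattern rather than a product feature.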
To get started, download the zipped archive files for the daily NYSE market data from 1970 to
2010 from http://infochimps.com/datasets/daily-1970-2010-open-close-hi-low-and-volume-nyse-exchange.
Extract the zip file to a local folder. The unzipped NYSE data set contains a number of files,
among them two types: daily market data files and dividend data files. For the sake of simplicity,
I upload only the daily market data files into a database collection. This means the only files you
need from the set are those whose names start with NYSE_daily_prices_ and end in a letter. The
files that have a number appended to the end contain only header information and so can be skipped.
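A small filter along these lines picks out the daily price files and skips the number-suffixed, header-only ones. The file listing here is hypothetical; only the naming convention is taken from the data set:

```python
import re

# Hypothetical listing of the unzipped archive contents.
filenames = [
    "NYSE_daily_prices_A.csv",
    "NYSE_daily_prices_Z.csv",
    "NYSE_daily_prices_1.csv",   # number-suffixed: header only, skip
    "NYSE_dividends_A.csv",      # dividend data, skip
]

# Keep only daily price files whose suffix is a single letter.
wanted = [f for f in filenames
          if re.fullmatch(r"NYSE_daily_prices_[A-Z]\.csv", f)]
print(wanted)  # ['NYSE_daily_prices_A.csv', 'NYSE_daily_prices_Z.csv']
```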
The database and collection in MongoDB are named mydb and nyse, respectively. The database in
CouchDB is named nyse. The data is available in comma-separated values (.csv) format, so
I leverage the mongoimport utility to import this data set into a MongoDB collection. Later in this
chapter, I use a Python script to load the same .csv files into CouchDB.
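The core of such a loader is turning each .csv row into a document keyed by the header fields. A minimal sketch follows; the two-line sample and its column names are illustrative assumptions, not the full schema of the data set:

```python
import csv
import io

# Hypothetical two-line excerpt of a daily prices file; the header row
# supplies the document field names.
sample = io.StringIO(
    "exchange,stock_symbol,date,stock_price_close\n"
    "NYSE,AA,2010-02-08,12.79\n"
)

# DictReader maps each row to a dict using the header line as keys.
docs = list(csv.DictReader(sample))
print(docs[0]["stock_symbol"])  # AA

# With a running CouchDB instance and the couchdb-python package, each
# row-dict could then be saved into the nyse database, for example:
#   import couchdb
#   db = couchdb.Server()["nyse"]
#   for doc in docs:
#       db.save(dict(doc))
```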
The mongoimport utility and its output, when uploading NYSE_daily_prices_A.csv, are as
follows:
~/Applications/mongodb/bin/mongoimport --type csv --db mydb --collection nyse \
--headerline NYSE_daily_prices_A.csv
connected to: 127.0.0.1
4981480/40990992 12%
89700 29900/second
10357231/40990992 25%
185900 30983/second
15484231/40990992 37%
278000 30888/second
20647430/40990992 50%
370100 30841/second
25727124/40990992 62%
462300 30820/second
30439300/40990992 74%
546600 30366/second
35669019/40990992 87%
639600 30457/second
40652285/40990992 99%
729100 30379/second
imported 735027 objects