Other daily price data files are uploaded in a similar fashion. To avoid the sequential and tedious
upload of 36 different files, you can automate the task with a shell script like the one in
Listing 11-1.
LISTING 11-1: infochimps_nyse_data_loader.sh (available for download on Wrox.com)
#!/bin/bash
# Set the MONGODB_HOME environment variable to point to the MongoDB
# installation folder before running this script.
FILES=./infochimps_dataset_4778_download_16677/NYSE/NYSE_daily_prices_*.csv
for f in $FILES
do
  echo "Processing $f file..."
  ls -l $f
  $MONGODB_HOME/bin/mongoimport --type csv --db mydb --collection nyse \
    --headerline $f
done
infochimps_nyse_data_loader.sh
Once the data is uploaded, you can verify the format by querying for a single document as follows:
> db.nyse.findOne();
{
  "_id" : ObjectId("4d519529e883c3755b5f7760"),
  "exchange" : "NYSE",
  "stock_symbol" : "FDI",
  "date" : "1997-02-28",
  "stock_price_open" : 11.11,
  "stock_price_high" : 11.11,
  "stock_price_low" : 11.01,
  "stock_price_close" : 11.01,
  "stock_volume" : 4200,
  "stock_price_adj_close" : 4.54
}
Next, MapReduce can be used to manipulate the collection. Let the first task be to find the
highest stock price for each stock across the entire data set, which spans the period between 1970 and 2010.
MapReduce has two parts: a map function and a reduce function. The two functions are applied to
data sequentially, though the underlying system frequently runs computations in parallel. Map takes
in a key/value pair and emits another key/value pair. Reduce takes the output of the map phase and
manipulates the key/value pairs to derive the final result. A map function is applied to each item
in a collection. Collections can be large and distributed across multiple physical machines. A map
function runs on each subset of a collection local to a distributed node. The map operation on one
node is completely independent of a similar operation on another node. This clear isolation provides
effective parallel processing and allows you to rerun a map function on a subset in cases of failure.
After a map function has run on the entire collection, values are emitted and provided as input to the
reduce phase. The MapReduce framework takes care of collecting and sorting the output from
the multiple nodes and making it available from one phase to the other.
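The highest-price-per-stock task described above can be sketched as a pair of JavaScript functions. In the mongo shell these would be passed to db.nyse.mapReduce(map, reduce, { out: ... }); here a tiny in-memory harness (an assumption, standing in for the server) and a few made-up sample documents let the map/reduce logic be exercised directly:

```javascript
// emit() is normally provided by the MongoDB server inside the map
// function's scope; here it is a module-level shim set by the harness.
let emit;

// Map: emit the stock symbol as the key and the day's high as the value.
// The server invokes map with the current document bound to `this`.
function map() {
  emit(this.stock_symbol, this.stock_price_high);
}

// Reduce: collapse all emitted highs for one symbol to the maximum.
function reduce(key, values) {
  return Math.max(...values);
}

// Minimal in-memory harness, purely for illustration. Like MongoDB,
// it skips reduce when a key has only a single emitted value.
function runMapReduce(docs, mapFn, reduceFn) {
  const groups = new Map();
  emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  for (const doc of docs) mapFn.call(doc); // doc becomes `this` in map
  const results = {};
  for (const [key, values] of groups) {
    results[key] = values.length === 1 ? values[0] : reduceFn(key, values);
  }
  return results;
}

// Sample documents shaped like the nyse collection (values invented).
const docs = [
  { stock_symbol: "FDI", stock_price_high: 11.11 },
  { stock_symbol: "FDI", stock_price_high: 12.5 },
  { stock_symbol: "AA",  stock_price_high: 40.38 },
];

console.log(runMapReduce(docs, map, reduce));
// { FDI: 12.5, AA: 40.38 }
```

Note that reduce must be able to run repeatedly over partial results (the maximum of maximums is still the maximum), which is what lets the server combine outputs from many nodes in any order.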