Other daily price data files are uploaded in a similar fashion. To avoid the sequential and tedious
upload of 36 different files, you can automate the task with a shell script like the one in
Listing 11-1.
LISTING 11-1: infochimps_nyse_data_loader.sh (available for download on Wrox.com)
#!/bin/bash
# Set the MONGODB_HOME environment variable to point to the MongoDB
# installation folder before running this script.
FILES=./infochimps_dataset_4778_download_16677/NYSE/NYSE_daily_prices_*.csv
for f in $FILES
do
  echo "Processing $f file..."
  ls -l $f
  $MONGODB_HOME/bin/mongoimport --type csv --db mydb --collection nyse \
    --headerline $f
done
infochimps_nyse_data_loader.sh
Once the data is uploaded, you can verify the format by querying for a single document as follows:
> db.nyse.findOne();
{
  "_id" : ObjectId("4d519529e883c3755b5f7760"),
  "exchange" : "NYSE",
  "stock_symbol" : "FDI",
  "date" : "1997-02-28",
  "stock_price_open" : 11.11,
  "stock_price_high" : 11.11,
  "stock_price_low" : 11.01,
  "stock_price_close" : 11.01,
  "stock_volume" : 4200,
  "stock_price_adj_close" : 4.54
}
Next, MapReduce can be used to manipulate the collection. Let the first task be to find the
highest stock price for each stock across the entire data set, which spans the period between 1970 and 2010.
MapReduce has two parts: a map function and a reduce function. The two functions are applied to
data sequentially, though the underlying system frequently runs computations in parallel. Map takes
in a key/value pair and emits another key/value pair. Reduce takes the output of the map phase and
manipulates the key/value pairs to derive the final result. A map function is applied to each item
in a collection. Collections can be large and distributed across multiple physical machines. A map
function runs on each subset of a collection local to a distributed node. The map operation on one
node is completely independent of a similar operation on another node. This clear isolation provides
effective parallel processing and allows you to rerun a map function on a subset in cases of failure.
After a map function has run on the entire collection, values are emitted and provided as input to the
reduce phase. The MapReduce framework takes care of collecting and sorting the output from
the multiple nodes and making it available from one phase to the other.
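The highest-price-per-stock task described above can be sketched as a pair of JavaScript functions. In the mongo shell these would be passed to db.nyse.mapReduce(map, reduce, { out: ... }); here a tiny in-memory harness (an assumption, standing in for the server) and a few made-up sample documents let the map/reduce logic be exercised directly:

```javascript
// emit() is normally provided by the MongoDB server inside the map
// function's scope; here it is a module-level shim set by the harness.
let emit;

// Map: emit the stock symbol as the key and the day's high as the value.
// The server invokes map with the current document bound to `this`.
function map() {
  emit(this.stock_symbol, this.stock_price_high);
}

// Reduce: collapse all emitted highs for one symbol to the maximum.
function reduce(key, values) {
  return Math.max(...values);
}

// Minimal in-memory harness, purely for illustration. Like MongoDB,
// it skips reduce when a key has only a single emitted value.
function runMapReduce(docs, mapFn, reduceFn) {
  const groups = new Map();
  emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  for (const doc of docs) mapFn.call(doc); // doc becomes `this` in map
  const results = {};
  for (const [key, values] of groups) {
    results[key] = values.length === 1 ? values[0] : reduceFn(key, values);
  }
  return results;
}

// Sample documents shaped like the nyse collection (values invented).
const docs = [
  { stock_symbol: "FDI", stock_price_high: 11.11 },
  { stock_symbol: "FDI", stock_price_high: 12.5 },
  { stock_symbol: "AA",  stock_price_high: 40.38 },
];

console.log(runMapReduce(docs, map, reduce));
// { FDI: 12.5, AA: 40.38 }
```

Note that reduce must be able to run repeatedly over partial results (the maximum of maximums is still the maximum), which is what lets the server combine outputs from many nodes in any order.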