We obtained an illustrative result by processing 10 GB of standard Apache log files
with the Cascading log analysis example [9]. Using Hadoop 0.19.1, Cascading 1.0.9, and
the previously mentioned node configuration, we computed the number of Apache hits
per minute by bucketing the hits in MapReduce jobs. For comparison, we wrote a
naive single-node, hash-based Perl program as an example of the typical quick solution
a sysadmin might create. The results shown in table 12.3 confirm that we easily
achieve linear (or better) speed-up with the simple addition of more nodes to the
cluster. Times are the average of 10 mixed executions, to allow for variance.
Table 12.3 Apache log processing with Cascading

System        Performance measure    Result
1 Node        Runtime                21m46s
              Sec/MB                 0.127
              Sec/MB/Node            0.127
3 Nodes       Runtime                8m3s
              Sec/MB                 0.0471
              Sec/MB/Node            0.0157
15 Nodes      Runtime                1m30s
              Sec/MB                 0.00878
              Sec/MB/Node            0.000585
Naive Perl    Runtime                42m49s
              Sec/MB                 0.251
              Sec/MB/Node            0.251
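The two derived measures in table 12.3 follow directly from the runtimes. The sketch below recomputes them as a sanity check. It assumes that "10 GB" means 10 × 1024 MB, which reproduces the published figures to within rounding. The function names are illustrative and do not come from the original benchmark.

```python
import re

# Assumption: the 10 GB data set is counted as 10 * 1024 MB.
DATASET_MB = 10 * 1024

# (system label, runtime as printed in table 12.3, number of nodes)
RUNS = [
    ("1 Node", "21m46s", 1),
    ("3 Nodes", "8m3s", 3),
    ("15 Nodes", "1m30s", 15),
    ("Naive Perl", "42m49s", 1),
]


def parse_runtime(text):
    """Convert a runtime such as '21m46s' into a number of seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)s", text)
    if match is None:
        raise ValueError("unrecognized runtime: %r" % text)
    minutes, seconds = match.groups()
    return int(minutes) * 60 + int(seconds)


def throughput(runtime, nodes, dataset_mb=DATASET_MB):
    """Return (Sec/MB, Sec/MB/Node) for one row of the table."""
    sec_per_mb = parse_runtime(runtime) / dataset_mb
    return sec_per_mb, sec_per_mb / nodes


if __name__ == "__main__":
    for label, runtime, nodes in RUNS:
        sec_per_mb, sec_per_mb_node = throughput(runtime, nodes)
        print("%-10s  Sec/MB=%.3g  Sec/MB/Node=%.3g"
              % (label, sec_per_mb, sec_per_mb_node))
```

Sec/MB/Node is simply Sec/MB divided by the node count, so it equals Sec/MB for the two single-node rows.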
We see that even the single-node Cascading solution achieves double the throughput
of the naive Perl application. We attribute this to the intelligent segmentation and
bucketing built into the MapReduce framework, as opposed to keeping all of the data
mapped in a single Perl hash. Given familiarity with Cascading, you may also find the
Perl code more complex to optimize (and maintain), to boot!
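To make the contrast concrete, here is a minimal Python sketch of both approaches to counting hits per minute. It is not the authors' Perl program, and it uses neither the Hadoop nor the Cascading API. The first function keeps every bucket in one in-memory hash. The second imitates the map, partition, and reduce steps in plain Python. The regular expression assumes Apache Common Log Format timestamps, and the sample lines and function names are made up for illustration.

```python
import re
import zlib
from collections import Counter

# Captures the timestamp up to the minute, e.g. "10/Oct/2000:13:55",
# from a Common Log Format field such as [10/Oct/2000:13:55:36 -0700].
MINUTE = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}):\d{2} [+-]\d{4}\]")

# Hypothetical sample data, used only to exercise the functions.
SAMPLE = [
    '127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /a HTTP/1.0" 200 2326',
    '127.0.0.1 - - [10/Oct/2000:13:55:59 -0700] "GET /b HTTP/1.0" 200 512',
    '127.0.0.1 - - [10/Oct/2000:13:56:01 -0700] "GET /c HTTP/1.0" 404 0',
    'not a log line',
]


def minute_of(line):
    """Return the per-minute bucket key for a log line, or None."""
    match = MINUTE.search(line)
    return match.group(1) if match else None


def hits_per_minute_naive(lines):
    """Single-hash approach: every bucket lives in one in-memory map."""
    counts = Counter()
    for line in lines:
        minute = minute_of(line)
        if minute is not None:
            counts[minute] += 1
    return counts


def hits_per_minute_partitioned(lines, num_partitions=3):
    """MapReduce-style approach, simulated in a single process."""
    # Map and shuffle: emit (minute, 1) and route it by a hash of the key,
    # so each partition sees only its own subset of the buckets.
    partitions = [[] for _ in range(num_partitions)]
    for line in lines:
        minute = minute_of(line)
        if minute is not None:
            index = zlib.crc32(minute.encode("utf-8")) % num_partitions
            partitions[index].append((minute, 1))
    # Reduce: sum each partition independently, then collect the output.
    totals = Counter()
    for partition in partitions:
        local = Counter()
        for minute, one in partition:
            local[minute] += one
        totals.update(local)
    return totals


if __name__ == "__main__":
    print(hits_per_minute_naive(SAMPLE))
    print(hits_per_minute_partitioned(SAMPLE))
```

In the partitioned version each reducer only ever holds the keys routed to it. On a real cluster, that is what lets the work spread across nodes instead of growing one hash without bound.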
To wit, StumbleUpon uses the native map and reduce functionality in Hadoop and
related products, including Nutch and custom-written content surveyors, to perform
this data retrieval, analysis, and storage. Keeping the resultant data close to the
processing pipeline maximizes our data locality benefits.
Putting it all together, StumbleUpon has taken maximum advantage of the vast
power the MapReduce paradigm unlocks by adopting and extending Hadoop, HDFS,
and HBase. We're excited to help lead the future of distributed processing.
12.4 Building analytics for enterprise search—IBM's Project ES2
Contributed by VUK ERCEGOVAC, RAJASEKAR KRISHNAMURTHY, SRIRAM RAGHAVAN, FREDERICK REISS,
EUGENE SHEKITA, SANDEEP TATA, SHIVAKUMAR VAITHYANATHAN, and HUAIYU ZHU
In contrast with the radical advances in web search over the last several years, search
over enterprise intranets has remained a difficult and largely unsolved problem. Based
9 http://code.google.com/p/cascading/wiki/ApacheLogCascade.
 