distributed applications.” [36] Instead of building its own coordination
service, HBase uses ZooKeeper. When ZooKeeper is deployed with HBase, there
are some specific configuration considerations to address [37].
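As a sketch of the kind of setting [37] refers to, HBase is pointed at a ZooKeeper ensemble through properties in hbase-site.xml; the property names below are standard HBase settings, while the hostnames are illustrative placeholders:

```xml
<!-- hbase-site.xml: point HBase at an external ZooKeeper ensemble.
     Hostnames are placeholders; 2181 is ZooKeeper's default client port. -->
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>zk1.example.com,zk2.example.com,zk3.example.com</value>
</property>
<property>
  <name>hbase.zookeeper.property.clientPort</name>
  <value>2181</value>
</property>
```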
10.2.4 Mahout
The majority of this chapter has focused on processing, structuring, and storing
large datasets using Apache Hadoop and various parts of its ecosystem. After a
dataset is available in HDFS, the next step may be to apply an analytical technique
presented in Chapters 4 through 9. Tools such as R are useful for analyzing
relatively small datasets, but they may suffer from performance issues with the
large datasets stored in Hadoop. To apply these analytical techniques within the
Hadoop environment, one option is to use Apache Mahout. This Apache project
provides executable Java libraries that apply analytical techniques in a scalable
manner to Big Data. In general, a mahout is a person who controls an elephant.
Apache Mahout is the toolset that directs Hadoop, the elephant in this case, to yield
meaningful analytic results.
Mahout provides Java code that implements the algorithms for several techniques
in the following three categories [38]:
Classification:
• Logistic regression
• Naïve Bayes
• Random forests
• Hidden Markov models
Clustering:
• Canopy clustering
• K-means clustering
• Fuzzy k-means
• Expectation maximization (EM)
Recommenders/collaborative filtering:
• Nondistributed recommenders
• Distributed item-based collaborative filtering
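To make the clustering category concrete, the sketch below implements Lloyd's k-means in plain, single-machine Java. This is not the Mahout API; the class and method names are illustrative. Mahout's distributed k-means runs the same two steps at scale: the assignment step maps each point to its nearest centroid, and the update step averages the points in each cluster to produce new centroids.

```java
import java.util.Arrays;

// Minimal single-machine sketch of the k-means algorithm that Mahout's
// distributed implementation parallelizes over MapReduce.
// Names here are illustrative; this is NOT the Mahout API.
public class KMeansSketch {

    // Repeats assign/update steps until cluster assignments stop changing.
    static int[] kmeans(double[][] points, double[][] centroids, int maxIter) {
        int k = centroids.length;
        int[] assign = new int[points.length];
        for (int iter = 0; iter < maxIter; iter++) {
            boolean changed = false;
            // Assignment step: each point joins its nearest centroid
            // (in Mahout, mappers do this per input split).
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double d = sqDist(points[i], centroids[c]);
                    if (d < bestDist) { bestDist = d; best = c; }
                }
                if (assign[i] != best) { assign[i] = best; changed = true; }
            }
            if (!changed && iter > 0) break;
            // Update step: each centroid moves to the mean of its points
            // (in Mahout, reducers average the points per cluster).
            double[][] sums = new double[k][points[0].length];
            int[] counts = new int[k];
            for (int i = 0; i < points.length; i++) {
                counts[assign[i]]++;
                for (int d = 0; d < points[i].length; d++)
                    sums[assign[i]][d] += points[i][d];
            }
            for (int c = 0; c < k; c++)
                if (counts[c] > 0)
                    for (int d = 0; d < sums[c].length; d++)
                        centroids[c][d] = sums[c][d] / counts[c];
        }
        return assign;
    }

    // Squared Euclidean distance between two points.
    static double sqDist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return s;
    }

    public static void main(String[] args) {
        // Two obvious clusters, around (0,0) and (10,10)
        double[][] pts = {{0,0},{1,1},{0,1},{10,10},{11,11},{10,11}};
        double[][] seeds = {{0,0},{10,10}};
        int[] labels = kmeans(pts, seeds, 20);
        System.out.println(Arrays.toString(labels)); // prints [0, 0, 0, 1, 1, 1]
    }
}
```

Mahout applies exactly this kind of iteration to data in HDFS, where each pass over the points becomes a MapReduce job rather than an in-memory loop.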