distributed applications.” [36] Instead of building its own coordination
service, HBase uses ZooKeeper. When ZooKeeper is deployed with HBase, there
are some specific configuration considerations to address [37].
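As a sketch of the kind of setting [37] refers to, HBase is pointed at a ZooKeeper ensemble through properties in hbase-site.xml; the property names below are standard HBase settings, while the hostnames are illustrative placeholders:

```xml
<!-- hbase-site.xml: point HBase at an external ZooKeeper ensemble.
     Hostnames are placeholders; 2181 is ZooKeeper's default client port. -->
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>zk1.example.com,zk2.example.com,zk3.example.com</value>
</property>
<property>
  <name>hbase.zookeeper.property.clientPort</name>
  <value>2181</value>
</property>
```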
10.2.4 Mahout
The majority of this chapter has focused on processing, structuring, and storing
large datasets using Apache Hadoop and various parts of its ecosystem. After a
dataset is available in HDFS, the next step may be to apply an analytical technique
presented in Chapters 4 through 9. Tools such as R are useful for analyzing
relatively small datasets, but they may suffer from performance issues with the
large datasets stored in Hadoop. To apply these analytical techniques within the
Hadoop environment, one option is to use Apache Mahout. This Apache project
provides executable Java libraries that apply analytical techniques in a scalable
manner to Big Data. In general, a mahout is a person who controls an elephant.
Apache Mahout is the toolset that directs Hadoop, the elephant in this case, to yield
meaningful analytic results.
Mahout provides Java code that implements the algorithms for several techniques
in the following three categories [38]:
Classification:
• Logistic regression
• Naïve Bayes
• Random forests
• Hidden Markov models
Clustering:
• Canopy clustering
• K-means clustering
• Fuzzy k-means
• Expectation maximization (EM)
Recommenders/collaborative filtering:
• Nondistributed recommenders
• Distributed item-based collaborative filtering
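To make the clustering category concrete, the sketch below implements Lloyd's k-means in plain, single-machine Java. This is not the Mahout API; the class and method names are illustrative. Mahout's distributed k-means runs the same two steps at scale: the assignment step maps each point to its nearest centroid, and the update step averages the points in each cluster to produce new centroids.

```java
import java.util.Arrays;

// Minimal single-machine sketch of the k-means algorithm that Mahout's
// distributed implementation parallelizes over MapReduce.
// Names here are illustrative; this is NOT the Mahout API.
public class KMeansSketch {

    // Repeats assign/update steps until cluster assignments stop changing.
    static int[] kmeans(double[][] points, double[][] centroids, int maxIter) {
        int k = centroids.length;
        int[] assign = new int[points.length];
        for (int iter = 0; iter < maxIter; iter++) {
            boolean changed = false;
            // Assignment step: each point joins its nearest centroid
            // (in Mahout, mappers do this per input split).
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double d = sqDist(points[i], centroids[c]);
                    if (d < bestDist) { bestDist = d; best = c; }
                }
                if (assign[i] != best) { assign[i] = best; changed = true; }
            }
            if (!changed && iter > 0) break;
            // Update step: each centroid moves to the mean of its points
            // (in Mahout, reducers average the points per cluster).
            double[][] sums = new double[k][points[0].length];
            int[] counts = new int[k];
            for (int i = 0; i < points.length; i++) {
                counts[assign[i]]++;
                for (int d = 0; d < points[i].length; d++)
                    sums[assign[i]][d] += points[i][d];
            }
            for (int c = 0; c < k; c++)
                if (counts[c] > 0)
                    for (int d = 0; d < sums[c].length; d++)
                        centroids[c][d] = sums[c][d] / counts[c];
        }
        return assign;
    }

    // Squared Euclidean distance between two points.
    static double sqDist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return s;
    }

    public static void main(String[] args) {
        // Two obvious clusters, around (0,0) and (10,10)
        double[][] pts = {{0,0},{1,1},{0,1},{10,10},{11,11},{10,11}};
        double[][] seeds = {{0,0},{10,10}};
        int[] labels = kmeans(pts, seeds, 20);
        System.out.println(Arrays.toString(labels)); // prints [0, 0, 0, 1, 1, 1]
    }
}
```

Mahout applies exactly this kind of iteration to data in HDFS, where each pass over the points becomes a MapReduce job rather than an in-memory loop.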