almost impossible to handle without using some sort of machine learning approach.
The data sizes are too large and the problems too critical to handle in other ways. But
how do these systems work?
One of the most common approaches to classification is to use an algorithm called a naïve Bayesian classifier. In short, a Bayes model treats incoming data as having a set of features, each feature independent of the others. In the case of spam emails, these features may be the individual words of the email itself. Spam email will tend to feature sensational advertising terms and words in all caps. By applying a probabilistic score to each feature, the Bayes classifier can produce a good estimate of what type of email it is dealing with. If a particular email scores highly on multiple features, it is very likely to be
classified as spam. This model is also indicative of the rich history of machine learning.
Thomas Bayes was a Presbyterian minister and mathematician who lived in England
in the 1700s. Bayes' probabilistic innovations predated not only computer science but
much of modern statistics.
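To make the mechanics concrete, here is a minimal sketch of such a classifier written from scratch in Python. The tiny labeled corpus and the test messages are invented for illustration; a real spam filter would be trained on thousands of labeled emails and would typically use a library implementation rather than this hand-rolled version.

    from collections import Counter
    import math

    def train(examples):
        """Count word occurrences per label; examples is a list of (text, label) pairs."""
        word_counts = {"spam": Counter(), "ham": Counter()}
        label_counts = Counter()
        for text, label in examples:
            label_counts[label] += 1
            word_counts[label].update(text.lower().split())
        return word_counts, label_counts

    def classify(text, word_counts, label_counts):
        """Score each label and return the most likely one."""
        vocab = set(word_counts["spam"]) | set(word_counts["ham"])
        total_docs = sum(label_counts.values())
        best_label, best_score = None, float("-inf")
        for label in label_counts:
            # Start from the prior probability of the label (in log space).
            score = math.log(label_counts[label] / total_docs)
            denom = sum(word_counts[label].values()) + len(vocab)
            for word in text.lower().split():
                # Each word is an independent feature; add-one smoothing
                # keeps unseen words from zeroing out the whole score.
                score += math.log((word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    # A toy training set; real systems learn from large labeled email archives.
    training = [
        ("WIN a FREE prize NOW", "spam"),
        ("LIMITED offer click NOW", "spam"),
        ("meeting agenda for tuesday", "ham"),
        ("lunch plans for this week", "ham"),
    ]
    counts, labels = train(training)
    print(classify("claim your FREE prize now", counts, labels))  # spam
    print(classify("agenda for the week", counts, labels))        # ham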
Like many other machine learning applications, a simple Bayesian classifier has
drawbacks, and certain assumptions need to be met for the algorithm to be effective.
The Bayesian approach can sometimes produce poor results if the training data is not well differentiated or if there are too few examples of one class in the training set. As a result, there are many modifications to the simple Bayesian approach, as well as many other classification algorithms. In practice, the naïve Bayesian approach is not always the most accurate way to classify text, but it is effective and conceptually simple. It also doesn't require a large amount of data to get started and can often produce acceptable results for many applications.
Clustering
Michael Lewis's 2003 book, Moneyball: The Art of Winning an Unfair Game, told the
story of a professional baseball general manager named Billy Beane who had to field a
team that could compete against rosters with a far greater payroll. This forced Beane
to turn to statistical methods to make up for the gap in talent. The book popularized the use of a quantitative approach to sports as Beane searched for overlooked players who could statistically replicate the performance of more highly paid, established stars.
This frugal approach has also been used in realms outside of baseball. In the NBA,
professional basketball players are often classified into positions based on physical attributes rather than statistical behavior. The five traditional positions in basketball are
often related to height. Guards tend to be shorter, faster, and responsible for handling
the ball, whereas centers are tall and responsible for staying near the basket. In 2012,
Stanford student Muthu Alagappan wondered whether these positions had been created due to a bias about height. In Alagappan's view, the game was much more dynamic, and the traditional positions were not grounded in actual player statistics. Using a proprietary algorithm, he lumped players into 13 buckets based on statistical performance in games. Guards were placed into new categories such as defensive ball handler and shooting ball handler. Recognizing that centers, normally the tallest players on the floor,
 