almost impossible to handle without using some sort of machine learning approach.
The data sizes are too large and the problems too critical to handle in other ways. But
how do these systems work?
One of the most common approaches to classification is to use an algorithm called a naïve Bayesian classifier. In short, a Bayes model treats incoming data as having a set of features, each feature independent of the others. In the case of spam emails, these features may be the individual words of the email itself. Spam email will tend to feature sensational advertising terms and words in all caps. By applying a probabilistic score to each feature, the Bayes classifier can produce a good estimate of what type of email it is dealing with. If a particular email scores highly on multiple features, it is very likely to be
classified as spam. This model is also indicative of the rich history of machine learning.
Thomas Bayes was a Presbyterian minister and mathematician who lived in England
in the 1700s. Bayes' probabilistic innovations predated not only computer science but
much of modern statistics.
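To make the mechanics concrete, here is a minimal sketch of such a classifier written from scratch in Python. The tiny labeled corpus and the test messages are invented for illustration; a real spam filter would be trained on thousands of labeled emails and would typically use a library implementation rather than this hand-rolled version.

    from collections import Counter
    import math

    def train(examples):
        """Count word occurrences per label; examples is a list of (text, label) pairs."""
        word_counts = {"spam": Counter(), "ham": Counter()}
        label_counts = Counter()
        for text, label in examples:
            label_counts[label] += 1
            word_counts[label].update(text.lower().split())
        return word_counts, label_counts

    def classify(text, word_counts, label_counts):
        """Score each label and return the most likely one."""
        vocab = set(word_counts["spam"]) | set(word_counts["ham"])
        total_docs = sum(label_counts.values())
        best_label, best_score = None, float("-inf")
        for label in label_counts:
            # Start from the prior probability of the label (in log space).
            score = math.log(label_counts[label] / total_docs)
            denom = sum(word_counts[label].values()) + len(vocab)
            for word in text.lower().split():
                # Each word is an independent feature; add-one smoothing
                # keeps unseen words from zeroing out the whole score.
                score += math.log((word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    # A toy training set; real systems learn from large labeled email archives.
    training = [
        ("WIN a FREE prize NOW", "spam"),
        ("LIMITED offer click NOW", "spam"),
        ("meeting agenda for tuesday", "ham"),
        ("lunch plans for this week", "ham"),
    ]
    counts, labels = train(training)
    print(classify("claim your FREE prize now", counts, labels))  # spam
    print(classify("agenda for the week", counts, labels))        # ham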
Like many other machine learning applications, a simple Bayesian classifier has
drawbacks, and certain assumptions need to be met for the algorithm to be effective.
The Bayesian approach can sometimes produce poor results if the training data is not well differentiated or if there are too few examples of one class in the training set. As a result, there are many modifications to the simple Bayesian approach, as well as many other classification algorithms. In practice, the naïve Bayesian approach is not always the most accurate way to classify text, but it is effective and conceptually simple. It also doesn't require a large amount of data to get started and can often produce acceptable results for many applications.
Clustering
Michael Lewis's 2003 book, Moneyball: The Art of Winning an Unfair Game, told the
story of a professional baseball general manager named Billy Beane who had to field a
team that could compete against rosters with a far greater payroll. This forced Beane
to turn to statistical methods to make up for the gap in talent. The book popularized the use of a quantitative approach to sports as Beane searched for overlooked players who could statistically replicate the performance of more highly paid, established stars.
This frugal approach has also been used in realms outside of baseball. In the NBA,
professional basketball players are often classified into positions based on physical attributes rather than statistical behavior. The five traditional positions in basketball are
often related to height. Guards tend to be shorter, faster, and responsible for handling
the ball, whereas centers are tall and responsible for staying near the basket. In 2012,
Stanford student Muthu Alagappan wondered whether these positions had been created due to a bias about height. In Alagappan's view, the game was much more dynamic, and the traditional positions were not grounded in actual player statistics. Using a proprietary algorithm, he lumped players into 13 buckets based on statistical performance in games. Guards were placed into new categories such as defensive ball handler and shooting ball handler. Recognizing that centers, normally the tallest players on the floor,
 