were more than just tall, they were organized into positions such as paint protector and
scoring rebounder. This type of understanding could help NBA teams take a Moneyball
approach to the game. Usually, tall centers are the hardest players to find and are paid
accordingly. Perhaps instead of choosing a more expensive center, a team could get
the same results from a shorter, under-valued player who falls into one of these new
categories.
Alagappan's analysis is an example of cluster analysis. Clustering problems are those
in which groups are inherent in the data: some individual data points are more similar
to certain data points than to others. Cluster analysis sorts the data into
groups. As mentioned earlier, one of the most popular and simple clustering algorithms
is known as k-means. Besides having a catchy name, k-means clustering is a fast way
to group data points into a pre-determined number of clusters. For example, imagine a
set of points described by two variables. The k-means approach randomly chooses k data
points as the initial cluster centers and then assigns every other point to its nearest
center. From these assignments, new centers are computed as the average of the points
in each cluster, and the process repeats. The process is iterative and continues until
the assignments stop changing, at which point further iterations no longer reduce the
within-cluster sum-of-squares value. This results in a collection of clustered data
points, each of which falls into exactly one group. There are many variations of this
approach, including fuzzy k-means clustering, which allows data points to fall into
multiple groups. The major drawback to k-means clustering is that one must know how
many clusters should be created before applying the algorithm. In other words, this
would not have been helpful in the NBA example given previously. Luckily, there are
many other, more complex clustering algorithms available.
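The k-means loop described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: the function names, the seed parameter, and the six sample points are all assumptions made up for this example, and a real project would more likely use an existing library.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise average of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, k, max_iterations=100, seed=0):
    """Group points into k clusters; returns (centers, labels, sum_of_squares)."""
    rng = random.Random(seed)
    # Step 1: pick k of the data points at random as the initial centers.
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iterations):
        # Step 2: assign every point to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Stop once no point changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: move each center to the average of its assigned points.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = mean_point(members)
    sum_of_squares = sum(
        squared_distance(p, centers[label]) for p, label in zip(points, labels)
    )
    return centers, labels, sum_of_squares


# Hypothetical sample data: two visibly separate groups in two variables.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, labels, sum_of_squares = k_means(points, k=2)
print(labels, sum_of_squares)
```

Note that k is passed in by the caller, which is exactly the drawback mentioned above. The result can also depend on the random starting centers, so in practice the algorithm is often run several times and the run with the smallest sum of squares is kept.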
Despite the complexity of the clustering-algorithm space, it's becoming easier and
easier to run these algorithms over collected data without understanding exactly what
is being produced. Just as with classification algorithms such as the Bayesian approach,
it's important to understand what clustering problem is being considered and what the
data challenge is. Are you trying to find new categories of customer? Or are you trying
to figure out how similar a particular data point is to others? Each of these problems
requires a different approach, as well as a different clustering algorithm.
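The second question, how similar one data point is to others, does not need clusters at all and can be approached by ranking points by distance. The sketch below assumes Euclidean distance as the similarity measure and uses made-up customer records; both choices are illustrative, and other measures may suit other data better.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def most_similar(target, others, n=3):
    """Return the names of the n entries in `others` closest to `target`."""
    ranked = sorted(
        others.items(), key=lambda item: euclidean_distance(target, item[1])
    )
    return [name for name, _ in ranked[:n]]


# Hypothetical customers described by two variables on a similar scale.
customers = {
    "ana": (1.0, 2.0),
    "ben": (1.5, 1.8),
    "cho": (5.0, 8.0),
    "dev": (6.0, 9.0),
}
nearest = most_similar((1.2, 1.9), customers, n=2)
print(nearest)
```

When the variables are measured on very different scales, they would normally be rescaled first so that no single variable dominates the distance.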
Recommendation Engines
One particular problem domain has captured the attention of machine learning
experts, a problem into which they can pour all the global computer power at their
disposal. What is this critical social problem? Why, it's movie ratings, of course! More
specifically, the problem is that of being able to recommend movies based on a
particular user's past viewing history.
In 2006, online movie service Netflix announced that it would hold a public
competition to discover an algorithm that could beat its current recommendation
engine. So valuable is this feature to Netflix that the prize offered was substantial:
one million dollars to the team of researchers who could best the existing engine by
a certain accuracy percentage. The Netflix contest also generated a great deal of
controversy. Although Netflix attempted to anonymize the data used in the competition,
 