Building a Data Classification System with Mahout - Data Just Right: Introduction to Large-Scale Data and Analytics


were more than just tall, they were organized into positions such as paint protector and

scoring rebounder. This type of understanding could help NBA teams take a Moneyball

approach to the game. Usually, tall centers are the hardest players to find and are paid

accordingly. Perhaps instead of choosing a more expensive center, a team could get

the same results from a shorter, under-valued player who falls into one of these new

categories.

Alagappan's analysis is an example of cluster analysis. Clustering problems are those

in which groups are inherent in the data, so that some individual data points are

more similar to certain data points than to others. Cluster analysis sorts the data into

groups. As mentioned earlier, one of the most popular and simple clustering algorithms

is known as k-means. Besides having a catchy name, k-means clustering is a fast way

to group data points into a pre-determined number of clusters. For example, imagine a

set of points described by two variables. The k-means approach randomly chooses k data

points as the initial cluster centers and then assigns every other point to its nearest

center. Each center is then moved to the mean of the points assigned to it, and the process

repeats. The process is iterative and continues until the within-cluster sum-of-squares value

stops decreasing. This results in a collection of clustered data points that can only

fall into a single group. There are many variations of this approach, including fuzzy

k-means clustering that allows data points to fall into multiple groups. The major

drawback to k-means clustering is that one must know how many clusters should be

created before applying the algorithm. In other words, k-means alone would not have been

helpful in the NBA example given previously, where the categories were not known in advance.

Luckily, there are many other, more complex clustering algorithms available.
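
The k-means loop described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not Mahout's API: the function names (k_means, squared_distance, mean_point), the fixed random seed, and the six sample points are all assumptions made for the example. It picks k data points at random as starting centers, assigns each point to its nearest center, moves each center to the mean of its points, and stops when no point changes cluster.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iterations=100, seed=0):
    """Cluster `points` into `k` groups; returns (centers, labels, sum_of_squares)."""
    rng = random.Random(seed)
    # Start from k data points chosen at random (hypothetical default seed).
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # no point changed cluster, so the centers will not move again
        labels = new_labels
        # Update step: move each center to the mean of its assigned points.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = mean_point(members)
    # Within-cluster sum of squares: the quantity k-means tries to reduce.
    sum_of_squares = sum(
        squared_distance(p, centers[label]) for p, label in zip(points, labels)
    )
    return centers, labels, sum_of_squares


# Made-up sample data: two visibly separate groups of two-variable points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, labels, sum_of_squares = k_means(points, k=2)
```

Note that k has to be supplied by the caller, which is exactly the drawback mentioned above. One common workaround is to run the function for several values of k and compare the returned sum-of-squares values, although the result still depends on the random starting centers and is not guaranteed to be the best possible grouping.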

Despite the complexity of the clustering-algorithm space, it's becoming easier and

easier to run these algorithms over collected data without understanding exactly what is being

produced. As with classification algorithms such as the Bayesian approach, it's

important to understand what the clustering problem being considered is and what the

data challenge is. Are you trying to find new categories of customer? Or are you trying

to figure out how similar a particular data point is to others? Each of these problems

requires a different approach, as well as a different clustering algorithm.

Recommendation Engines

One particular problem domain has captured the attention of machine learning

experts, a problem into which they can pour all the global computer power at their

disposal. What is this critical social problem? Why, it's movie ratings, of course! More

specifically, the problem is that of being able to recommend movies based on a particular

user's past viewing history.

In 2006, online movie service Netflix announced that it would hold a public

competition to discover an algorithm that could beat its current recommendation

engine. So valuable is this feature to Netflix that the prize offered was substantial:

one million dollars to the team of researchers who could best the existing engine by

a certain accuracy percentage. The Netflix contest also generated a great deal of controversy.

Although Netflix attempted to anonymize the data used in the competition,
