Given the complex nature of machine learning algorithms, the potential for misusing a technique on a particular challenge is strong. Furthermore, within each class of machine learning solutions there are often many algorithmic variations. The machine learning field has many pitfalls, and fallacies abound. As more tools become turnkey and accessible to engineers, the opportunities to derive untenable results from them grow. It's very important to make sure that the machine learning algorithm you choose is a viable model for your specific data challenge. For example, the field of clustering algorithms features an enormous variety of models for grouping similar data points. A k-means clustering algorithm places each data point discretely in a single group, whereas its cousin, fuzzy k-means, can assign a data point to more than one group. The choice of one versus the other depends on how well each addresses the problem being solved. These algorithms are discussed in more detail later in the chapter.
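As a concrete illustration of the difference, the following sketch contrasts hard and soft cluster assignment. It assumes Python with numpy and scikit-learn; because fuzzy k-means is not part of scikit-learn, the fuzzy memberships are computed by hand using the standard fuzzy c-means membership formula, and the toy points are invented for illustration.

# A minimal sketch contrasting hard (k-means) and soft (fuzzy) cluster
# assignment. The fuzzy memberships are computed by hand below, since
# fuzzy k-means is not included in scikit-learn.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9], [3.0, 3.0]])

# Hard assignment: each point belongs to exactly one group.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print("hard labels:", km.labels_)

# Soft assignment: a membership weight in every cluster, using the
# standard fuzzy c-means membership formula with fuzzifier m = 2.
m = 2.0
dist = np.linalg.norm(points[:, None, :] - km.cluster_centers_[None, :, :], axis=2)
dist = np.clip(dist, 1e-9, None)          # avoid division by zero
inv = dist ** (-2.0 / (m - 1.0))
memberships = inv / inv.sum(axis=1, keepdims=True)
print("fuzzy memberships:\n", memberships)

Here the point at (3.0, 3.0), which sits between the two natural clusters, receives a hard label from k-means but a split membership from the fuzzy formulation.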
Providing a sufficient sample size is also important when building statistical models for predictive purposes. In many cases, only a small dataset is necessary to train a machine learning model. For example, Bayesian classification techniques often don't require massive datasets to build adequate predictive models. In some cases, using larger and larger datasets to train machine learning systems may ultimately be a waste of time and resources. Never use a distributed-processing approach unless absolutely necessary.
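To make the sample-size point concrete, here is a minimal sketch of a naive Bayes classifier trained on just four labeled examples. It assumes Python with scikit-learn, and the toy messages and labels are invented for illustration.

# A naive Bayes text classifier trained on a handful of examples can
# already generalize to unseen messages.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = [
    "win a free prize now", "cheap meds limited offer",    # spam
    "meeting moved to friday", "lunch with the team today" # ham
]
train_labels = ["spam", "spam", "ham", "ham"]

# Convert each message into word-count features.
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(train_texts)

model = MultinomialNB()
model.fit(features, train_labels)

test = vectorizer.transform(["free prize offer", "team meeting friday"])
print(model.predict(test))  # expected: ['spam' 'ham']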
The bias-variance trade-off is another fundamental source of pitfalls when considering a machine learning approach. Imagine that we have created a linear regression model. Our data can be plotted on two axes, and our model attempts to describe the relationship between the two variables, assuming that their error components are independent of one another. When building a regression line, our linear model attempts to fit a single straight line through the data points. Not all of our data points will sit on this line (in fact, perhaps very few or none will). This model has a certain type of bias problem: most points won't touch the regression line, and new data applied to the model won't appear on the line either. However, the variance of predicted values will likely be low: for each new value, the distance from the regression line may be quite small. In other words, nearly every predicted value will be incorrect, but each will be incorrect only by a small amount.
A more complex model might produce a curve that touches every point explicitly. However, this decrease in bias means that the model is tightly coupled to the data it was trained on. The bias of the model is very low; it matches the observed data incredibly well, but new values may not fit the model well at all. Consider how effective a predictive model will be for the use cases you face. In some cases, variance can be minimized by using a more highly biased model.
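The trade-off can be demonstrated numerically. The following sketch, which assumes Python with numpy and uses synthetic data, fits both a straight line (high bias, low variance) and a degree-9 polynomial (low bias, high variance) to the same noisy linear data and compares their errors on held-out points.

# Bias versus variance on noisy linear data: a straight line against a
# degree-9 polynomial that can interpolate all ten training points.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + rng.normal(0, 0.2, size=10)    # truth: y = 2x + noise
x_test = np.linspace(0, 1, 100)
y_test = 2 * x_test + rng.normal(0, 0.2, size=100)

for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_err:.4f}, test MSE {test_err:.4f}")

The degree-9 fit nearly interpolates the training points (training error close to zero) but typically generalizes worse than the simpler, more biased line.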
Bayesian Classification
If you have an email address, chances are that you are already participating in a massive, worldwide, cyberspace arms race. The battleground is your inbox, and the bad guys are fighting with spam. Without proper spam filtering, email systems would become useless messes full of unwanted advertising. Use cases like spam detection are