Given the complex nature of machine learning algorithms, the potential for misusing a technique on a particular challenge is strong. Furthermore, within each class of machine learning solutions there are often many algorithmic variations. The machine learning field has many pitfalls, and fallacies abound. As more tools become turnkey and accessible to engineers, the opportunities to derive untenable results from them grow. It's very important to make sure that the machine learning algorithm you choose is a viable model for your specific data challenge. For example, the field of clustering algorithms features an enormous variety of models for grouping similar data points. A k-means clustering algorithm places each data point discretely in a single group, whereas its cousin, fuzzy k-means, can assign a data point to more than one group. The choice of one versus the other depends on how well each addresses the problem being solved. These algorithms are discussed in more detail later in the chapter.
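As a concrete illustration of the difference, the following sketch contrasts hard and soft cluster assignment. It assumes Python with numpy and scikit-learn; because fuzzy k-means is not part of scikit-learn, the fuzzy memberships are computed by hand using the standard fuzzy c-means membership formula, and the toy points are invented for illustration.

# A minimal sketch contrasting hard (k-means) and soft (fuzzy) cluster
# assignment. The fuzzy memberships are computed by hand below, since
# fuzzy k-means is not included in scikit-learn.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9], [3.0, 3.0]])

# Hard assignment: each point belongs to exactly one group.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print("hard labels:", km.labels_)

# Soft assignment: a membership weight in every cluster, using the
# standard fuzzy c-means membership formula with fuzzifier m = 2.
m = 2.0
dist = np.linalg.norm(points[:, None, :] - km.cluster_centers_[None, :, :], axis=2)
dist = np.clip(dist, 1e-9, None)          # avoid division by zero
inv = dist ** (-2.0 / (m - 1.0))
memberships = inv / inv.sum(axis=1, keepdims=True)
print("fuzzy memberships:\n", memberships)

Here the point at (3.0, 3.0), which sits between the two natural clusters, receives a hard label from k-means but a split membership from the fuzzy formulation.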
Providing a sufficient sample size is also important when building statistical models for predictive purposes. In many cases, only a small dataset is necessary to train a machine learning model. For example, Bayesian classification techniques often don't require massive datasets to build adequate predictive models. In some cases, using larger and larger datasets to train machine learning systems may ultimately be a waste of time and resources. Never use a distributed-processing approach unless absolutely necessary.
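To make the sample-size point concrete, here is a minimal sketch of a naive Bayes classifier trained on just four labeled examples. It assumes Python with scikit-learn, and the toy messages and labels are invented for illustration.

# A naive Bayes text classifier trained on a handful of examples can
# already generalize to unseen messages.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = [
    "win a free prize now", "cheap meds limited offer",    # spam
    "meeting moved to friday", "lunch with the team today" # ham
]
train_labels = ["spam", "spam", "ham", "ham"]

# Convert each message into word-count features.
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(train_texts)

model = MultinomialNB()
model.fit(features, train_labels)

test = vectorizer.transform(["free prize offer", "team meeting friday"])
print(model.predict(test))  # expected: ['spam' 'ham']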
The bias-variance trade-off is another fundamental source of pitfalls when considering a machine learning approach. Imagine that we have created a linear regression model. Our data can be plotted on two axes, and our model attempts to describe the relationship between the two variables, assuming that their error components are independent of one another. When building a regression line, our linear model attempts to fit a single straight line through the data points. Not all of our data points will sit on this line (in fact, perhaps very few or none will). This model has a certain type of bias problem: most points won't touch the regression line, and new data applied to the model won't appear on the line either. However, the variance of predicted values will likely be low: for each new value, the distance from the regression line may be quite small. In other words, nearly every predicted value will be incorrect, but each will be incorrect only by a small amount.
A more complex model might produce a curve that touches every point explicitly. However, this decrease in bias means that the model is tightly coupled to the data it was trained on. The bias of the model is very low; it matches the observed data incredibly well, but new values may not fit the model well at all. Consider how effective a predictive model will be for the use cases you face. In some cases, variance can be minimized by using a more highly biased model.
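The trade-off can be demonstrated numerically. The following sketch, which assumes Python with numpy and uses synthetic data, fits both a straight line (high bias, low variance) and a degree-9 polynomial (low bias, high variance) to the same noisy linear data and compares their errors on held-out points.

# Bias versus variance on noisy linear data: a straight line against a
# degree-9 polynomial that can interpolate all ten training points.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + rng.normal(0, 0.2, size=10)    # truth: y = 2x + noise
x_test = np.linspace(0, 1, 100)
y_test = 2 * x_test + rng.normal(0, 0.2, size=100)

for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_err:.4f}, test MSE {test_err:.4f}")

The degree-9 fit nearly interpolates the training points (training error close to zero) but typically generalizes worse than the simpler, more biased line.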
Bayesian Classification
If you have an email address, chances are that you are already participating in a massive, worldwide, cyberspace arms race. The battleground is your inbox, and the bad guys are fighting with spam. Without proper spam filtering, email systems would become useless messes full of unwanted advertising. Use cases like spam detection are