categories/levels: problem definition issues, data issues, and algorithm issues.
Each of these categories is briefly introduced and then described in detail in
subsequent sections.
Problem definition issues occur when one has insufficient information to prop-
erly define the learning problem. This includes the situation when there is no
objective way to evaluate the learned knowledge, in which case one cannot learn
an optimal classifier. Unfortunately, issues at the problem definition level are
commonplace. Data issues concern the actual data that is available for learning
and include the problem of absolute rarity, where there are too few examples
associated with one or more classes to learn those classes effectively. Finally, algo-
rithm issues occur when there are inadequacies in a learning algorithm that make
it perform poorly for imbalanced data. A simple example involves applying an
algorithm designed to optimize accuracy to an imbalanced learning problem
where it is more important to classify minority class examples correctly than
to classify majority class examples correctly.
2.3.1 Problem-Definition-Level Issues
A key task in any problem-solving activity is to understand the problem. As just
one example, it is critically important for computer programmers to understand
their customer's requirements before designing and then implementing a software
solution. Similarly, in data mining, it is critical for the data-mining practitioner
to understand the problem and the user requirements. For classification tasks,
this includes understanding how the performance of the generated classifier will
be judged. Without such an understanding, it will be impossible to design an
optimal or near-optimal classifier. While this need for evaluation information
applies to all data-mining problems, it is particularly important for problems
with class imbalance. In these cases, as noted earlier, the costs of errors are often
asymmetric and quite skewed, which violates the default assumption of most
classifier induction algorithms, which is that errors have uniform cost and thus
accuracy should be optimized. The impact of using accuracy as an evaluation
metric in the presence of class imbalance is well known — in most cases, poor
minority class performance is traded off for improved majority class performance.
This makes sense from an optimization standpoint, as overall accuracy is the
weighted average of the accuracies associated with each class, where the weights
are based on the proportion of training examples belonging to each class. This
effect was clearly evident in Figure 2.1, which showed that minority class
examples are classified with much lower accuracy than majority class examples. What was
not shown in Figure 2.1, but is shown by the underlying data [4], is that minority
class predictions occur much less frequently than majority class predictions, even
after factoring in the degree of class imbalance.
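As a concrete illustration of this weighting effect, the following short Python sketch
computes overall accuracy from a set of class proportions and per-class accuracies;
the numbers are purely hypothetical and are not taken from Figure 2.1 or from [4].

# Overall accuracy is the class-proportion-weighted average of the
# per-class accuracies. The proportions and accuracies are hypothetical.
class_proportions = {"majority": 0.95, "minority": 0.05}
per_class_accuracy = {"majority": 0.99, "minority": 0.20}

overall_accuracy = sum(class_proportions[c] * per_class_accuracy[c]
                       for c in class_proportions)

# 0.95 * 0.99 + 0.05 * 0.20 = 0.9505: overall accuracy looks high even
# though only 20% of minority class examples are classified correctly.
print(round(overall_accuracy, 4))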
Accurate classifier evaluation information, if it exists, should be passed to
the classifier induction algorithm. This information can take many forms, one of the
simplest being a cost matrix, as sketched below. If this information is available, then it is the
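For illustration, the following minimal Python sketch shows a cost matrix for a
two-class problem and the minimum-expected-cost decision rule it implies; the cost
values and the decision function are hypothetical, not a specific method from this
chapter.

# cost[i][j] is the cost of predicting class j when the true class is i
# (0 = majority class, 1 = minority class). The values are hypothetical.
cost = [[0.0, 1.0],    # true majority: correct = 0, false positive  = 1
        [10.0, 0.0]]   # true minority: false negative = 10, correct = 0

def min_cost_prediction(p_minority):
    # Choose the class with the lowest expected cost given P(minority | x).
    p = [1.0 - p_minority, p_minority]
    expected = [sum(p[i] * cost[i][j] for i in range(2)) for j in range(2)]
    return min(range(2), key=lambda j: expected[j])

# With a 10:1 false-negative cost, the decision threshold drops from 0.5 to
# 1/11, so an example with P(minority | x) = 0.15 is assigned to class 1.
print(min_cost_prediction(0.15))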