mining [24, 25] provides this capability by allowing one to specify a uniform
weight to represent per-item profit and a transaction weight to represent a
quantity value. Objective-oriented association rule mining [26] methods, which
make it possible to measure how well an association rule meets a user's
objective, can be used to find association rules in a medical dataset where the
only treatments considered are those that have minimal side effects and meet a
minimum level of effectiveness.
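The weighting idea described above (a per-item profit combined with a per-transaction quantity) can be illustrated with a minimal sketch. The item names, profit values, and the helper functions `itemset_utility` and `support` are hypothetical and chosen only for illustration; they are not taken from the cited methods [24, 25]. The sketch shows that two itemsets with identical support can differ greatly once profit and quantity are taken into account.

```python
from typing import Dict, Iterable, List

# Hypothetical per-item profit (the item weight); values are made up.
profit: Dict[str, float] = {"bread": 1.0, "milk": 0.5, "caviar": 40.0}

# Each transaction maps an item to the quantity purchased
# (the transaction weight). Data is illustrative only.
transactions: List[Dict[str, int]] = [
    {"bread": 2, "milk": 1},
    {"bread": 1, "caviar": 1},
    {"milk": 4},
    {"bread": 1, "milk": 2, "caviar": 2},
]


def support(itemset: Iterable[str], txns: List[Dict[str, int]]) -> float:
    """Fraction of transactions that contain every item in the itemset."""
    items = set(itemset)
    hits = sum(1 for t in txns if all(i in t for i in items))
    return hits / len(txns)


def itemset_utility(itemset: Iterable[str],
                    txns: List[Dict[str, int]],
                    item_profit: Dict[str, float]) -> float:
    """Sum of quantity * profit over the itemset's items, taken over
    every transaction that contains the whole itemset."""
    items = set(itemset)
    total = 0.0
    for t in txns:
        if all(i in t for i in items):
            total += sum(t[i] * item_profit[i] for i in items)
    return total


common = {"bread", "milk"}
valuable = {"bread", "caviar"}

# Both itemsets appear in 2 of 4 transactions, so support cannot
# distinguish them, but their utilities differ substantially.
print(support(common, transactions), itemset_utility(common, transactions, profit))
print(support(valuable, transactions), itemset_utility(valuable, transactions, profit))
```

A frequency-only measure would rank the two itemsets equally, whereas the weighted measure favors the one that contributes more profit.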
2.4.1.2 Redefine the Problem One way to deal with a difficult problem is to
convert it into a simpler problem. The simpler problem need not be equivalent
to the original one; the loss of equivalence may be outweighed by the
improvement in results. This topic has received very little attention in the
research community, most likely because it is not viewed as a research-oriented
solution and is highly domain specific. Nonetheless, it is a valid approach
that should be considered. One relatively
general method for redefining a learning problem with imbalanced data is to
focus on a subdomain or partition of the data, where the degree of imbalance
is lessened. As long as this subdomain or partition is easily identified, this is
a viable strategy. It may also be a more reasonable strategy than removing the
imbalance artificially via sampling. As a simple example, in medical diagnosis,
one could restrict the population to people over 90 years of age, especially if
the targeted disease tends to be more common in the aged. Even if the disease
occurs much more rarely in the young, using the entire population for the study
could complicate matters if the people under 90, because of their much larger
numbers, collectively contribute more examples of the disease. Thus, the strategy
is to find a subdomain where the data is less imbalanced, but where the
subdomain is still of sufficient interest. Another strategy is to group
similar rare classes together and then simplify the problem by predicting only
this “super-class.”
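Both redefinition strategies described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the record counts, the class names, and the helpers `imbalance_ratio` and `merge_rare_classes` are hypothetical and are not taken from the source. The imbalance ratio used here (largest class count divided by smallest class count) is one simple way to quantify the degree of imbalance.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def imbalance_ratio(labels: Iterable[str]) -> float:
    """Count of the most common class divided by the count of the rarest."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def merge_rare_classes(labels: Iterable[str],
                       rare_classes: Set[str],
                       super_label: str) -> List[str]:
    """Relabel every class in rare_classes as a single super-class."""
    return [super_label if y in rare_classes else y for y in labels]


# Strategy 1: restrict to a subdomain. Hypothetical (age, label) records in
# which people under 90 contribute more disease cases in total, yet the
# disease is far more common, relatively, among people over 90.
records: List[Tuple[int, str]] = (
    [(40, "healthy")] * 9000 + [(40, "disease")] * 50
    + [(92, "healthy")] * 60 + [(92, "disease")] * 30
)
full_ratio = imbalance_ratio(label for _, label in records)
sub_ratio = imbalance_ratio(label for age, label in records if age > 90)
print(full_ratio, sub_ratio)  # the subdomain is far less imbalanced

# Strategy 2: group similar rare classes into one super-class.
multi = ["normal"] * 95 + ["fault_a"] * 2 + ["fault_b"] * 3
merged = merge_rare_classes(multi, {"fault_a", "fault_b"}, "fault")
print(imbalance_ratio(multi), imbalance_ratio(merged))
```

In both cases the learning problem changes: a model trained on the subdomain says nothing about people under 90, and a model trained on the merged labels cannot distinguish the original rare classes. Whether that trade is acceptable depends on the domain.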
2.4.2 Data-Level Methods
The main data-level issue identified earlier involves absolute rarity and a lack
of sufficient examples belonging to rare classes and, in some cases, to the rare
cases that may reside in either a rare or a common class. This issue is very
difficult to address, but this section describes methods for doing so. It also
describes methods for dealing with relative rarity (the standard class
imbalance problem), even though, as we shall discuss, we believe that issues
with relative rarity are best addressed at the algorithm level.
2.4.2.1 Active Learning and Other Information Acquisition Strategies The
most direct way of addressing the issue of absolute rarity is to acquire
additional labeled training data. Randomly acquiring additional labeled
training data will be helpful, and there are heuristic methods to determine
whether the projected improvement in classification performance warrants the
cost of obtaining more training data and, if so, how many additional training
examples should be acquired [27]. But a more efficient strategy is to
preferentially acquire data from the rare