FOUNDATIONS OF IMBALANCED LEARNING - Imbalanced Learning: Foundations, Algorithms, and Applications


mining [24, 25] provides this capability by allowing one to specify a uniform

weight to represent per-item profit and a transaction weight to represent a quantity

value. Objective-oriented association rule mining [26] methods, which make

it possible to measure how well an association rule meets a user's objective, can

be used to find association rules in a medical dataset where only treatments that

have minimal side effects and minimum levels of effectiveness are considered.

2.4.1.2 Redefine the Problem

One way to deal with a difficult problem is to

convert it into a simpler problem. The fact that the new problem is not equivalent

to the original may be outweighed by the improvement in results. This topic has

received very little attention in the research community, most likely because

it is not viewed as a research-oriented solution and is highly domain specific.

Nonetheless, this is a valid approach that should be considered. One relatively

general method for redefining a learning problem with imbalanced data is to

focus on a subdomain or partition of the data, where the degree of imbalance

is lessened. As long as this subdomain or partition is easily identified, this is

a viable strategy. It may also be a more reasonable strategy than removing the

imbalance artificially via sampling. As a simple example, in medical diagnosis,

one could restrict the population to people over 90 years of age, especially if

the targeted disease tends to be more common in the aged. Even if the disease

occurs much more rarely in the young, using the entire population for the study

could complicate matters if the people under 90, because of their much larger

numbers, collectively contribute more examples of the disease. Thus, the strategy

is to find a subdomain where the data is less imbalanced, but where the subdomain

is still of sufficient interest. Another strategy is to group

similar rare classes together and then simplify the problem by predicting only

this “super-class.”
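
Both redefinitions described above can be illustrated with a minimal sketch. The toy population, the label names, the age cutoff, and the helper functions below are hypothetical and chosen only for illustration; they are not taken from the text or from any particular library.

```python
from collections import Counter

def minority_fraction(labels, positive="disease"):
    """Fraction of examples that carry the rare (positive) label."""
    return sum(1 for y in labels if y == positive) / len(labels)

def restrict_to_subdomain(records, predicate):
    """Keep only the records that fall inside the chosen subdomain."""
    return [r for r in records if predicate(r)]

def merge_rare_classes(labels, min_count, super_class="rare"):
    """Relabel every class with fewer than min_count examples as one super-class."""
    counts = Counter(labels)
    return [y if counts[y] >= min_count else super_class for y in labels]

# Hypothetical toy population of (age, label) pairs. The younger group is
# far larger, so it contributes more disease cases in absolute terms even
# though the disease is much rarer within it.
records = (
    [(30, "healthy")] * 970 + [(30, "disease")] * 30
    + [(92, "healthy")] * 80 + [(92, "disease")] * 20
)

full = minority_fraction([label for _, label in records])
older = restrict_to_subdomain(records, lambda r: r[0] > 90)
sub = minority_fraction([label for _, label in older])
print(f"whole population: {full:.3f}  over-90 subdomain: {sub:.3f}")

# Grouping similar rare classes: "b" and "c" fall below the (assumed)
# threshold of 5 examples and are predicted as a single super-class.
merged = merge_rare_classes(["a"] * 50 + ["b"] * 3 + ["c"] * 2, min_count=5)
print(Counter(merged))
```

In this toy setting the positive class makes up roughly 4.5% of the whole population but 20% of the over-90 subdomain, which is the kind of reduction in imbalance the strategy relies on. Whether the subdomain remains of sufficient interest is a domain judgment that the code cannot make.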

2.4.2 Data-Level Methods

The main data-level issue identified earlier involves absolute rarity and a lack

of sufficient examples belonging to rare classes and, in some cases, to the rare

cases that may reside in either a rare or a common class. This is a very difficult

issue to address, but methods for doing so are described in this section. This

section also describes methods for dealing with relative rarity (the standard class

imbalance problem), even though, as we shall discuss, we believe that issues

with relative rarity are best addressed at the algorithm level.
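
The distinction between absolute and relative rarity can be made concrete with a small sketch. The function name and the `min_examples` threshold are assumptions made for illustration; the text does not prescribe a specific cutoff for when a class counts as absolutely rare.

```python
from collections import Counter

def rarity_report(labels, min_examples=50):
    """Per-class count (absolute), fraction (relative), and a flag for
    classes with fewer than min_examples examples (an assumed cutoff)."""
    counts = Counter(labels)
    total = len(labels)
    return {
        cls: {
            "count": n,
            "fraction": n / total,
            "absolutely_rare": n < min_examples,
        }
        for cls, n in counts.items()
    }

# Two hypothetical datasets with the same 1% class ratio.
big = rarity_report(["neg"] * 99_000 + ["pos"] * 1_000)
small = rarity_report(["neg"] * 990 + ["pos"] * 10)

print(big["pos"])    # same fraction, but 1,000 positive examples
print(small["pos"])  # same fraction, but only 10 positive examples
```

Both datasets show the same relative rarity, yet only the smaller one also suffers from absolute rarity, which is the case where acquiring more data matters most.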

2.4.2.1 Active Learning and Other Information Acquisition Strategies

The most direct way of addressing the issue of absolute rarity is to acquire additional

labeled training data. Randomly acquiring additional labeled training data

will be helpful and there are heuristic methods to determine whether the projected

improvement in classification performance warrants the cost of obtaining more

training data — and how many additional training examples should be acquired

[27]. But a more efficient strategy is to preferentially acquire data from the rare
