Instance Selection - Data Preprocessing in Data Mining


contains only cases of the same class as instance $x_i$. The authors define two properties, reachability and coverage:

$$\mathrm{Coverage}(X_i) = \{ X_j \in TR : X_i \in L(X_j) \}, \tag{8.4}$$

$$\mathrm{Reachability}(X_i) = \{ X_j \in TR : X_j \in L(X_i) \}, \tag{8.5}$$

where $L(X)$ is the local set of $X$.
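As an illustration of definitions (8.4) and (8.5), the following minimal Python sketch computes local sets, reachability and coverage on a toy one-dimensional data set. The function names, the Euclidean distance, and the choice to count an instance as a member of its own local set are assumptions made for the example; they are not taken from the original description.

```python
import math

def local_set(i, X, y):
    """Indices inside the largest same-class hypersphere centred on X[i]."""
    # The radius is the distance to the nearest enemy (closest instance of
    # another class); everything strictly closer shares the class of X[i].
    enemy_dists = [math.dist(X[i], X[j]) for j in range(len(X)) if y[j] != y[i]]
    radius = min(enemy_dists) if enemy_dists else math.inf
    return {j for j in range(len(X)) if math.dist(X[i], X[j]) < radius}

def reachability(i, X, y):
    # Instances that lie in the local set of X[i].
    return local_set(i, X, y)

def coverage(i, X, y):
    # Instances whose local set contains X[i].
    return {j for j in range(len(X)) if i in local_set(j, X, y)}

# Toy data (assumed for illustration): three class-0 points, two class-1 points.
X = [(0.0,), (1.0,), (3.0,), (4.0,), (8.0,)]
y = [0, 0, 0, 1, 1]

for i in range(len(X)):
    print(i, sorted(reachability(i, X, y)), sorted(coverage(i, X, y)))
```

On this toy data, instance 2 (the class-0 point closest to the other class) has a reachability set containing only itself, yet it is covered by every instance of its class.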

In the first phase ICF uses the ENN algorithm to remove the noise from the training set. In the second phase the ICF algorithm removes each instance $X_i$ for which $\mathrm{Reachability}(X_i)$ is bigger than $\mathrm{Coverage}(X_i)$, that is, for which the reachability set contains more instances than the coverage set. This procedure is repeated for each instance in TR. After that ICF recalculates the reachability and coverage properties and restarts the second phase (as long as any progress is observed).
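The second phase can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the ENN noise filter of the first phase is assumed to have been applied already, distances are Euclidean, flagged instances are removed in one batch per pass, and the names `local_set` and `icf_second_phase` are hypothetical.

```python
import math

def local_set(i, idx, X, y):
    """Members of idx strictly closer to X[i] than its nearest enemy in idx."""
    enemies = [math.dist(X[i], X[j]) for j in idx if y[j] != y[i]]
    radius = min(enemies) if enemies else math.inf
    return {j for j in idx if math.dist(X[i], X[j]) < radius}

def icf_second_phase(X, y):
    """Drop instances whose reachability set is larger than their coverage set."""
    kept = set(range(len(X)))
    progress = True
    while progress:
        # Recompute both properties on the current subset.
        reach = {i: local_set(i, kept, X, y) for i in kept}
        cover = {i: {j for j in kept if i in reach[j]} for i in kept}
        # Flag first, then remove, so every decision uses the same snapshot.
        flagged = {i for i in kept if len(reach[i]) > len(cover[i])}
        kept -= flagged
        progress = bool(flagged)  # restart only if something was removed
    return sorted(kept)

# Toy data (assumed for illustration).
X = [(0.0,), (1.0,), (3.0,), (4.0,), (8.0,)]
y = [0, 0, 0, 1, 1]
print(icf_second_phase(X, y))
```

On this toy data the three instances far from the class boundary are dropped in the first pass, and the second pass finds nothing further to remove, so only the two instances facing each other across the boundary survive.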

• Hit-Miss Network Algorithms (HMN) [116]: Hit-Miss Networks are directed graphs where the points are the instances in TR and the edges are connections between an instance and its nearest neighbor from each class. The edge connecting an instance with a neighbor of its own class is called a “Hit”, and the rest of its edges are called “Miss”.

The HMNC method builds the network and removes all nodes that are not connected. The remaining nodes are used to build the final set S.
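A hit-miss network and the HMNC step can be sketched as below. Since every instance sends out one edge per class, the sketch reads "not connected" as "receives no incoming edge"; that reading, the Euclidean distance, and the names `hit_miss_network` and `hmnc` are assumptions made for illustration.

```python
import math

def hit_miss_network(X, y):
    """Return the directed edges (source, target, kind) of a hit-miss network."""
    edges = []
    for i in range(len(X)):
        for label in sorted(set(y)):
            # Nearest neighbour of X[i] within this class, excluding itself.
            candidates = [j for j in range(len(X)) if j != i and y[j] == label]
            if not candidates:
                continue
            j = min(candidates, key=lambda c: math.dist(X[i], X[c]))
            edges.append((i, j, "hit" if label == y[i] else "miss"))
    return edges

def hmnc(X, y):
    """Keep only the nodes that receive at least one edge (assumed reading)."""
    return sorted({target for _, target, _ in hit_miss_network(X, y)})

# Toy data (assumed): the last point is a class-0 outlier far from everything.
X = [(0.0,), (1.0,), (3.0,), (4.0,), (8.0,), (-10.0,)]
y = [0, 0, 0, 1, 1, 0]
print(hmnc(X, y))
```

In this example the outlier at -10 still points to its nearest neighbors, but no instance points back to it, so it is the only node removed.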

Hit Miss Network Edition (HMNE): Starting from the output of HMNC, HMNE applies these four rules to prune the network:

1. Every point with more “Miss” edges than “Hit” edges is flagged for removal.

2. If the number of non-flagged points of a class is too low, points with at least one “Hit” from those classes are unflagged.

3. If there are more than three classes, some points of each class with a low number of “Miss” edges are unflagged.

4. Points which are the “Hit” of 25% or more of the instances of a class are unflagged.

Finally, the instances which remain flagged for removal are deleted from the network, in order to build the final set S.
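Rules 1 and 4 are specific enough to sketch directly; rules 2 and 3 depend on thresholds ("too low", "some points") that the text does not quantify, so they are left out here. The sketch below assumes that "Hit" and "Miss" edges are counted as incoming edges, that the class in rule 4 is the point's own class, and that the input is the raw training set rather than the HMNC output. The function names are hypothetical.

```python
import math
from collections import Counter

def hit_miss_degrees(X, y):
    """Count incoming hit and miss edges for every instance."""
    hits, misses = Counter(), Counter()
    for i in range(len(X)):
        for label in set(y):
            cands = [j for j in range(len(X)) if j != i and y[j] == label]
            if cands:
                j = min(cands, key=lambda c: math.dist(X[i], X[c]))
                (hits if label == y[i] else misses)[j] += 1
    return hits, misses

def hmne_rules_1_and_4(X, y, hit_share=0.25):
    """Apply only rule 1 (flag) and rule 4 (unflag); return the kept indices."""
    hits, misses = hit_miss_degrees(X, y)
    class_size = Counter(y)
    # Rule 1: flag points that receive more miss edges than hit edges.
    flagged = {i for i in range(len(X)) if misses[i] > hits[i]}
    # Rule 4: unflag points that are the hit of at least 25% of their class.
    flagged -= {i for i in flagged if hits[i] >= hit_share * class_size[y[i]]}
    return sorted(set(range(len(X))) - flagged)

# Toy data (assumed for illustration).
X = [(0.0,), (1.0,), (3.0,), (4.0,), (8.0,), (-10.0,)]
y = [0, 0, 0, 1, 1, 0]
print(hmne_rules_1_and_4(X, y))
```

Here rule 1 flags instances 2 and 3, the two points nearest the class boundary. Rule 4 then rescues instance 3, because it is the nearest same-class neighbor of one of the two class-1 instances.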

Hit Miss Network Edition Iterative (HMNEI): The HMNE method can be applied iteratively until the generalization accuracy of 1-NN on the original training set, using the reduced set as prototypes, decreases.
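The stopping criterion can be sketched as a generic loop. In this illustration `edit_step` is a hypothetical callable standing in for one HMNE pass: it takes the current list of indices and returns a reduced list. Estimating accuracy while excluding each instance from its own neighbor search is also an assumption.

```python
import math

def one_nn_accuracy(subset, X, y):
    """Accuracy of 1-NN on all of (X, y) using only `subset` as prototypes."""
    correct = 0
    for i in range(len(X)):
        # An instance may not vote for itself (assumed leave-one-out style).
        protos = [j for j in subset if j != i] or list(subset)
        nearest = min(protos, key=lambda j: math.dist(X[i], X[j]))
        correct += y[nearest] == y[i]
    return correct / len(X)

def iterate_editing(X, y, edit_step):
    """Apply edit_step repeatedly; stop once 1-NN accuracy decreases."""
    current = list(range(len(X)))
    best_acc = one_nn_accuracy(current, X, y)
    while True:
        candidate = edit_step(current, X, y)
        if not candidate or len(candidate) == len(current):
            return current  # nothing left to remove
        acc = one_nn_accuracy(candidate, X, y)
        if acc < best_acc:
            return current  # accuracy dropped: keep the previous subset
        current, best_acc = candidate, acc

# Toy data and a toy editing step (both assumed): drop the last kept index.
X = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
y = [0, 0, 0, 1, 1, 1]
print(iterate_editing(X, y, lambda idx, X, y: idx[:-1]))
```

With this toy step the loop accepts the first reduction, then rejects the second because the only remaining class-1 prototype can no longer be classified correctly, so the previous subset is returned.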

8.4.3.6 Mixed+Wrapper

• Encoding Length Family (Explore) [18]: Cameron-Jones used an encoding length heuristic to determine how good the subset S is at describing TR. His algorithms use a cost function defined by:

$$\mathrm{COST}(s, n, x) = F(s, n) + s \log_2(\Omega) + F(x, n - s) + x \log_2(\Omega - 1) \tag{8.6}$$
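To make the cost in (8.6) concrete, the following Python sketch evaluates it for given counts. It assumes that s is the size of S, n the size of TR, x the number of instances of TR misclassified using S, and that the remaining symbol is the number of classes. The excerpt does not define F at this point; the sketch assumes the form usually given for the encoding length heuristic, F(m, n) = log*(sum over j from 0 to m of the binomial coefficient C(n, j)), where log* sums the positive terms of log2, log2 log2, and so on. All function names are hypothetical.

```python
import math

def log_star(v):
    """Sum of the positive terms of log2(v), log2(log2(v)), ..."""
    total = 0.0
    v = math.log2(v)
    while v > 0:
        total += v
        v = math.log2(v)
    return total

def F(m, n):
    """Assumed cost of encoding which m of the n available instances are chosen."""
    return log_star(sum(math.comb(n, j) for j in range(m + 1)))

def cost(s, n, x, n_classes):
    """Encoding length cost for a subset of size s with x misclassifications."""
    return (F(s, n) + s * math.log2(n_classes)
            + F(x, n - s) + x * math.log2(n_classes - 1))

# Example values (assumed): compare two subsets of a 100-instance, 3-class set.
print(cost(10, 100, 5, 3))
print(cost(20, 100, 5, 3))
```

Under these assumptions a larger subset with the same number of errors always costs more, which is what lets the heuristic trade subset size against classification errors.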
