that the split will reduce the overall error rate by only 5% or less, the split
is regarded as invalid). In an NBTree, it is enough to examine only the
first proposed split in order to conclude that there are no valid splits, if
the one examined is invalid. This follows since in an NBTree, the attribute
according to which the split is done is the one that maximizes the utility
measure, which is strictly increasing with the reduction in error. If the
split on the selected attribute cannot reduce the error by more than 5%,
then no other split can.
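The validity check described above can be sketched as follows. This is a minimal illustration, not the NBTree implementation: the function names are hypothetical, the 5% threshold is interpreted here as a relative reduction in error (the text does not say whether the reduction is relative or absolute), and the post-split error of each candidate attribute is assumed to be already estimated.

```python
from typing import Dict, Optional


def is_valid_split(error_before: float, error_after: float,
                   min_reduction: float = 0.05) -> bool:
    """Return True if the split reduces the error by more than min_reduction.

    Assumption: the reduction is measured relative to the pre-split error.
    """
    if error_before <= 0.0:
        # Nothing left to improve, so no split can be valid.
        return False
    relative_reduction = (error_before - error_after) / error_before
    return relative_reduction > min_reduction


def best_valid_split(error_before: float,
                     error_after_by_attribute: Dict[str, float],
                     min_reduction: float = 0.05) -> Optional[str]:
    """Check only the best candidate split.

    Because the best candidate gives the largest error reduction, if it is
    invalid then every other candidate is invalid as well.
    """
    if not error_after_by_attribute:
        return None
    best = min(error_after_by_attribute, key=error_after_by_attribute.get)
    if is_valid_split(error_before, error_after_by_attribute[best], min_reduction):
        return best
    return None
```

For example, with a pre-split error of 0.20, a candidate that lowers the error to 0.18 passes the check, while one that lowers it only to 0.195 does not.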
We suggest a new split validation procedure. In very general terms, a
split according to the values of a certain attribute is regarded as invalid if
the sub-spaces that result from this split are similar enough to be grouped
together.
15.3 The Contrasted Population Miner (CPOM) Algorithm
This section presents the CPOM, which splits nodes according to a novel
splitting rule, termed grouped gain ratio. Generally speaking, this splitting
rule is based on the gain ratio criterion [Quinlan (1993)], followed by a
grouping heuristic. The gain ratio criterion selects a single attribute from
the set of candidate attributes, and the grouping heuristic thereafter groups
together sub-spaces which correspond to different values of the selected
attribute.
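The attribute-selection step can be illustrated with the standard gain ratio, i.e., information gain divided by split information. The sketch below covers only that first step; the grouping heuristic is not specified in this passage and is therefore not shown. The function names and the representation of instances as dictionaries are assumptions made for the example.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def entropy(items: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of the empirical distribution of items."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())


def gain_ratio(values: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Gain ratio of splitting labels by the given attribute values."""
    n = len(labels)
    partitions: Dict[Hashable, List[Hashable]] = {}
    for v, y in zip(values, labels):
        partitions.setdefault(v, []).append(y)
    # Expected entropy of the class labels after the split.
    remainder = sum(len(p) / n * entropy(p) for p in partitions.values())
    info_gain = entropy(labels) - remainder
    # Split information: entropy of the attribute values themselves.
    split_info = entropy(values)
    return info_gain / split_info if split_info > 0 else 0.0


def select_attribute(rows: Sequence[Dict[str, Hashable]],
                     labels: Sequence[Hashable],
                     attributes: Sequence[str]) -> str:
    """Return the candidate attribute with the highest gain ratio."""
    return max(attributes,
               key=lambda a: gain_ratio([r[a] for r in rows], labels))
```

With four instances in which attribute "a" determines the class and attribute "b" is independent of it, select_attribute picks "a".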
15.3.1 CPOM Outline
CPOM uses two stopping rules. First, the algorithm compares the number
of training instances to a predefined ratio of the number of instances in the
original training set. If the subset is too small, CPOM stops (since it is
undesirable to learn from too small a training subset). Secondly, CPOM
compares the accuracy estimation of a single classifier to a pre-defined
threshold. It stops if the accuracy estimation exceeds the threshold (if a
single classifier is accurate enough, there is no point in splitting
further). Therefore, in addition to the inputs in Figure 15.1, CPOM must
receive two parameters: β, the minimal ratio of training instances, and acc,
the maximal accuracy estimation for which a split is still considered.
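The two stopping rules can be sketched as a single predicate. This is an illustration under stated assumptions: the function and argument names are hypothetical, and the strict comparisons at the boundaries are a choice made here, since the text does not say how ties are handled.

```python
def should_stop(n_subset: int, n_original: int, accuracy_estimate: float,
                beta: float, acc: float) -> bool:
    """Return True if CPOM-style splitting should stop at this node.

    beta: minimal ratio of training instances (relative to the original set).
    acc:  maximal accuracy estimation that still leads to split consideration.
    """
    # Rule 1: the training subset is too small to learn from.
    too_small = n_subset < beta * n_original
    # Rule 2: a single classifier is already accurate enough.
    accurate_enough = accuracy_estimate > acc
    return too_small or accurate_enough
```

For instance, with beta = 0.1 and acc = 0.9 on an original training set of 1000 instances, a node holding 50 instances stops on the first rule, and a node holding 500 instances with an estimated accuracy of 0.95 stops on the second.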
CPOM's split validation procedure is directly based on grouped gain
ratio. The novel rule is described in detail in the following subsection;
however, in general terms, the rule returns the splitting attribute and a set
of descendant nodes. The nodes represent sub-spaces of X that are believed