Efficient storage methods — enabling the algorithm to handle many
records. For instance, Shafer et al. (1996) presented SPRINT, which
constructs an attribute-list data structure.
Reducing the algorithm's search space — for instance, the PUBLIC
algorithm [Rastogi and Shim (2000)] integrates the growing and pruning
of decision trees by using the Minimum Description Length (MDL) approach
in order to reduce the computational complexity.
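A minimal sketch of the attribute-list idea behind SPRINT: each attribute gets its own list of (value, class label, record id) entries, sorted once by value so that candidate split points can be scanned at every node without re-sorting. The field names and data here are illustrative assumptions, not SPRINT's actual on-disk layout:

```python
# Sketch of a SPRINT-style attribute list (illustrative, not the
# actual SPRINT implementation): one pre-sorted list per attribute.

records = [
    {"rid": 0, "age": 42, "income": 30, "label": "no"},
    {"rid": 1, "age": 25, "income": 60, "label": "yes"},
    {"rid": 2, "age": 37, "income": 45, "label": "yes"},
]

def build_attribute_list(records, attr):
    # Each entry: (attribute value, class label, record id),
    # sorted once by value and reused when evaluating splits.
    entries = [(r[attr], r["label"], r["rid"]) for r in records]
    return sorted(entries)

age_list = build_attribute_list(records, "age")
# age_list == [(25, 'yes', 1), (37, 'yes', 2), (42, 'no', 0)]
```

Because each list stays sorted, scanning it in order enumerates all candidate numeric split thresholds for that attribute in a single pass.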
4.6 Robustness
The ability of a model to handle noise or data with missing values
and still make correct predictions is called robustness. Different decision
tree algorithms have different robustness levels. To estimate the
robustness of a classification tree, it is common to train one tree on a clean
training set and another tree on a noisy training set. The
noisy training set is usually the clean training set to which some artificial
noisy instances have been added. The robustness level is measured as the
difference in accuracy between these two trees.
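The procedure above can be sketched as follows, here using scikit-learn and random label flipping as the artificial noise; the dataset, noise rate, and random seeds are illustrative assumptions:

```python
# Sketch: estimate robustness as the accuracy drop when a tree is
# trained on a noisy copy of the training set (dataset and 15%
# label-flip noise rate are illustrative assumptions).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Tree trained on the clean training set.
clean_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
clean_acc = accuracy_score(y_test, clean_tree.predict(X_test))

# Noisy copy of the training set: flip 15% of the labels at random.
rng = np.random.default_rng(0)
y_noisy = y_train.copy()
flip = rng.choice(len(y_noisy), size=int(0.15 * len(y_noisy)), replace=False)
y_noisy[flip] = 1 - y_noisy[flip]

noisy_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_noisy)
noisy_acc = accuracy_score(y_test, noisy_tree.predict(X_test))

# Smaller gap on the same clean test set = more robust algorithm.
robustness_gap = clean_acc - noisy_acc
```

Both trees are evaluated on the same clean test set, so the gap isolates the effect of the injected noise rather than differences in the evaluation data.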
4.7 Stability
Formally, stability of a classification algorithm is defined as the degree to
which an algorithm generates repeatable results, given different batches of
data from the same process. In mathematical terms, stability is the expected
agreement between two models on a random sample of the original data,
where agreement on a specific example means that both models assign it to
the same class. The instability problem raises questions about the validity
of a particular tree provided as the output of a decision-tree algorithm.
Users view the learning algorithm as an oracle, and it is difficult to
trust an oracle that says something radically different each time you make
a slight change in the data.
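The agreement-based definition of stability can be sketched directly: train two trees on different batches drawn from the same process and measure the fraction of held-out examples that both assign to the same class. The dataset and split sizes are illustrative assumptions:

```python
# Sketch: stability as the expected agreement of two trees trained
# on different batches of data from the same process (dataset and
# splits are illustrative assumptions).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=1)

# Hold out an evaluation set, then split the rest into two batches.
X_rest, X_eval, y_rest, y_eval = train_test_split(
    X, y, test_size=0.3, random_state=1
)
X_a, X_b, y_a, y_b = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=1
)

tree_a = DecisionTreeClassifier(random_state=0).fit(X_a, y_a)
tree_b = DecisionTreeClassifier(random_state=0).fit(X_b, y_b)

# Agreement: fraction of evaluation examples to which both models
# assign the same class (1.0 = perfectly stable on this sample).
agreement = (tree_a.predict(X_eval) == tree_b.predict(X_eval)).mean()
```

Averaging this agreement over many random pairs of batches estimates the expected agreement described above.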
Existing methods of constructing decision trees from data suffer from a
major problem of instability. The symptoms of instability include variations
in the predictive accuracy of the model and in the model's topology.
Instability can be revealed not only by using disjoint sets of data, but even
by replacing a small portion of the training cases, as in the cross-validation
procedure. If an algorithm is unstable, the cross-validation results become
high-variance estimators, which means that the algorithm fails to make
a clear distinction between persistent and random patterns in the data. As