one, the new one replaces the old sub-tree. CVFDT learns a model which is
similar in accuracy to the one that would be learned by reapplying VFDT
to a moving window of examples every time a new example arrives, but
with O(1) complexity per example, as opposed to O(w), where w is the size
of the window.
Black and Hickey (1999) offer a new approach to handling the
aforementioned sub-tasks dealing with drift within incremental learning
methods. Instead of utilizing the time-windowing approach presented thus
far, they employ a new purging mechanism to remove examples that are no
longer valid while retaining valid examples, regardless of age. As a result,
the example base grows, which supports accurate classification. Black and Hickey
describe an algorithm called CD3, which utilizes ID3 with post-pruning,
based on the time-stamp attribute relevance or TSAR approach.
In this approach, the time-stamp is treated as an attribute, and its
value is added as an additional input attribute to the example's description,
later to be used in the induction process. Consequently, if the time-stamp
attribute appears in the decision tree, the implication is that it is relevant
to classification. This, in turn, means that drift has occurred. Routes where
the value of the time-stamp attribute refers to the old period (or periods)
represent invalid rules. When the process is stable for a sufficiently long
period, the time-stamp attribute should not appear in any path of the tree.
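The TSAR idea can be sketched as follows. This is an illustrative assumption, not the CD3 implementation: scikit-learn's DecisionTreeClassifier stands in for ID3, the two-period data set is invented, and the "drift detected" check simply asks whether the time-stamp attribute was used anywhere in the induced tree.

```python
# Sketch of the TSAR approach: treat the time-stamp as an ordinary input
# attribute; if it appears in the induced tree, drift is implied.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Two "periods" with a concept drift between them: in the old period the
# label follows attribute 0, in the new period it follows attribute 1.
X_old = rng.integers(0, 2, size=(200, 2))
y_old = X_old[:, 0]
X_new = rng.integers(0, 2, size=(200, 2))
y_new = X_new[:, 1]

# TSAR: append the time-stamp (here a period id, 0 = old, 1 = new)
# as one more input attribute before induction.
ts = np.concatenate([np.zeros(200), np.ones(200)])
X = np.column_stack([np.vstack([X_old, X_new]), ts])
y = np.concatenate([y_old, y_new])

tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)

# tree_.feature lists the attribute tested at each internal node;
# if the time-stamp attribute (index 2) occurs there, drift has occurred.
timestamp_used = 2 in set(tree.tree_.feature)
print("drift detected:", timestamp_used)
```

Because the same attribute values receive different labels in the two periods, only a split on the time-stamp can separate them, so the fully grown tree ends up testing it.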
The CD3 algorithm sustains a set of examples regarded as valid. This
set, referred to as the current example base, must be updated before another
round of learning can take place. Using invalid rules extracted from the CD3
tree, any example whose description matches (i.e. is covered by) that of an
invalid rule can be removed from the current example set. This deletion
process is referred to as purging the current example set.
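The purging step can be sketched in a few lines. The rule and example representations below are invented for illustration (CD3's internal format is not specified here): a rule is a dictionary of attribute-value conditions, and an example is covered by a rule when it matches every condition.

```python
# Minimal sketch of CD3-style purging: drop every example in the current
# example base that is covered by an invalid rule extracted from the tree.

def covers(rule, example):
    """True if the example matches every attribute-value condition of the rule."""
    return all(example.get(attr) == val for attr, val in rule.items())

def purge(current_examples, invalid_rules):
    """Return the example base with all examples covered by invalid rules removed."""
    return [ex for ex in current_examples
            if not any(covers(rule, ex) for rule in invalid_rules)]

# An invalid rule is a path whose time-stamp value refers to the old period
# (attribute names and values here are illustrative).
invalid_rules = [{"outlook": "sunny", "timestamp": "old"}]

examples = [
    {"outlook": "sunny", "timestamp": "old", "label": "yes"},  # covered: purged
    {"outlook": "sunny", "timestamp": "new", "label": "no"},   # kept
    {"outlook": "rain",  "timestamp": "old", "label": "yes"},  # kept
]

current_example_base = purge(examples, invalid_rules)
print(current_example_base)
```

After purging, the updated current example base is ready for the next round of learning.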
11.8 Decision Trees Inducers for Large Datasets
While the computational complexity of decision tree induction algorithms is
considered relatively low, they can still run into difficulties when the
datasets are large, in particular if we are interested in building a forest.
With the recent growth in the amount of data collected by information
systems there is a need for decision trees that can handle large datasets.
Big data is a term coined recently to refer to large datasets that are too
difficult to process using existing methods. Several improvements to decision
tree induction algorithms have been suggested to address big data. In this
chapter, we will review the main methods.