one, the new one replaces the old sub-tree. CVFDT learns a model which is
similar in accuracy to the one that would be learned by reapplying VFDT
to a moving window of examples every time a new example arrives, but
with O(1) complexity per example, as opposed to O(w), where w is the size
of the window.
Black and Hickey (1999) offer a new approach to handling the
aforementioned sub-tasks dealing with drift within incremental learning
methods. Instead of utilizing the time-windowing approach presented thus
far, they employ a new purging mechanism to remove examples that are no
longer valid while retaining valid examples, regardless of age. As a result,
the example base grows, which supports accurate classification. Black and Hickey
describe an algorithm called CD3, which utilizes ID3 with post-pruning,
based on the time-stamp attribute relevance or TSAR approach.
In this approach, the time-stamp is treated as an attribute, and its
value is added as an additional input attribute to the example's description,
later to be used in the induction process. Consequently, if the time-stamp
attribute appears in the decision tree, the implication is that it is relevant
to classification. This, in turn, means that drift has occurred. Routes where
the value of the time-stamp attribute refers to the old period (or periods)
represent invalid rules. When the process is stable for a sufficiently long
period, the time-stamp attribute should not appear in any path of the tree.
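The TSAR idea can be sketched as follows. This is an illustrative assumption, not the CD3 implementation: scikit-learn's DecisionTreeClassifier stands in for ID3, the two-period data set is invented, and the "drift detected" check simply asks whether the time-stamp attribute was used anywhere in the induced tree.

```python
# Sketch of the TSAR approach: treat the time-stamp as an ordinary input
# attribute; if it appears in the induced tree, drift is implied.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Two "periods" with a concept drift between them: in the old period the
# label follows attribute 0, in the new period it follows attribute 1.
X_old = rng.integers(0, 2, size=(200, 2))
y_old = X_old[:, 0]
X_new = rng.integers(0, 2, size=(200, 2))
y_new = X_new[:, 1]

# TSAR: append the time-stamp (here a period id, 0 = old, 1 = new)
# as one more input attribute before induction.
ts = np.concatenate([np.zeros(200), np.ones(200)])
X = np.column_stack([np.vstack([X_old, X_new]), ts])
y = np.concatenate([y_old, y_new])

tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)

# tree_.feature lists the attribute tested at each internal node;
# if the time-stamp attribute (index 2) occurs there, drift has occurred.
timestamp_used = 2 in set(tree.tree_.feature)
print("drift detected:", timestamp_used)
```

Because the same attribute values receive different labels in the two periods, only a split on the time-stamp can separate them, so the fully grown tree ends up testing it.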
The CD3 algorithm sustains a set of examples regarded as valid. This
set, referred to as the current example base, must be updated before another
round of learning can take place. Using invalid rules extracted from the CD3
tree, any example whose description matches (i.e. is covered by) that of an
invalid rule can be removed from the current example set. This deletion
process is referred to as purging the current example set.
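The purging step can be sketched in a few lines. The rule and example representations below are invented for illustration (CD3's internal format is not specified here): a rule is a dictionary of attribute-value conditions, and an example is covered by a rule when it matches every condition.

```python
# Minimal sketch of CD3-style purging: drop every example in the current
# example base that is covered by an invalid rule extracted from the tree.

def covers(rule, example):
    """True if the example matches every attribute-value condition of the rule."""
    return all(example.get(attr) == val for attr, val in rule.items())

def purge(current_examples, invalid_rules):
    """Return the example base with all examples covered by invalid rules removed."""
    return [ex for ex in current_examples
            if not any(covers(rule, ex) for rule in invalid_rules)]

# An invalid rule is a path whose time-stamp value refers to the old period
# (attribute names and values here are illustrative).
invalid_rules = [{"outlook": "sunny", "timestamp": "old"}]

examples = [
    {"outlook": "sunny", "timestamp": "old", "label": "yes"},  # covered: purged
    {"outlook": "sunny", "timestamp": "new", "label": "no"},   # kept
    {"outlook": "rain",  "timestamp": "old", "label": "yes"},  # kept
]

current_example_base = purge(examples, invalid_rules)
print(current_example_base)
```

After purging, the updated current example base is ready for the next round of learning.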
11.8 Decision Trees Inducers for Large Datasets
While the computational complexity of decision tree induction algorithms is
considered relatively low, they can still run into difficulties when the
datasets are large, in particular if we are interested in building a forest.
With the recent growth in the amount of data collected by information
systems there is a need for decision trees that can handle large datasets.
Big data is a term coined recently to refer to large datasets that are too
difficult to process using existing methods. Several improvements to decision
tree induction algorithms have been suggested to address big data. In this
chapter, we will review the main methods.