Other decision tree inducers for large datasets can be found in the works of Alsabti et al. (1998), Freitas and Lavington (1998) and Gehrke et al. (1999).
Ping et al. (2007) introduced the Fast C4.5 (FC4.5) algorithm. This new implementation relaxes the memory restriction of the original C4.5. The original C4.5 implementation was developed in the early nineties, when main memory was still scarce, and it was therefore designed to use as little space as possible. Nowadays, however, main memory is plentiful, so by trading memory for speed it is possible to greatly accelerate the tree growing of the C4.5 algorithm. One well-known bottleneck in the induction of decision trees is the fact that the training instances are unordered, whereas finding the best split for continuous attributes requires the instances to be sorted. Moreover, the sorting is repeated for every new tree node that is explored, and, as indicated by Mehta et al. (1996), “for numeric attributes, sorting time is the dominant factor when finding the best split at a decision tree node”.
By adding a preprocessing step that is executed just before the tree growing, one can extract the order information of all the instances on every attribute in advance. Specifically, every continuous attribute column is sorted, and each instance's rank in that column is attached to the attribute value. Then, during the actual tree building, whenever the values need to be sorted, an indirect bucket-sort can be used instead of quick-sort. Finally, when a certain numeric attribute is selected, FC4.5 searches for the cutoff using binary search within the narrowest range.
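The following minimal sketch illustrates this presorting idea; it is not the original FC4.5 code, and the function names (presort_ranks, sort_node_instances) are illustrative. Each continuous attribute is sorted once up front, each instance receives its rank on that attribute, and at any node the instances can then be reordered by an indirect bucket-sort over the precomputed ranks rather than by a fresh quick-sort.

```python
import numpy as np

def presort_ranks(X):
    """For every continuous attribute (column of X), compute the rank of each
    instance in the globally sorted order of that attribute (done once)."""
    n, d = X.shape
    ranks = np.empty((n, d), dtype=np.int64)
    for j in range(d):
        order = np.argsort(X[:, j], kind="mergesort")  # one global sort per attribute
        ranks[order, j] = np.arange(n)                 # rank of instance i on attribute j
    return ranks

def sort_node_instances(instance_ids, ranks, j, n_total):
    """Order the instances reaching a node by attribute j using their precomputed
    ranks as bucket indices -- an indirect bucket-sort, no comparison sort needed."""
    buckets = np.full(n_total, -1, dtype=np.int64)
    for i in instance_ids:
        buckets[ranks[i, j]] = i            # drop each instance into its rank bucket
    return [int(i) for i in buckets if i != -1]  # scan buckets in rank order

# Example: instances 0 and 2 reach some node; order them by attribute 0.
X = np.array([[2.5, 7.0],
              [1.0, 3.5],
              [4.2, 0.5]])
ranks = presort_ranks(X)
print(sort_node_instances([0, 2], ranks, j=0, n_total=3))  # -> [0, 2]
```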
Moreover, the sort order for the descendant nodes can be derived from the sort order of the parent by a simple procedure with time complexity of O(n) instead of O(n log n). However, for doing that we will need to store an array of sorted indices for each numeric attribute.
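A possible sketch of that procedure is shown below (the details are assumed, not taken from the paper): since the parent already holds its instance indices sorted by a given attribute, a single stable pass that routes each index to the appropriate child yields the children's sorted index arrays in O(n), avoiding an O(n log n) re-sort at every node.

```python
def split_sorted_indices(sorted_idx, goes_left):
    """sorted_idx: instance indices sorted by some attribute at the parent node.
    goes_left:  boolean array, True if the instance is routed to the left child.
    Returns the two children's index arrays, still sorted by that attribute."""
    left, right = [], []
    for i in sorted_idx:               # one stable pass preserves the relative order
        (left if goes_left[i] else right).append(i)
    return left, right
```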
In fact, the idea of presorting the instances to accelerate the induction of a classification tree has been around since the second half of the nineties. Mehta et al. (1996) suggest using a dedicated data structure that holds a separate list for each input attribute. In addition, a class list holds the class labels assigned to the training instances. An entry in an attribute list has two values: the first contains the actual attribute value, and the second contains the corresponding index in the class list. Similarly, an entry in the class list refers to a certain instance in the training set and also has two values: one contains the class label, and the second value
identifies the leaf node in the tree to which the instance currently belongs. Initially, every entry in the class list points to the root of the tree.
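A minimal sketch of these data structures follows; the class and field names (AttributeEntry, ClassEntry, build_lists) are illustrative and not taken from Mehta et al. (1996). It shows one presorted list per input attribute, each entry pointing back into a single shared class list that records the label and the current leaf of every instance.

```python
from dataclasses import dataclass

@dataclass
class AttributeEntry:
    value: float       # the actual attribute value
    class_index: int   # index of the corresponding entry in the class list

@dataclass
class ClassEntry:
    label: str         # class label of the training instance
    leaf: object = None  # leaf node the instance currently belongs to

def build_lists(X, y, root=None):
    """Build one presorted attribute list per column of X plus the shared class list.
    Initially every class-list entry points to the root of the tree."""
    class_list = [ClassEntry(label, root) for label in y]
    attribute_lists = []
    for j in range(len(X[0])):
        entries = sorted((AttributeEntry(row[j], i) for i, row in enumerate(X)),
                         key=lambda e: e.value)
        attribute_lists.append(entries)
    return attribute_lists, class_list
```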