Other decision tree inducers for large datasets can be found in the works of Alsabti et al. (1998), Freitas and Lavington (1998) and Gehrke et al. (1999).
Ping et al. (2007) introduced the Fast C4.5 (FC4.5) algorithm. This new implementation relaxes the memory restriction of the original C4.5. The original C4.5 implementation was developed in the early nineties, when main memory was still scarce, and it was therefore designed to use as little space as possible. Nowadays, however, main memory is plentiful, so by trading memory for speed it is possible to greatly accelerate the tree growing of the C4.5 algorithm. One well-known bottleneck in the induction of decision trees is the fact that the training instances are unordered, whereas finding the best split for continuous attributes requires the instances to be sorted. Moreover, the sorting is repeated for every new tree node that is explored, and, as indicated by Mehta et al. (1996), “for numeric attributes, sorting time is the dominant factor when finding the best split at a decision tree node”.
By adding a preprocessing step that is executed just before the tree growing, one can extract the order information of all the instances on every attribute in advance. Specifically, every continuous attribute column is sorted, and each instance's rank in that column is attached to the attribute value. Then, during the actual tree building, whenever the values need to be sorted, an indirect bucket-sort can be used instead of quick-sort. Finally, when a certain numeric attribute is selected, FC4.5 searches for the cutoff using binary search within the narrowest range.
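The following minimal sketch illustrates this presorting idea; it is not the original FC4.5 code, and the function names (presort_ranks, sort_node_instances) are illustrative. Each continuous attribute is sorted once up front, each instance receives its rank on that attribute, and at any node the instances can then be reordered by an indirect bucket-sort over the precomputed ranks rather than by a fresh quick-sort.

```python
import numpy as np

def presort_ranks(X):
    """For every continuous attribute (column of X), compute the rank of each
    instance in the globally sorted order of that attribute (done once)."""
    n, d = X.shape
    ranks = np.empty((n, d), dtype=np.int64)
    for j in range(d):
        order = np.argsort(X[:, j], kind="mergesort")  # one global sort per attribute
        ranks[order, j] = np.arange(n)                 # rank of instance i on attribute j
    return ranks

def sort_node_instances(instance_ids, ranks, j, n_total):
    """Order the instances reaching a node by attribute j using their precomputed
    ranks as bucket indices -- an indirect bucket-sort, no comparison sort needed."""
    buckets = np.full(n_total, -1, dtype=np.int64)
    for i in instance_ids:
        buckets[ranks[i, j]] = i            # drop each instance into its rank bucket
    return [int(i) for i in buckets if i != -1]  # scan buckets in rank order

# Example: instances 0 and 2 reach some node; order them by attribute 0.
X = np.array([[2.5, 7.0],
              [1.0, 3.5],
              [4.2, 0.5]])
ranks = presort_ranks(X)
print(sort_node_instances([0, 2], ranks, j=0, n_total=3))  # -> [0, 2]
```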
Moreover, the sort order for the descendant nodes can be derived from the sort order of the parent by a simple procedure with time complexity of O(n) instead of O(n log n). However, for doing that we will need to store an array of sorted indices for each numeric attribute.
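A possible sketch of that procedure is shown below (the details are assumed, not taken from the paper): since the parent already holds its instance indices sorted by a given attribute, a single stable pass that routes each index to the appropriate child yields the children's sorted index arrays in O(n), avoiding an O(n log n) re-sort at every node.

```python
def split_sorted_indices(sorted_idx, goes_left):
    """sorted_idx: instance indices sorted by some attribute at the parent node.
    goes_left:  boolean array, True if the instance is routed to the left child.
    Returns the two children's index arrays, still sorted by that attribute."""
    left, right = [], []
    for i in sorted_idx:               # one stable pass preserves the relative order
        (left if goes_left[i] else right).append(i)
    return left, right
```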
In fact, the idea of presorting the instances to accelerate the induction of a classification tree has been around since the second half of the nineties. Mehta et al. (1996) suggest using a dedicated data structure that holds a separate list for each input attribute. In addition, a class list holds the class labels assigned to the training instances. An entry in an attribute list has two values: the first contains the actual attribute value, and the second contains the corresponding index in the class list. Similarly, an entry in the class list refers to a certain instance in the training set and also has two values: one contains the class label, and the second value
identifies the leaf node in the tree to which the instance currently belongs. Initially, every entry in the class list points to the root of the tree.
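A minimal sketch of these data structures follows; the class and field names (AttributeEntry, ClassEntry, build_lists) are illustrative and not taken from Mehta et al. (1996). It shows one presorted list per input attribute, each entry pointing back into a single shared class list that records the label and the current leaf of every instance.

```python
from dataclasses import dataclass

@dataclass
class AttributeEntry:
    value: float       # the actual attribute value
    class_index: int   # index of the corresponding entry in the class list

@dataclass
class ClassEntry:
    label: str         # class label of the training instance
    leaf: object = None  # leaf node the instance currently belongs to

def build_lists(X, y, root=None):
    """Build one presorted attribute list per column of X plus the shared class list.
    Initially every class-list entry points to the root of the tree."""
    class_list = [ClassEntry(label, root) for label in y]
    attribute_lists = []
    for j in range(len(X[0])):
        entries = sorted((AttributeEntry(row[j], i) for i, row in enumerate(X)),
                         key=lambda e: e.value)
        attribute_lists.append(entries)
    return attribute_lists, class_list
```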