With a computation for gain in hand, we are now ready to answer the second
question above: “How should the best input attribute be chosen for a split?” The
simple answer is to choose the input attribute and split that maximize the gain.
The type of split depends on the data type and distribution of the input
attribute. For example, suppose that input attribute W is nominal with two
different values (X and Y). The only possible split is a binary split placing all X
observations in one child node and all Y observations in the other. However, if W
has three different values (X, Y, and Z), the possible splits are:
- a binary split with all X observations in one node and all Y and Z
  observations in a second node;
- a binary split with all Y observations in one node and all X and Z
  observations in a second node;
- a binary split with all Z observations in one node and all X and Y
  observations in a second node;
- a ternary split with all X observations in one node, all Y in a second node,
  and all Z in a third.
The number of alternatives increases exponentially as the cardinality of the
nominal input attribute increases. To limit this complexity, an acceptable rule
of thumb is to consider only binary splits.
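To make the counting concrete, the following is a minimal Python sketch that enumerates the binary splits of a nominal attribute. The function name binary_splits is hypothetical and not from the text. For the three values X, Y, and Z it produces exactly the three binary splits listed above. For k distinct values there are 2^(k-1) - 1 such groupings, so implementations often restrict the candidates further when the cardinality is high.

```python
from itertools import combinations


def binary_splits(values):
    """Yield (left, right) pairs: each way to divide the distinct values
    of a nominal attribute into two non-empty groups, without mirrored
    duplicates. (Hypothetical helper, for illustration only.)"""
    distinct = sorted(set(values))
    first, rest = distinct[0], distinct[1:]
    # Keep the first value in the left group so that each grouping is
    # produced once rather than twice (left/right swapped).
    for size in range(len(rest) + 1):
        for extra in combinations(rest, size):
            left = {first, *extra}
            right = set(distinct) - left
            if right:  # skip the case where everything lands on the left
                yield left, right


for left, right in binary_splits(["X", "Y", "Z"]):
    print(sorted(left), "|", sorted(right))

# The count grows quickly with the number of distinct values.
for k in (3, 5, 10):
    print(k, "values:", sum(1 for _ in binary_splits(range(k))), "binary splits")
```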
When the input attribute is continuous (numeric) and only binary splits are
considered, a split position must be determined such that all observations
whose input attribute value is less than or equal to the split position are
assigned to the first child node and all other observations are assigned to the
second child node. Hence, the problem becomes locating the split position that
maximizes the gain for the input attribute in question. Depending on the number
of unique values for the input attribute in the node, this could become a rather
expensive exhaustive search. To simplify, one alternative is to choose a finite
number of equally spaced split positions (10, for example), evaluate each, and
choose the position yielding the greatest gain.
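The equally spaced search can be sketched in Python as follows. This is an illustration under stated assumptions, not the text's own implementation: gain is assumed here to be the reduction in entropy of a categorical target, which may differ from the gain computation defined earlier, and the names entropy, gain, and best_numeric_split are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain(parent, left, right):
    """ASSUMPTION: gain = parent entropy minus weighted child entropy."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


def best_numeric_split(xs, ys, n_positions=10):
    """Try n_positions equally spaced thresholds strictly between min(xs)
    and max(xs); return (position, gain) for the best one, or (None, 0.0)
    if no threshold separates the observations."""
    lo, hi = min(xs), max(xs)
    best_pos, best_gain = None, 0.0
    if lo == hi:
        return best_pos, best_gain
    step = (hi - lo) / (n_positions + 1)
    for i in range(1, n_positions + 1):
        pos = lo + i * step
        # Values <= pos go to the first child, all others to the second.
        left = [y for x, y in zip(xs, ys) if x <= pos]
        right = [y for x, y in zip(xs, ys) if x > pos]
        if not left or not right:
            continue
        g = gain(ys, left, right)
        if g > best_gain:
            best_pos, best_gain = pos, g
    return best_pos, best_gain


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(best_numeric_split(xs, ys))
```

With ties, this sketch keeps the first position that reaches the best gain. Placing candidate positions at midpoints between adjacent sorted values is a common alternative when the number of unique values is small.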
The algorithm to choose the best split becomes:
1. For each potential input attribute, find the split based on that attribute
yielding the best gain.
2. Choose the input attribute with the overall best gain.
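The two steps above can be sketched as a small Python function. As before, this is only a sketch under assumptions: gain is taken to be entropy reduction on a categorical target, nominal attributes are limited to one-value-versus-rest binary splits, numeric attributes use equally spaced positions, and every name (entropy, gain, candidate_splits, best_split, and the example table) is hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain(labels, mask):
    """ASSUMPTION: gain = entropy reduction of the binary split in mask."""
    left = [y for y, m in zip(labels, mask) if m]
    right = [y for y, m in zip(labels, mask) if not m]
    if not left or not right:
        return 0.0
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted


def candidate_splits(column, n_positions=10):
    """Yield (description, mask) pairs for the binary splits of one attribute."""
    if all(isinstance(v, (int, float)) for v in column):
        lo, hi = min(column), max(column)
        for i in range(1, n_positions + 1):
            pos = lo + i * (hi - lo) / (n_positions + 1)
            yield f"<= {pos:.3g}", [v <= pos for v in column]
    else:
        # Nominal attribute: one value against all the others.
        for value in sorted(set(column)):
            yield f"== {value!r}", [v == value for v in column]


def best_split(table, target):
    labels = table[target]
    per_attribute = {}
    for name, column in table.items():
        if name == target:
            continue
        # Step 1: the best split for this attribute.
        per_attribute[name] = max(
            ((gain(labels, mask), desc) for desc, mask in candidate_splits(column)),
            key=lambda pair: pair[0],
        )
    # Step 2: the attribute whose best split has the overall best gain.
    name = max(per_attribute, key=lambda a: per_attribute[a][0])
    g, desc = per_attribute[name]
    return name, desc, g


table = {
    "W": ["X", "Y", "Z", "X", "Y", "Z"],
    "V": [1.0, 2.0, 3.0, 7.0, 8.0, 9.0],
    "class": ["no", "no", "no", "yes", "yes", "yes"],
}
print(best_split(table, "class"))
```

In this toy table the nominal attribute W carries no information about the class, while a threshold on V separates the two classes completely, so V is selected.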
Stopping the splitting process
This brings us to the final question of when to terminate the splitting process.
There are a number of possible “stop rules” that may be applied:
 