Advanced Analytical Theory and Methods: Classification - Data Science and Big Data Analytics

Database Reference

In-Depth Information

reinforces the efficiency of decision trees. However, once a bad split is taken, it

is propagated through the rest of the tree. To address this problem, an ensemble

technique (such as random forest) may randomize the splitting or even randomize

data and come up with a multiple tree structure. These trees then vote for each

class, and the class with the most votes is chosen as the predicted class.

There are a few ways to evaluate a decision tree. First, evaluate whether the splits

of the tree make sense. Conduct sanity checks by validating the decision rules with

domain experts, and determine if the decision rules are sound.

Next, look at the depth and nodes of the tree. Having too many layers and

obtaining nodes with few members might be signs of overfitting. In overfitting, the

model fits the training set well, but it performs poorly on the new samples in the

testing set. Figure 7.7 illustrates the performance of an overfit model. The x -axis

represents the amount of data, and the y -axis represents the errors. The blue curve

is the training set, and the red curve is the testing set. The left side of the gray

vertical line shows that the model predicts well on the testing set. But on the right

side of the gray line, the model performs worse and worse on the testing set as

more and more unseen data is introduced.

Figure 7.7 An overfit model describes the training data well but predicts poorly

on unseen data

For decision tree learning, overfitting can be caused by either the lack of training

data or the biased data in the training set. Two approaches [10] can help avoid

overfitting in decision tree learning.

• Stop growing the tree early before it reaches the point where all the

training data is perfectly classified.

• Grow the full tree, and then post-prune the tree with methods such as

reduced-error pruning and rule-based post pruning.

Last, many standard diagnostics tools that apply to classifiers can help evaluate

overfitting. These tools are further discussed in Section 7.3.

Search WWH ::

Custom Search

Home