Decision trees
Decision tree models are a powerful, nonprobabilistic technique that can capture more complex nonlinear patterns and feature interactions. They have been shown to perform well on many tasks, are relatively easy to understand and interpret, can handle categorical and numerical features, and do not require input data to be scaled or standardized. They are also well suited for use in ensemble methods (for example, ensembles of decision tree models, which are called decision forests).
The decision tree model constructs a tree whose leaves represent a class assignment to class 0 or 1 and whose branches represent decisions based on feature values. In the following figure, we show a simple decision tree where the binary outcome is Stay at home or Go to the beach. The features are based on the weather outside.
A simple decision tree
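To make the structure concrete, here is a minimal sketch of such a tree written out as nested conditionals. The exact features and thresholds (a weather category and a temperature) are hypothetical stand-ins for whatever the figure shows; each branch tests a feature and each leaf returns one of the two classes.

```python
def go_to_beach(weather, temperature_c):
    """A hand-written version of a simple decision tree.

    Each `if` corresponds to a branch testing a feature,
    and each `return` corresponds to a leaf's class assignment.
    The feature names and the threshold of 20 degrees are
    illustrative assumptions, not taken from the figure.
    """
    if weather == "rainy":
        return "Stay at home"
    # Weather is not rainy; check a second feature.
    if temperature_c >= 20:
        return "Go to the beach"
    return "Stay at home"


print(go_to_beach("sunny", 25))   # Go to the beach
print(go_to_beach("rainy", 30))   # Stay at home
```

A learned decision tree has exactly this form; the training algorithm's job is to choose which feature to test at each branch and where to place the leaves.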
The decision tree algorithm is a top-down approach that begins at a root node (or feature) and then, at each step, selects the feature that gives the best split of the dataset, as measured by the information gain of that split. The information gain is computed as the node impurity (which is low when the labels at the node are similar, or homogeneous) minus the weighted sum of the impurities of the two child nodes that would be created by the split.
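The following sketch shows this computation for one candidate split, using Shannon entropy as the impurity measure (other measures, such as Gini impurity, follow the same pattern). The labels and the sunny/rainy grouping are made-up example data, not values from the text.

```python
import numpy as np


def entropy(labels):
    """Impurity of a set of binary labels, measured as Shannon entropy."""
    if len(labels) == 0:
        return 0.0
    p = np.mean(labels)  # fraction of class-1 labels at this node
    if p == 0.0 or p == 1.0:
        return 0.0       # a pure (homogeneous) node has zero impurity
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))


def information_gain(parent, left, right):
    """Parent impurity minus the weighted sum of the child impurities."""
    n = len(parent)
    w_left, w_right = len(left) / n, len(right) / n
    return entropy(parent) - (w_left * entropy(left) + w_right * entropy(right))


# Hypothetical example: 8 samples split on a weather feature.
parent = np.array([1, 1, 1, 1, 0, 0, 0, 0])  # 1 = go to the beach, 0 = stay home
left = np.array([1, 1, 1, 1])                # e.g. the sunny samples
right = np.array([0, 0, 0, 0])               # e.g. the rainy samples
print(information_gain(parent, left, right))  # 1.0: a perfect split
```

At each node, the algorithm evaluates this gain for every candidate feature (and threshold, for numerical features) and greedily picks the split with the largest gain before recursing into the two child nodes.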