Visualizing Trees and Forests - Data Visualization

Other methods based on recursive partitioning of the plot space are treemaps and

spineplots of leaves. Both allow a concise view of all terminal nodes while retaining

hints of the splitting sequence. In conjunction with highlighting and brushing, the

main focus here is on the model behavior with respect to data points. As such, the plots

can be created using training and test data separately and compared. Treemaps are

more suitable for absolute comparisons and large, complex trees, whereas spineplots

of leaves can be used for relative comparison of groups within terminal nodes up to

moderately complex trees.
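
The recursive partitioning of the plot space behind a treemap can be sketched in a few lines of Python. This is only a minimal illustration, not the implementation used for the plots in this chapter: the nested-dictionary tree format, the function name `treemap_leaves`, and the rule of alternating the cut direction with depth are all assumptions made for the example.

```python
def treemap_leaves(node, x=0.0, y=0.0, w=1.0, h=1.0, depth=0):
    """Return (label, x, y, w, h) rectangles for all terminal nodes.

    Hypothetical tree format: a leaf is {"label": str, "n": int}; an inner
    node is {"n": int, "children": [subtree, ...]}, where "n" is the number
    of cases falling into the node.
    """
    children = node.get("children")
    if not children:
        return [(node["label"], x, y, w, h)]
    rects = []
    offset = 0.0  # fraction of the parent rectangle already used
    for child in children:
        share = child["n"] / node["n"]  # area proportional to node size
        if depth % 2 == 0:
            # even depth: cut the rectangle along the x axis
            rects += treemap_leaves(child, x + offset * w, y, share * w, h, depth + 1)
        else:
            # odd depth: cut the rectangle along the y axis
            rects += treemap_leaves(child, x, y + offset * h, w, share * h, depth + 1)
        offset += share
    return rects


# Made-up tree with 100 cases and three terminal nodes.
tree = {
    "n": 100,
    "children": [
        {"label": "A", "n": 40},
        {"n": 60, "children": [{"label": "B", "n": 15}, {"label": "C", "n": 45}]},
    ],
}
leaves = treemap_leaves(tree)
for label, rx, ry, rw, rh in leaves:
    print(label, round(rx, 2), round(ry, 2), round(rw, 2), round(rh, 2))
```

Because each rectangle's area is proportional to the number of cases in the leaf, leaf sizes can be compared on an absolute scale, and the nesting of the cuts still hints at the splitting sequence.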

Tree models can be unstable, that is, small changes in the data can lead to

entirely different trees. To analyze the stability of splits it is possible to visualize the

optimality criterion for candidate variables using mountain plots. Competing splits

within a variable become clearly visible and the comparison of mountain plots of

multiple candidate variables allows a quick assessment of the magnitude and cause

of potential instability.
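
The quantity drawn in a mountain plot can be sketched as follows: for one variable, evaluate the optimality criterion at every candidate cut-point. The example below is a minimal sketch that assumes the Gini impurity decrease as the criterion and midpoints between consecutive distinct values as candidate cut-points; the function names and the toy data are hypothetical, and other criteria (such as deviance gain) could be substituted.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def mountain_profile(x, y):
    """Return (cutpoint, impurity decrease) for every candidate split.

    Candidate cutpoints are midpoints between consecutive distinct values
    of x; cases with x <= cutpoint go to the left child.
    """
    pairs = sorted(zip(x, y), key=lambda p: p[0])
    n = len(pairs)
    parent = gini([lab for _, lab in pairs])
    profile = []
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no cut is possible between tied values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        profile.append(((lo + hi) / 2, parent - weighted))
    return profile


# Made-up data: one numeric variable and a two-class response.
x = [1, 2, 3, 4, 5, 6]
y = ["a", "a", "a", "b", "b", "b"]
profile = mountain_profile(x, y)
best = max(profile, key=lambda p: p[1])
```

Plotting the second element of each pair against the first gives the "mountain" for this variable. Several peaks of similar height within one profile, or across the profiles of different variables, indicate competing splits.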

The instability of a tree model can be used to obtain additional insight into the data

and to improve prediction accuracy. Bootstrapping provides a useful method for the

analysis of model variation by creating a whole set of tree models. Visualization of the

use of covariates in the splits as weighted barcharts with aggregate impurity criterion

as weight allows quick assessment of variable importance. Variable masking can be

detected using weighted fluctuation diagrams of variables and trees. This view is also

useful for finding groups of related tree models.
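
The weights behind such a barchart can be sketched in Python. To keep the example short it makes a strong simplifying assumption: each bootstrap "tree" consists of the root split only, so the aggregate weight of a variable is the sum of the Gini impurity decreases of the root splits in which it was chosen. The function names, the variable names `x1` and `x2`, and the data are all hypothetical; a real analysis would grow full trees and sum over all of their splits.

```python
import random


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(rows, labels, names):
    """Best single split over all variables: (gain, variable name, cutpoint)."""
    n = len(rows)
    parent = gini(labels)
    best = (0.0, None, None)  # stays (0.0, None, None) if nothing improves
    for j, name in enumerate(names):
        values = sorted(set(row[j] for row in rows))
        for lo, hi in zip(values, values[1:]):
            cut = (lo + hi) / 2
            left = [lab for row, lab in zip(rows, labels) if row[j] <= cut]
            right = [lab for row, lab in zip(rows, labels) if row[j] > cut]
            gain = parent - (len(left) * gini(left) + len(right) * gini(right)) / n
            if gain > best[0]:
                best = (gain, name, cut)
    return best


def bootstrap_usage(rows, labels, names, n_trees=200, seed=1):
    """Sum the impurity decrease per variable over bootstrap root splits."""
    rng = random.Random(seed)
    n = len(rows)
    weight = {name: 0.0 for name in names}
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        sample_rows = [rows[i] for i in idx]
        sample_labels = [labels[i] for i in idx]
        gain, name, _cut = best_split(sample_rows, sample_labels, names)
        if name is not None:
            weight[name] += gain
    return weight


# Made-up data: two numeric covariates and a two-class response.
names = ["x1", "x2"]
rows = [(1, 5), (2, 1), (3, 4), (4, 2), (5, 6), (6, 3), (7, 8), (8, 7)]
labels = ["a", "a", "a", "b", "a", "b", "b", "b"]
weight = bootstrap_usage(rows, labels, names)
for name, w in weight.items():
    print(f"{name:>3} {'#' * round(w)} {w:.2f}")  # crude text barchart
```

Weighting by impurity decrease rather than counting how often a variable is used keeps weak splits from inflating the apparent importance of a variable.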

Sectioned scatterplots also allow the visualization of partition boundaries for multiple

trees. The resulting plot can no longer be used for global drill-down due to the

lack of shared subgroups, but it provides a way of analyzing the “fuzziness” of a

cutpoint in conjunction with the data.

Finally, trace plots allow us to visualize split rules and the hierarchical structure

of arbitrarily many trees in a single view. They are based on a grid of variables and

tree levels (nodes of the same depth) where each cell corresponds to a candidate split

variable and thus to a potential tree node. Cells that are actually used are connected in

the same way as in the hierarchical view, thus reflecting the full structure of the tree.

Multiple trees can be superimposed on this grid, each leaving its own “trace.” The

resulting plot shows frequently used paths, common subgroups, and alternate splits.
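
The data structure behind a trace plot can be sketched as follows. Each inner node of a tree occupies the grid cell (level, split variable), and each parent-child pair of inner nodes contributes one edge between two cells. Counting cells and edges over all trees gives the frequencies that a trace plot displays. The dictionary-based tree format, the variable names, and the choice to ignore terminal nodes are assumptions made for this minimal example.

```python
from collections import Counter


def trace(node, depth=0):
    """Return (cells, edges) used by one tree.

    Hypothetical tree format: an inner node is a dict with key "var" (the
    split variable) and optional keys "left" and "right"; a missing or None
    child stands for a terminal node and leaves no trace.
    """
    cells = [(depth, node["var"])]
    edges = []
    for side in ("left", "right"):
        child = node.get(side)
        if child is not None:
            edges.append(((depth, node["var"]), (depth + 1, child["var"])))
            sub_cells, sub_edges = trace(child, depth + 1)
            cells += sub_cells
            edges += sub_edges
    return cells, edges


# Three made-up trees.
forest = [
    {"var": "age", "left": {"var": "income"}},
    {"var": "age", "left": {"var": "income"}, "right": {"var": "sex"}},
    {"var": "income", "left": {"var": "age"}},
]

cell_use = Counter()
edge_use = Counter()
for tree in forest:
    cells, edges = trace(tree)
    cell_use.update(cells)
    edge_use.update(edges)

for (parent, child), count in edge_use.most_common():
    print(parent, "->", child, "used by", count, "tree(s)")
```

Edges with high counts correspond to frequently used paths, while several cells in the same level that are reached from the same parent cell point to alternate splits.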

All plots in this chapter have been produced using R software for statistical

computing and KLIMT interactive software for visualization and analysis of trees and

forests. Visualization methods presented in this chapter are suitable for both

presentation of particular findings and exploratory work. The individual techniques

complement each other well by providing different viewpoints on the models and

data. Therefore they can be successfully used in an interactive framework. Trace plots,

for example, represent a very useful overview that can be linked to individual

hierarchical views. Subgroups defined by cells in the trace plot can be linked to data-based

plots, and its edges to sectioned scatterplots.

The methods presented here were mostly illustrated on classification examples,

but they can equally be used for regression trees and, for the most part, for survival trees as well.

Also, none of the methods described here is limited to binary trees, even though those

represent the most commonly used models. The variety of tree models and further
