10.1 Introduction
Tree-based models provide an appealing alternative to conventional models for many reasons. They are more readily interpretable, can handle both continuous and categorical covariates, can accommodate data with missing values, provide an implicit variable selection, and model interactions well. The most frequently used tree-based models are classification, regression, and survival trees.
Visualization is important in conjunction with tree models because in their graphical form they are easily interpretable even without special knowledge. Interpretation of decision trees displayed as a hierarchy of decision rules is highly intuitive.
Moreover, tree models reflect properties of the underlying data and have other supplemental information associated with them, such as the quality of cut points, split stability, and prediction trustworthiness. All this information, along with the complex structure of the trees themselves, needs to be explored and conveyed. Visualization provides a powerful tool for presenting the key aspects of these models in a concise manner that allows quick comparisons.
In this chapter we will first briefly introduce tree models and present techniques for visualizing individual trees. These range from classical hierarchical views to less widely known methods such as treemaps and sectioned scatterplots.
In the next section we will use visualization tools to discuss the stability of splits and of entire tree models, motivating the use of tree ensembles and forests. Finally, we will present methods for displaying entire forests at a glance and other ways of analyzing multiple tree models.
10.2 Individual Trees
The basic principle of all tree-based methods is a recursive partitioning of the covariate space to separate subgroups that constitute a basis for prediction. This means that, starting with the full dataset, at each step a rule is consulted that specifies how the data are split into disjoint partitions. This process is repeated recursively until there is no rule defined for further partitioning.
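As a sketch of this idea, the following minimal Python function recursively partitions a dataset. The rule-selection helper find_rule is hypothetical and stands in for whatever splitting and stopping criteria a concrete method defines; it returns None when no further rule is defined.

    def partition(data, find_rule):
        """Recursively partition `data` into a flat list of final partitions.

        `find_rule` is a hypothetical helper: given a dataset it either
        returns a function mapping each case to a partition label, or None
        when no further partitioning rule is defined.
        """
        rule = find_rule(data)
        if rule is None:                  # no rule defined: final partition
            return [data]
        groups = {}
        for case in data:                 # the rule assigns each case to
            groups.setdefault(rule(case), []).append(case)  # a disjoint group
        result = []
        for subset in groups.values():    # recurse into every partition
            result.extend(partition(subset, find_rule))
        return result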
Commonly used classification and regression trees use univariate decision rules in each partitioning step; that is, the rule specifying which cases fall into which partition evaluates only one data variable at a time. For continuous variables the rule usually creates two partitions satisfying $x_i < s$ and $x_i \ge s$, respectively, where $s$ is a constant. Partitions induced by rules using categorical variables are based on the categories assigned to each partition. We often refer to a partitioning step as a split and speak of the value $s$ as the cut point.
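For illustration, univariate rules of both kinds might be expressed as follows. This is a sketch with function names of our own choosing, not the interface of any particular tree package; cases are assumed to be indexable records and i is the index of the variable the rule evaluates.

    def continuous_split(cases, i, s):
        """Split on continuous variable i at cut point s:
        x_i < s versus x_i >= s."""
        return ([x for x in cases if x[i] < s],
                [x for x in cases if x[i] >= s])

    def categorical_split(cases, i, first_categories):
        """Split on categorical variable i: the rule lists the categories
        assigned to the first partition; all others form the second."""
        return ([x for x in cases if x[i] in first_categories],
                [x for x in cases if x[i] not in first_categories])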
The recursive partitioning process can be described by a tree. The root node corresponds to the first split and its children to subsequent splits in the resulting partitions. The tree is built recursively in the same way as the partitioning, and terminal nodes (also called leaves) represent final partitions. Therefore each inner node corresponds to a partitioning rule and each terminal node to a final partition.
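A correspondingly minimal representation of such a binary tree, again a sketch rather than the data structure of any specific implementation, could look like this:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class TreeNode:
        """Inner nodes carry a univariate rule (variable index and cut
        point); leaves carry only a prediction for their final partition."""
        var: Optional[int] = None           # index i of the split variable
        s: Optional[float] = None           # cut point of the rule x_i < s
        left: Optional["TreeNode"] = None   # child for cases with x_i < s
        right: Optional["TreeNode"] = None  # child for cases with x_i >= s
        prediction: Optional[float] = None  # set only on terminal nodes

    def predict(node: TreeNode, x):
        """Route a case down the hierarchy of rules from root to leaf."""
        while node.prediction is None:      # inner node: consult its rule
            node = node.left if x[node.var] < node.s else node.right
        return node.prediction

Here a node is terminal exactly when its prediction is set, so following the rules from the root always ends in a final partition.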