Visualizing Trees and Forests - Data Visualization


Each final partition has been assigned a prediction value or model. For classification trees the value is the predicted class; for regression trees it is the predicted constant. More complex tree models also exist, such as those featuring linear models in terminal nodes. In what follows we will mostly use classification trees with binary splits for illustration purposes, but all methods can be generalized to more complex tree models unless specifically stated otherwise. We call any tree consisting of rules in its inner nodes a decision tree, regardless of the type of prediction in its leaves.

10.2.1 Hierarchical Views

Probably the most natural way to visualize a tree model is to display its hierarchical structure. Let us describe more precisely what it is we want to visualize. To describe the topology of a tree, we borrow some terminology from graph theory. A graph is a set of nodes (sometimes called vertices) and edges. Here a tree is defined as a connected, acyclic graph. Topologically, decision trees are a special subset of those, namely, connected directed acyclic graphs (DAGs) with exactly one node of indegree 0 (the root, which has no parent) and outdegrees other than 1 (i.e., at least two children or none at all).
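
The topological conditions above can be checked mechanically. The following is a minimal sketch, assuming the tree is given as a dictionary that maps each node to the list of its children; that representation and the function name is_decision_tree_topology are illustrative choices, not part of the original text. Besides the single root and the outdegree condition, it requires every other node to have exactly one parent and to be reachable from the root, which together rule out cycles and disconnected parts.

```python
from collections import deque


def is_decision_tree_topology(children):
    """Check the decision-tree topology of a directed graph.

    `children` maps a node to the list of its child nodes (hypothetical
    representation; leaves may be omitted or mapped to an empty list).
    """
    # Collect every node that appears as a parent or as a child.
    nodes = set(children)
    for kids in children.values():
        nodes.update(kids)

    # Count incoming edges for each node.
    indegree = {node: 0 for node in nodes}
    for kids in children.values():
        for kid in kids:
            indegree[kid] += 1

    # Exactly one node of indegree 0: the root.
    roots = [node for node in nodes if indegree[node] == 0]
    if len(roots) != 1:
        return False
    root = roots[0]

    # Every other node has exactly one parent.
    if any(indegree[node] != 1 for node in nodes if node != root):
        return False

    # Outdegree other than 1: at least two children or none at all.
    if any(len(children.get(node, [])) == 1 for node in nodes):
        return False

    # Connectedness: every node must be reachable from the root.
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for kid in children.get(node, []):
            if kid not in seen:
                seen.add(kid)
                queue.append(kid)
    return seen == nodes
```

A node with a single child, or a node shared by two parents, makes the check fail, whereas a lone root without children passes.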

To fully describe a decision tree, additional information is associated with each node. For inner nodes this information represents the splitting rule; for terminal nodes it consists of the prediction. Plots of tree models attempt to make such information visible in addition to displaying the graph aspect of the model. Three different ways to visualize the same classification tree model are shown in Fig. . .
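
To make the node information concrete, here is a minimal sketch of a binary decision tree in which inner nodes hold a splitting rule and terminal nodes hold a prediction. The class Node, its fields, the rule form "variable < threshold sends a case to the left child", and the helpers predict and describe are all assumptions made for illustration; they do not describe the model shown in the figure.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    # Inner nodes: splitting rule "case[variable] < threshold goes left".
    variable: Optional[str] = None
    threshold: Optional[float] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    # Terminal nodes: the predicted class.
    prediction: Optional[str] = None

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def predict(node: Node, case: Dict[str, float]) -> Optional[str]:
    """Follow the splitting rules from the root down to a leaf."""
    while not node.is_leaf():
        if case[node.variable] < node.threshold:
            node = node.left
        else:
            node = node.right
    return node.prediction


def describe(node: Node, depth: int = 0) -> List[str]:
    """Return an indented text outline: one line per node."""
    indent = "  " * depth
    if node.is_leaf():
        return [f"{indent}predict {node.prediction}"]
    lines = [f"{indent}{node.variable} < {node.threshold}"]
    lines += describe(node.left, depth + 1)
    lines += describe(node.right, depth + 1)
    return lines
```

Even the plain-text outline produced by describe shows both kinds of information a plot has to convey: the indentation carries the graph structure, and each line carries the rule or prediction attached to a node.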

The tree model is based on the Italian olive oil dataset (Forina et al. ), which records the composition of Italian olive oils from different regions of Italy. Each covariate corresponds to the proportion (in / th) of a fatty acid (listed in order of concentration): oleic, palmitic, linoleic, stearic, palmitoleic, arachidic, linolenic, and eicosenoic acid. The response variable is categorical and specifies the region of origin. The goal is to determine how the composition of olive oils varies across regions of Italy. For illustration purposes we perform a classification using five regions: Sicily, Calabria, Sardinia, Apulia, and North (the latter consolidating regions north of Apulia).

Although the underlying model is the same for all plots in Fig. . , the visual representation is different in each plot. Visualization of a tree model based on its hierarchical structure has to address the following tasks:

Placement of nodes

Visual representation of nodes

Visual representation of edges

Annotation
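
As an illustration of the first task, placement of nodes, the sketch below computes one simple layout: leaves are spaced evenly from left to right, each parent is centered above its outermost children, and the vertical coordinate is the depth of the node. This is only one of many possible placement strategies and is not claimed to be the one used for the plots in the figure; the function name layout and the dictionary-of-children input are assumptions.

```python
def layout(children, root):
    """Return {node: (x, depth)} for a tree given as a node -> children mapping."""
    positions = {}
    next_leaf_x = 0  # x coordinate assigned to the next leaf, left to right

    def place(node, depth):
        nonlocal next_leaf_x
        kids = children.get(node, [])
        if not kids:
            # Leaves take consecutive integer positions.
            x = float(next_leaf_x)
            next_leaf_x += 1
        else:
            # A parent is centered between its first and last child.
            xs = [place(kid, depth + 1) for kid in kids]
            x = (xs[0] + xs[-1]) / 2
        positions[node] = (x, depth)
        return x

    place(root, 0)
    return positions
```

The resulting coordinates can be handed to any drawing routine; the remaining tasks (how nodes and edges are drawn, and what is annotated) are then independent of the placement.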

Each task can be used to represent additional information associated with the model or data. Visual representation of a node is probably the most obvious way to add such information. In the first (top left) plot, a node consists solely of a tick mark with an annotation describing the split rule for the left child. In the second (top right) plot, a node is represented by a rectangle whose size corresponds to the number of cases
