5.3 Decision trees
Decision trees are visually similar to the graphical representation of HMMs, but operate
on very different principles. A decision tree is a type of classifier: it takes a set
of inputs describing individual data items and classifies each item into one of a set of
categories. Decision tree algorithms are trained using a set of input examples, each
labelled with the category to which it belongs. The algorithms examine the input data
to determine which variable best distinguishes between the categories, and which
values of this variable are informative. This variable forms the root of the tree. The
remaining variables are scrutinised for the next most informative variable, which generates
the second level of the tree. The process continues until maximum separation
between the output categories is achieved. Not all input variables will necessarily be included
in the tree, so decision trees also provide a means of feature selection.
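The training procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than any particular published algorithm: the splitting criterion (Gini impurity), the function names and the toy data are all assumptions made for the example. At each level the sketch chooses the variable and threshold that best separate the categories, and stops when no split improves the separation.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of category labels (0 means a pure group)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return the (variable index, threshold) giving the lowest weighted
    impurity, or None if no split improves on the unsplit data."""
    best, best_score = None, gini(labels)
    for var in range(len(rows[0])):
        values = sorted({r[var] for r in rows})
        # Candidate thresholds lie midway between adjacent observed values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[var] <= t]
            right = [y for r, y in zip(rows, labels) if r[var] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (var, t), score
    return best

def build_tree(rows, labels):
    """Grow the tree recursively; leaves hold the majority category."""
    split = best_split(rows, labels)
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    var, t = split
    low = [(r, y) for r, y in zip(rows, labels) if r[var] <= t]
    high = [(r, y) for r, y in zip(rows, labels) if r[var] > t]
    return {"var": var, "threshold": t,
            "low": build_tree(*map(list, zip(*low))),
            "high": build_tree(*map(list, zip(*high)))}

def classify(tree, row):
    """Follow the branches from the root until a leaf (a category) is reached."""
    while isinstance(tree, dict):
        tree = tree["low"] if row[tree["var"]] <= tree["threshold"] else tree["high"]
    return tree

def variables_used(tree):
    """Collect the indices of the variables that appear in the tree."""
    if not isinstance(tree, dict):
        return set()
    return {tree["var"]} | variables_used(tree["low"]) | variables_used(tree["high"])

# Hypothetical training data: three input variables per item, three categories.
rows = [(1.0, 5.0, 3.0), (1.05, 5.1, 9.0), (3.0, 5.0, 3.1),
        (3.1, 5.2, 9.2), (1.1, 5.1, 3.2), (3.2, 5.0, 9.1)]
labels = ["A", "B", "C", "C", "A", "C"]

tree = build_tree(rows, labels)
print(tree["var"])                        # index of the variable at the root
print(sorted(variables_used(tree)))       # variables that made it into the tree
print(classify(tree, (1.0, 5.0, 8.5)))    # category assigned to a new item
```

In this toy data set the second variable carries no useful information, so it never appears in the tree, which is the feature selection effect noted above.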
For example, Dieckmann and Malorny (2011) used a decision tree to classify
serovars of Salmonella enterica subsp. enterica using data from MALDI-TOF
MS. Part of the resulting tree is redrawn below (Figure 2.8).
There are many decision tree algorithms, each of which identifies informative
variables in a different way. One of the most widely used algorithms is C4.5
(Quinlan, 1993). C4.5 is freely available, although it is no longer supported. Its
successor, C5.0, is only available as source code, which requires a compiler for
the C language, and is therefore less accessible to the casual user.
The C4.5 algorithm uses a metric called entropy, which is derived from information
theory (Shannon, 1948). Claude Shannon was an American mathematician and
[Tree diagram not reproduced. Root node: "MALDI Result, S. enterica subsp. enterica". Node labels recoverable from the original include the serovars Enteritidis, Typhimurium, Virchow, Newport-II, Infantis, Austenberg and Colindale.]
FIGURE 2.8
Decision tree for the classification of Salmonella enterica subsp. enterica serovars. The input
data was generated using MALDI-TOF MS. Only part of the tree is shown.
Redrawn from Dieckmann and Malorny (2011).
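As a rough illustration of the entropy metric mentioned in the text, the sketch below computes the Shannon entropy of a set of category labels and the reduction in entropy (often called information gain) obtained by splitting the data on one variable at a given threshold. The function names and example values are hypothetical, and C4.5 itself applies further refinements, such as normalising the gain, that are not shown here.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy, in bits, of a list of category labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Reduction in entropy when items are split on values <= threshold."""
    low = [y for v, y in zip(values, labels) if v <= threshold]
    high = [y for v, y in zip(values, labels) if v > threshold]
    if not low or not high:
        return 0.0  # the threshold does not actually divide the data
    remainder = (len(low) * entropy(low) + len(high) * entropy(high)) / len(labels)
    return entropy(labels) - remainder

# Hypothetical example: four items in two equally common categories.
labels = ["A", "A", "B", "B"]
informative = [1.0, 2.0, 8.0, 9.0]     # separates A from B perfectly
uninformative = [1.0, 8.0, 2.0, 9.0]   # unrelated to the category

print(entropy(labels))                                 # 1.0 bit
print(information_gain(informative, labels, 5.0))      # 1.0
print(information_gain(uninformative, labels, 5.0))    # 0.0
```

A variable whose split removes all of the entropy separates the categories completely, whereas a variable with zero gain tells the classifier nothing and would not be chosen.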