Data Mining - Bioinformatics Computing

The tests can be binary (yes/no), as in Test 2, or multi-valued (high, medium, low), as in Test 1. In operation, for example, a decision tree can be used to categorize a protein based on a combination of molecular weight, length, and configuration. As illustrated in the figure, the terminal, or leaf, nodes need not map to distinct categories: both Test 2 and Test 7, for example, classify the input into category (A).
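
As a rough illustration of the idea, the following minimal sketch categorizes a protein with one multi-valued test followed by binary tests. The thresholds, the category labels, and the names weight_band and classify_protein are hypothetical, chosen only for illustration; they do not reproduce the tree in the figure.

```python
def weight_band(mw_kda):
    """Multi-valued test: bucket molecular weight (kDa) into low/medium/high.

    The 20 and 60 kDa cut-offs are illustrative assumptions.
    """
    if mw_kda < 20:
        return "low"
    if mw_kda < 60:
        return "medium"
    return "high"


def classify_protein(mw_kda, length, is_globular):
    """Walk a small hand-built decision tree and return a category label."""
    band = weight_band(mw_kda)  # multi-valued test at the root
    if band == "low":
        # Binary test on sequence length (assumed cut-off of 150 residues).
        return "A" if length < 150 else "B"
    if band == "medium":
        # Binary test on configuration.
        return "A" if is_globular else "C"
    return "C"


small_protein = classify_protein(15, 120, False)
medium_globular = classify_protein(40, 350, True)
```

Here two different leaves, reached through different tests, both return category "A", which mirrors the point that leaf nodes need not correspond to distinct categories.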

A potential limitation of decision trees is their inability to represent relative frequencies of occurrence. For example, with a very small training set, the terminal leaves of a complex tree are likely to be defined by chance alone. Consider the typical evolutionary tree that represents speciation over the past several hundred million years. A single fossil may be responsible for a bifurcation in the tree, even though the fossil may represent a relatively small, insignificant mutation in a much larger population. In the tree representation, however, the populations have equal weight.
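
One way to see this limitation is to count how many training samples reach each leaf, information that the tree structure itself does not carry. The sketch below is only an illustration: the tree in leaf_for, the sample records, and the name leaf_support are all hypothetical.

```python
from collections import Counter


def leaf_for(sample):
    """Hypothetical two-test tree; returns an identifier for the leaf reached."""
    if sample["length"] < 150:
        return "leaf_1"
    return "leaf_2" if sample["globular"] else "leaf_3"


def leaf_support(samples):
    """Count how many training samples end up at each leaf."""
    return Counter(leaf_for(s) for s in samples)


# A deliberately tiny, made-up training set.
training = [
    {"length": 120, "globular": True},
    {"length": 140, "globular": False},
    {"length": 130, "globular": True},
    {"length": 400, "globular": True},
    {"length": 900, "globular": False},
]

support = leaf_support(training)
```

In this made-up set, leaf_2 and leaf_3 are each supported by a single sample, yet in the tree they appear as branches of the same standing as leaf_1.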

In some cases, this inability to represent the relative frequency of occurrence can be used to advantage. For example, in classifying globins from a variety of species, multiple samples from the same or closely related species may skew the relative abundance of some properties over others. However, if these properties are represented as a decision tree, then the skew due to sample anomalies can be avoided.

Hidden Markov Models

A powerful statistical approach to constructing classifiers that deserves separate discussion is Hidden Markov Modeling. A Hidden Markov Model (HMM) is a statistical model for an ordered
