9.2.3 Predictive Modeling
Predictive modeling is the practice of developing a mathematical model that can be
used to predict an outcome or a value based on a set of input parameters. Modeling
methodologies extend from very simple linear techniques up through highly sophis-
ticated machine learning algorithms. Regardless of the specific form, all models have
a few basic properties in common: they all take a series of inputs, termed attributes
in this context, and they all produce some form of result, termed the target attribute.
Models come in two distinct varieties, regression and classification. Regression
models have a continuous numeric target attribute; classification models have a
discrete categorical target attribute.
9.2.3.1 Feature Selection
Before moving on to the specifics of different modeling techniques, it is first neces-
sary to discuss the problem of feature selection, that is, determining the set of data to
use as input for our model. Translational research presents a number of challenges
in predictive modeling, one of which is the fact that the number of attributes is often
far greater than the number of samples. As an example, let's look at the problem of
classifying tumor types based on gene expression profiles. Using modern DNA
microarray techniques, our attributes (genes) will be vastly greater (1,000-fold or
more) than our number of samples (tumors). Research has shown that reducing the
number of attributes to a small set of informative genes can greatly enhance the
accuracy of our models [12]. Feature selection algorithms help us to identify the
most informative features in a data set. Here we look at two particular feature selection
algorithms that have proven useful in many different domains: information gain
and Relief-F.
Information Gain
The information gain method examines each feature and measures the reduction in
entropy of the target class distribution when that feature is used to partition the data
set [11].
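The entropy calculation above can be sketched as follows. This is a minimal illustration, not tied to any particular data-mining package; the function names and the toy data are our own:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a class-label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy reduction in the target when the data set is
    partitioned by the values of one (discrete) feature."""
    n = len(labels)
    partitions = {}
    for v, y in zip(feature_values, labels):
        partitions.setdefault(v, []).append(y)
    # Weighted entropy remaining after the partition.
    remainder = sum(len(p) / n * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder

# A feature that perfectly separates two balanced classes removes
# all uncertainty, so its gain equals the full target entropy (1 bit);
# a feature unrelated to the class yields a gain of 0.
print(information_gain(["a", "a", "b", "b"], [0, 0, 1, 1]))  # 1.0
print(information_gain(["a", "b", "a", "b"], [0, 0, 1, 1]))  # 0.0
```

In practice the gain is computed for every attribute and the top-ranked attributes are retained as model inputs.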
Relief-F
The Relief-F method draws instances at random, computes their nearest neighbors,
and adjusts a feature weighting vector to give more weight to features that discrimi-
nate the instance from neighbors of different classes [13].
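The weighting idea can be sketched with a simplified, binary-class variant of the algorithm (the full Relief-F additionally averages over k nearest neighbors and handles multiple classes); the data and function names here are illustrative only:

```python
import math
import random

def relief(X, y, n_iter=50, seed=0):
    """Simplified binary-class Relief: draw instances at random, find the
    nearest same-class neighbor (the "hit") and the nearest other-class
    neighbor (the "miss"), and increase the weight of features on which
    the instance differs more from the miss than from the hit."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    w = [0.0] * d

    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    for _ in range(n_iter):
        i = rng.randrange(n)
        hit = min((j for j in range(n) if j != i and y[j] == y[i]),
                  key=lambda j: dist(X[i], X[j]))
        miss = min((j for j in range(n) if y[j] != y[i]),
                   key=lambda j: dist(X[i], X[j]))
        for f in range(d):
            w[f] += abs(X[i][f] - X[miss][f]) - abs(X[i][f] - X[hit][f])
    return w

# Feature 0 separates the two classes; feature 1 is noise,
# so feature 0 accumulates the larger weight.
X = [[0.0, 0.3], [0.1, 0.9], [1.0, 0.2], [0.9, 0.8]]
y = [0, 0, 1, 1]
w = relief(X, y)
print(w[0] > w[1])  # True
```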
9.2.3.2 Regression
The problem of regression is perhaps best illustrated as an example of curve fitting.
From an initial set of data points, we create a function that draws a line that comes
as close as possible to all of the points. Now, using our function we can estimate
where on the line any new (unknown) data points might fall. In linear regression, we
are limited to functions that produce straight lines. Linear regression models are
simple to calculate and easy to evaluate; however, their utility is limited. Figure 9.1
below is an example of linear regression.
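The curve-fitting view can be made concrete with an ordinary least-squares fit of a single input attribute; this is a minimal sketch with made-up data points, not the figure's data:

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sxy / sxx
    return a, my - a * mx

# Points lying exactly on y = 2x + 1 recover slope 2 and intercept 1;
# the fitted line then estimates the target for a new, unseen input.
a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)        # 2.0 1.0
print(a * 10 + b)  # 21.0
```

Real data will not fall exactly on the line; least squares chooses the line minimizing the sum of squared vertical distances to the points.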