Data Mining - Bioinformatics Computing

Biomedical Engineering Reference

based on statistical methods. One goal in using regression methods is to extrapolate trends from a

few samples of the data. In the example in Figure 7-2, the extrapolation formula is a simple linear

function of the form:

y = mx + b

where x and y are coordinates on the plot, m is the slope of the line, and b is a constant. In practice,

more complex extrapolation formulas are used to describe data trends.
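
As a minimal sketch of how such a linear formula might be obtained, the following Python example fits m and b by ordinary least squares and then extrapolates to a new x. The function names (fit_line, predict) and the sample values are illustrative assumptions, not taken from the text or from Figure 7-2.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = m*x + b; returns (m, b)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = y_bar - m * x_bar
    return m, b


def predict(m, b, x):
    """Evaluate the fitted line at x (extrapolation if x is outside the data)."""
    return m * x + b


# Hypothetical samples that lie exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

m, b = fit_line(xs, ys)
y_at_10 = predict(m, b, 10.0)  # extrapolate well beyond the sampled range
print(m, b, y_at_10)
```

With real data the points rarely fall exactly on a line, so the fitted m and b describe only the overall trend, and extrapolating far beyond the sampled range becomes correspondingly less reliable.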

Link analysis evaluates apparent connections or links between data in the database or data

warehouse. Link analysis highlights correlations in data that can suggest linkage, but not causality. In

the illustration, the two pairs of data points are apparently linked, in that the value of one data

point in each pair can be predicted from the value of the other data point in the pair.
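
One common way to quantify an apparent link between two data elements is a correlation coefficient. The sketch below computes the Pearson coefficient in plain Python. The choice of Pearson correlation, the function name pearson_r, and the sample values are assumptions made for illustration; the text does not prescribe a specific measure.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired observations")
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


# Hypothetical paired measurements.
dose = [1.0, 2.0, 3.0, 4.0, 5.0]
response_up = [2.0, 4.0, 6.0, 8.0, 10.0]    # rises with dose
response_down = [10.0, 8.0, 6.0, 4.0, 2.0]  # falls with dose

r_up = pearson_r(dose, response_up)
r_down = pearson_r(dose, response_down)
print(r_up, r_down)
```

A coefficient near +1 or -1 indicates that one value can be predicted well from the other by a linear rule, which is the sense of "linked" used here. It says nothing about which variable, if either, causes the other.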

Deviation detection identifies data values that are outside of the norm, as defined by existing models

or by evaluating the ordering of observations. The outlier in the illustration is an example of a data

value outside of the expected spread of data in a sample. The data may represent a particular

sequence of amino acids or the molecular weight of a protein, or a vital sign, for example.
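
As a rough illustration of detecting deviations from the ordering of observations, the sketch below flags values that fall outside the conventional 1.5 x interquartile-range fences. The 1.5 multiplier, the function name find_outliers, and the sample readings are assumptions for illustration, not a method specified in the text.

```python
from statistics import quantiles


def find_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the first/third quartiles."""
    if len(values) < 4:
        raise ValueError("need at least four observations to estimate quartiles")
    q1, _median, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical repeated measurements of one quantity, with one aberrant reading.
readings = [4.8, 5.0, 5.1, 5.2, 4.9, 5.0, 9.7]

outliers = find_outliers(readings)
print(outliers)
```

Because the fences depend only on the quartiles, a single extreme value has little effect on what counts as "normal", which is one reason an order-based rule is often preferred to a mean-and-standard-deviation rule for small samples.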

Segmentation-based data mining identifies classes or groups of data that behave similarly, according

to some metric. Segmentation is akin to link analysis applied to groups of data instead of individual

data points. In the figure, groups (A) and (C) behave similarly.
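
The idea of grouping data by a metric can be sketched with a deliberately simple stand-in for a real clustering algorithm: sort one-dimensional values and start a new group whenever the gap to the previous value exceeds a threshold. The function name segment_by_gap, the threshold, and the sample values are hypothetical.

```python
def segment_by_gap(values, max_gap):
    """Group sorted 1-D values; a gap larger than max_gap starts a new group."""
    if max_gap < 0:
        raise ValueError("max_gap must be non-negative")
    groups = []
    for v in sorted(values):
        # Extend the current group if v is close enough to its last member.
        if groups and v - groups[-1][-1] <= max_gap:
            groups[-1].append(v)
        else:
            groups.append([v])
    return groups


# Hypothetical protein molecular weights (kDa) from two distinct families.
weights = [12.1, 12.4, 30.2, 11.9, 29.8, 30.5]

groups = segment_by_gap(weights, max_gap=1.0)
print(groups)
```

Here the metric is simply the difference between values. Practical segmentation usually works in many dimensions and uses a dedicated clustering method, but the principle of grouping by similarity under a chosen metric is the same.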

These methods of data mining are typically used in combination with each other, either in parallel or

as part of a sequential operation. For example, segmentation requires classes to be defined through a

classification process. Similarly, link analysis assumes that statistical analyses, including correlation

coefficients, are available. Likewise, deviation detection assumes that the data have been properly

classified and evaluated statistically to define the "normal" model. As described later in this chapter,

there are a variety of technologies available to support these methods.

Evaluation

In the evaluation phase of knowledge discovery, the patterns identified by the data-mining analysis

are interpreted. Evaluation typically ranges from simple statistical analysis, through complex numerical

analysis of sequences and structures, to determining the clinical relevance of the findings.

Visualization

Visualization of evaluation results is an optional stage in the knowledge-discovery process, but one

that typically adds considerable value to the overall system. Visualization can range from converting

tabular listings of data summaries to pie charts and similar business graphics, to using real-time data

to create 3D virtual reality displays that can be manipulated by haptic controllers.

Designing New Queries

Data mining is an iterative, continual activity, in that there are always new hypotheses to test.

Sometimes the new hypotheses are suggested by the data returned by the mining process, and other

times the hypotheses originate from other research. In either case, testing the new hypotheses

requires formulating new queries and revisiting the selection and sampling stage of the data-mining

process.
