Preparing the data for mining involves several stages:

Data Characterization
Consistency Analysis
Domain Analysis
Data Enrichment
Frequency and Distribution Analysis
Normalization
Missing Value Analysis
Data characterization involves creating a high-level description of the nature and content of the
data to be mined. This stage of the knowledge-discovery process serves primarily the programmers
and other staff involved in a data-mining project: it provides a form of documentation that can be
consulted by those who may not be familiar with the underlying biology represented by the data.
Consistency analysis is the process of determining the variability in the data, independent of the
domain. Consistency analysis is primarily a statistical assessment of data, based solely on data
values. Outliers and values determined to be significantly different from other data may be
automatically excluded from the knowledge-discovery process, based on predefined statistical
constraints. For example, values of a given parameter that lie more than three standard
deviations from the mean might be excluded from the mining operation.
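
A minimal sketch of such a statistical filter in Python; the three-standard-deviation threshold and the sample readings are illustrative assumptions, not part of any particular mining package:

from statistics import mean, stdev

def filter_outliers(values, num_sd=3.0):
    """Keep only values within num_sd standard deviations of the
    mean; the rest are excluded from the mining operation."""
    m = mean(values)
    sd = stdev(values)
    if sd == 0:
        return list(values)
    return [v for v in values if abs(v - m) <= num_sd * sd]

# The reading of 210 lies more than three standard deviations from
# the mean of this sample and is therefore excluded.
readings = [97, 98, 99, 98] * 5 + [210]
print(filter_outliers(readings))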
Domain analysis involves validating the data values in the larger context of the biology. That is,
domain analysis goes beyond simply verifying that a data value is a text string or an integer, or that
it's statistically consistent with other data on the same parameter, to ensure that it makes sense in
the context of the biology. For example, values for physiological parameters can be validated to the
extent that they are within physiologically possible ranges consistent with life. A blood pH of 13, a
body temperature of 45 degrees Celsius, a protein with a molecular weight of 20 milligrams, and a
patient age of 120 would be flagged as invalid values that should be excluded from the knowledge-
discovery process. Domain analysis requires that someone familiar with the biology create the
heuristics that can be applied to the data.
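
The heuristics themselves might be encoded as simple range checks. The following Python sketch uses hypothetical parameter names and plausible ranges chosen for illustration; a domain expert would supply the actual limits:

# Hypothetical ranges compatible with life, chosen for illustration.
PLAUSIBLE_RANGES = {
    "blood_pH": (6.8, 7.8),
    "body_temp_celsius": (25.0, 44.0),
    "age_years": (0, 115),
}

def flag_invalid(record):
    """Return the parameters whose values fall outside the
    physiologically possible range for that parameter."""
    invalid = []
    for param, value in record.items():
        low, high = PLAUSIBLE_RANGES.get(param, (float("-inf"), float("inf")))
        if not (low <= value <= high):
            invalid.append(param)
    return invalid

# A blood pH of 13 and a body temperature of 45 C are both flagged.
print(flag_invalid({"blood_pH": 13.0, "body_temp_celsius": 45.0, "age_years": 52}))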
Data enrichment involves drawing from multiple data sources to minimize the limitations of a single
data source. For example, two databases on inherited diseases might each be sparsely populated in
terms of proteins that are associated with particular diseases. This deficit could be addressed by
incorporating data from both databases, assuming only a moderate degree of overlap in the content
of the two databases. Data enrichment may be tied to consistency analysis, so that outliers that
would skew knowledge-discovery results aren't included in the final analysis.
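
A minimal sketch of such a merge in Python, assuming each source maps a disease name to a set of associated proteins (the databases and field names are hypothetical):

def enrich(source_a, source_b):
    """Union two sparse disease-to-protein mappings so that an
    association missing from one source can be supplied by the other."""
    merged = {}
    for source in (source_a, source_b):
        for disease, proteins in source.items():
            merged.setdefault(disease, set()).update(proteins)
    return merged

db_a = {"disease_x": {"protein_1"}, "disease_y": set()}
db_b = {"disease_y": {"protein_2", "protein_3"}}
# disease_y, empty in the first source, is populated from the second.
print(enrich(db_a, db_b))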
Frequency and distribution analysis places weights on values as a function of their frequency of
occurrence. The effect is to maximize the contribution of common findings while minimizing the effect
of rare occurrences on the conclusions made from the data-mining output. For example, a clinical
database of genetic diseases might contain 500 entries for one disease and only 1 entry for another,
based on the number of patients with each disease who were admitted to a given hospital or clinic.
Ignoring the relative frequency of each disease in the database could lead a researcher to conclude
that the odds of a patient's expressing either disease are the same.
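
One simple way to capture relative frequency is to weight each disease by its share of the database entries, sketched here in Python with the hypothetical counts from the example above:

from collections import Counter

entries = ["disease_a"] * 500 + ["disease_b"]
counts = Counter(entries)
total = sum(counts.values())

# Weight each disease by its relative frequency in the database, so
# that common findings dominate and rare ones contribute little.
weights = {disease: n / total for disease, n in counts.items()}
print(weights)   # disease_a ~ 0.998, disease_b ~ 0.002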
The normalization process involves transforming data values from one representation to another,
using a predefined range of final values. For example, qualitative values, such as "high" and "low,"
and quantitative values from multiple sources regarding a particular parameter might be normalized to
a numerical score from 1 to 10.
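
A sketch of such a normalization in Python; the mapping of qualitative labels to scores and the quantitative range are illustrative assumptions:

# Hypothetical correspondence chosen for illustration; the actual
# mapping would be defined for the parameter in question.
QUALITATIVE_SCORES = {"low": 2, "normal": 5, "high": 9}

def normalize(value, lo, hi):
    """Linearly rescale a quantitative value from [lo, hi] onto [1, 10]."""
    return 1 + 9 * (value - lo) / (hi - lo)

print(QUALITATIVE_SCORES["high"])      # 9
print(normalize(120, lo=60, hi=180))   # 5.5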
The major issues in normalization are range, granularity, accuracy, precision, scale, and units. Range
is the difference between the highest and lowest values that are represented, whereas granularity is
the smallest increment by which two represented values can differ.
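
Under these definitions, both quantities can be estimated directly from the data; the following Python sketch takes granularity to be the smallest nonzero gap between sorted observed values, an assumption made for illustration:

def value_range(values):
    """Difference between the highest and lowest represented values."""
    return max(values) - min(values)

def granularity(values):
    """Smallest nonzero gap between successive sorted values."""
    ordered = sorted(set(values))
    return min(b - a for a, b in zip(ordered, ordered[1:]))

temps = [36.5, 37.0, 36.8, 38.2, 37.4]
print(value_range(temps))   # ~1.7 (38.2 - 36.5)
print(granularity(temps))   # ~0.2 (37.0 - 36.8)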