Preparing the data for mining involves several stages:

Data Characterization
Consistency Analysis
Domain Analysis
Data Enrichment
Frequency and Distribution Analysis
Normalization
Missing Value Analysis
Data characterization involves creating a high-level description of the nature and content of the
data to be mined. This stage of the knowledge-discovery process serves primarily the programmers
and other staff involved in a data-mining project: it provides a form of documentation that can be
consulted by those who may not be familiar with the underlying biology represented by the data.
Consistency analysis is the process of determining the variability in the data, independent of the
domain. Consistency analysis is primarily a statistical assessment of data, based solely on data
values. Outliers and values determined to be significantly different from other data may be
automatically excluded from the knowledge-discovery process, based on predefined statistical
constraints. For example, values of a given parameter that lie more than three standard
deviations from the mean might be excluded from the mining operation.
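
A minimal sketch of such a statistical filter in Python; the three-standard-deviation threshold and the sample readings are illustrative assumptions, not part of any particular mining package:

from statistics import mean, stdev

def filter_outliers(values, num_sd=3.0):
    """Keep only values within num_sd standard deviations of the
    mean; the rest are excluded from the mining operation."""
    m = mean(values)
    sd = stdev(values)
    if sd == 0:
        return list(values)
    return [v for v in values if abs(v - m) <= num_sd * sd]

# The reading of 210 lies more than three standard deviations from
# the mean of this sample and is therefore excluded.
readings = [97, 98, 99, 98] * 5 + [210]
print(filter_outliers(readings))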
Domain analysis involves validating the data values in the larger context of the biology. That is,
domain analysis goes beyond simply verifying that a data value is a text string or an integer, or that
it's statistically consistent with other data on the same parameter, to ensure that it makes sense in
the context of the biology. For example, values for physiological parameters can be validated to the
extent that they are within physiologically possible ranges consistent with life. A blood pH of 13, a
body temperature of 45 degrees Celsius, a protein with a molecular weight of 20 milligrams, and a
patient age of 120 would be flagged as invalid values that should be excluded from the knowledge-
discovery process. Domain analysis requires that someone familiar with the biology create the
heuristics that can be applied to the data.
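
The heuristics themselves might be encoded as simple range checks. The following Python sketch uses hypothetical parameter names and plausible ranges chosen for illustration; a domain expert would supply the actual limits:

# Hypothetical ranges compatible with life, chosen for illustration.
PLAUSIBLE_RANGES = {
    "blood_pH": (6.8, 7.8),
    "body_temp_celsius": (25.0, 44.0),
    "age_years": (0, 115),
}

def flag_invalid(record):
    """Return the parameters whose values fall outside the
    physiologically possible range for that parameter."""
    invalid = []
    for param, value in record.items():
        low, high = PLAUSIBLE_RANGES.get(param, (float("-inf"), float("inf")))
        if not (low <= value <= high):
            invalid.append(param)
    return invalid

# A blood pH of 13 and a body temperature of 45 C are both flagged.
print(flag_invalid({"blood_pH": 13.0, "body_temp_celsius": 45.0, "age_years": 52}))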
Data enrichment involves drawing from multiple data sources to minimize the limitations of a single
data source. For example, two databases on inherited diseases might each be sparsely populated in
terms of proteins that are associated with particular diseases. This deficit could be addressed by
incorporating data from both databases, assuming only a moderate degree of overlap in the content
of the two databases. Data enrichment may be tied to consistency analysis, so that outliers that
would skew knowledge-discovery results aren't included in the final analysis.
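
A minimal sketch of such a merge in Python, assuming each source maps a disease name to a set of associated proteins (the databases and field names are hypothetical):

def enrich(source_a, source_b):
    """Union two sparse disease-to-protein mappings so that an
    association missing from one source can be supplied by the other."""
    merged = {}
    for source in (source_a, source_b):
        for disease, proteins in source.items():
            merged.setdefault(disease, set()).update(proteins)
    return merged

db_a = {"disease_x": {"protein_1"}, "disease_y": set()}
db_b = {"disease_y": {"protein_2", "protein_3"}}
# disease_y, empty in the first source, is populated from the second.
print(enrich(db_a, db_b))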
Frequency and distribution analysis places weights on values as a function of their frequency of
occurrence. The effect is to maximize the contribution of common findings while minimizing the effect
of rare occurrences on the conclusions made from the data-mining output. For example, a clinical
database of genetic diseases might contain 500 entries for one disease and only 1 entry for another,
based on the number of patients with each disease who were admitted to a given hospital or clinic.
Ignoring the relative frequency of each disease in the database could lead a researcher to conclude
that the odds of a patient's expressing either disease are the same.
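
One simple way to capture relative frequency is to weight each disease by its share of the database entries, sketched here in Python with the hypothetical counts from the example above:

from collections import Counter

entries = ["disease_a"] * 500 + ["disease_b"]
counts = Counter(entries)
total = sum(counts.values())

# Weight each disease by its relative frequency in the database, so
# that common findings dominate and rare ones contribute little.
weights = {disease: n / total for disease, n in counts.items()}
print(weights)   # disease_a ~ 0.998, disease_b ~ 0.002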
The normalization process involves transforming data values from one representation to another,
using a predefined range of final values. For example, qualitative values, such as "high" and "low,"
and quantitative values from multiple sources regarding a particular parameter might be normalized to
a numerical score from 1 to 10.
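
A sketch of such a normalization in Python; the mapping of qualitative labels to scores and the quantitative range are illustrative assumptions:

# Hypothetical correspondence chosen for illustration; the actual
# mapping would be defined for the parameter in question.
QUALITATIVE_SCORES = {"low": 2, "normal": 5, "high": 9}

def normalize(value, lo, hi):
    """Linearly rescale a quantitative value from [lo, hi] onto [1, 10]."""
    return 1 + 9 * (value - lo) / (hi - lo)

print(QUALITATIVE_SCORES["high"])      # 9
print(normalize(120, lo=60, hi=180))   # 5.5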
The major issues in normalization are range, granularity, accuracy, precision, scale, and units. Range
is the difference between the highest and lowest values that are represented, whereas granularity is
the smallest increment by which two represented values can differ.
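
Under these definitions, both quantities can be estimated directly from the data; the following Python sketch takes granularity to be the smallest nonzero gap between sorted observed values, an assumption made for illustration:

def value_range(values):
    """Difference between the highest and lowest represented values."""
    return max(values) - min(values)

def granularity(values):
    """Smallest nonzero gap between successive sorted values."""
    ordered = sorted(set(values))
    return min(b - a for a, b in zip(ordered, ordered[1:]))

temps = [36.5, 37.0, 36.8, 38.2, 37.4]
print(value_range(temps))   # ~1.7 (38.2 - 36.5)
print(granularity(temps))   # ~0.2 (37.0 - 36.8)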