Biomedical Engineering Reference
Endnote
The technologies and methodologies associated with data mining and knowledge discovery, while
mature in areas such as fraud detection in credit card use, are not yet fully developed for
bioinformatics applications. One issue is that, while fraud can be defined on an intuitive
basis—sudden expenditure for luxury goods, transactions through vendors not frequented in the past,
out-of-state transactions, and the like—much of the nature of genetic material under scrutiny is
unknown.
Because researchers provide the final filtering in the knowledge-discovery process, unfamiliar
concepts (truly new discoveries) are more likely to be attributed to chance clustering than to
some underlying process. What's more, labels such as "junk DNA" influence the amount of time and
energy that a researcher will invest in applying data-mining tools to the non-coding regions of a
genome, steering effort toward areas thought more likely to provide meaningful results. Similarly, for
years scientists took for granted that there were only 20 genetically coded amino acids. When
additional amino acids were discovered, they were first verified by arduous wet-lab work that
took several years. For example, it took scientists over two years to crystallize and
determine the structure of pyrrolysine, the 22nd amino acid. Given the existence of an additional
amino acid, however, searching through a database for occurrences ignored in the past is
comparatively trivial.
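
To illustrate why such a search is comparatively trivial once the new residue is known, here is a minimal Python sketch that scans protein sequences for the one-letter codes U (selenocysteine) and O (pyrrolysine). The function name, the record IDs, and the toy sequences are hypothetical and made up for illustration; a real search would read records from an actual sequence database.

```python
# One-letter codes for the two genetically coded amino acids beyond the
# standard 20 (IUPAC: U = selenocysteine, O = pyrrolysine).
NONSTANDARD = {"U": "selenocysteine", "O": "pyrrolysine"}


def find_nonstandard(records):
    """Return (record_id, position, name) for each nonstandard residue.

    `records` is a hypothetical mapping of record ID to protein sequence.
    Positions are 1-based.
    """
    hits = []
    for record_id, sequence in records.items():
        for position, residue in enumerate(sequence.upper(), start=1):
            if residue in NONSTANDARD:
                hits.append((record_id, position, NONSTANDARD[residue]))
    return hits


# Made-up sequences, for illustration only.
toy_db = {
    "seq1": "MKTAYIAKQR",
    "seq2": "MGLOKHVAAW",
    "seq3": "MSUGCKATLL",
}

hits = find_nonstandard(toy_db)
for hit in hits:
    print(hit)
```

The scan is a single pass over the stored sequences, which is the sense in which the database search is cheap relative to the years of wet-lab work needed to establish the residue in the first place.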
Despite the effects of bias, humans are an indispensable part of the data-mining process. One reason
for their continued inclusion in what would otherwise be an automated process is that current
technologies assume uniform and relatively simple data structures. Very large, complex databases,
replete with multiple potential relationships, present scalability issues that may require significant
computational time on powerful computer systems. In addition, many of the traditional data-mining
methods were developed for homogeneous numerical data. However, bioinformatics databases
increasingly hold text sequences, protein structure, and other data sets that are anything but
homogeneous.
The technical challenges associated with data mining are compounded by the lack of statistical
methods that can adequately assess the significance of figures calculated from very large data
sets. Similarly, because most bioinformatics databases are not static but grow exponentially over
time, the statistical concept of a fixed population from which samples are drawn is violated. As a
result, a statistical analysis of a particular relationship at one point in time may provide a different
result a month or two later. These and similar challenges remain for those in the bioinformatics arena
to solve.
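
The effect of database growth on a significance estimate can be sketched with a deliberately simplified model: the expected number of chance matches to one fixed motif, assuming independent and uniformly distributed residues. Real sequences do not satisfy that assumption, and the function name and database sizes below are hypothetical. The sketch only shows the direction of the effect.

```python
def expected_chance_hits(motif_length, database_residues, alphabet_size=20):
    """Expected number of exact matches to one fixed motif arising by chance.

    Assumes independent, uniformly distributed residues, which is a
    simplification made for illustration only.
    """
    if database_residues < motif_length:
        return 0.0
    start_positions = database_residues - motif_length + 1
    return start_positions * (1.0 / alphabet_size) ** motif_length


# Hypothetical database sizes (total residues) at successive points in time.
sizes = [10**8, 10**9, 10**10]
for n in sizes:
    print(n, expected_chance_hits(6, n))
```

Under this model the expected number of chance hits grows roughly in proportion to the size of the database, so a match that looks unusual against today's database may be unremarkable after the database has grown.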
 
 