Data Mining - Bioinformatics Computing


statistical perspective, is near sequences that tend to be found between introns and exons. However,

even with heuristics, user-directed discovery is inherently limited by the time required to manually

search for new data.

An alternative to manual searching—and one that has had considerable success in the travel,

banking, and telecommunications industries—is to use computer-mediated data mining, the process

of automatically extracting meaningful patterns from usually very large quantities of seemingly

unrelated data. Unlike human-directed exploration of databases, data mining can initiate queries that

aren't limited to the user's fluency in authoring effective database queries. This isn't to say that data

mining reduces the need for the researcher to establish a strategy or to evaluate the results of a data-

mining session. When used in conjunction with the appropriate visualization tools, data mining allows

the researcher to use her highly advanced pattern-recognition skills and knowledge of molecular

biology to determine which results warrant further study. For example, mining the millions of data

points from a series of microarray experiments might reveal several clusters of data, as visualized in

a 3D cluster display. The researcher could then select data belonging to one or more of the clusters

and use a variety of tools to determine the parameters that distinguish it from the other data.
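
The microarray example above can be made concrete with a small sketch. The text does not name a clustering method, so k-means is assumed here purely for illustration. The expression values are invented toy data, and the function names (kmeans, distinguishing_feature) are hypothetical helpers, not part of any particular data-mining package. The sketch groups the profiles into clusters and then reports which measurement most separates one cluster from the remaining data.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iterations=50):
    """Cluster points with Lloyd's k-means algorithm.

    Assumption: the first k points seed the centroids, which keeps the
    sketch deterministic; real tools usually use a randomized seeding.
    """
    centroids = list(points[:k])
    labels = None
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the clusters are stable
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return labels, centroids


def distinguishing_feature(points, labels, cluster):
    """Index of the feature whose mean differs most between one cluster
    and all other points. Assumes both groups are non-empty."""
    inside = [p for p, lab in zip(points, labels) if lab == cluster]
    outside = [p for p, lab in zip(points, labels) if lab != cluster]
    mean_in = [sum(col) / len(inside) for col in zip(*inside)]
    mean_out = [sum(col) / len(outside) for col in zip(*outside)]
    diffs = [abs(a - b) for a, b in zip(mean_in, mean_out)]
    return max(range(len(diffs)), key=diffs.__getitem__)


# Invented toy data: six gene profiles measured under three conditions.
profiles = [
    (0.1, 0.2, 5.0), (0.2, 0.1, 5.2), (0.0, 0.3, 4.9),
    (0.1, 0.2, 0.1), (0.2, 0.1, 0.0), (0.0, 0.3, 0.2),
]
labels, centroids = kmeans(profiles, k=2)
marker = distinguishing_feature(profiles, labels, labels[0])
print(labels, marker)
```

In this toy data the two groups differ only in the third condition, so that is the measurement a researcher would be pointed toward. With real microarray data, the choice of algorithm, distance measure, and number of clusters would all need to be justified.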

Given the ever-increasing store of sequence and protein data from several worldwide genome

projects, data mining the sequences has become a major research focus in bioinformatics. This is in

part because molecular biologists can now conduct basic bioinformatics research from their desktop

workstation, without the overhead of establishing a wet lab. The aim of this chapter is to explore data-

mining techniques as an automated means of reducing the complexity of data in large bioinformatics

databases and of discovering meaningful, useful patterns and relationships in data. The "Methods"

section explores data mining from the perspective of the process of knowledge discovery.

"Technology Overview" reviews the underlying computer infrastructure and algorithms that make

data mining a practical endeavor. "Infrastructure" reviews the hardware and software requirements

of an efficient data-mining operation. "Pattern Recognition and Discovery" explores the basic

pattern-recognition process and how it can be extended to pattern discovery.

The "Machine Learning" section reviews the numerous technologies that can be applied to support

data mining, from neural networks to Hidden Markov Models. "Text Mining" focuses on the

importance of mining the biomedical literature for data on functions to complement the sequence and

structure data mined from nucleotide and protein databases. The "Tools" section introduces some of

the practical general-purpose and bioinformatics-specific tools available for data mining. The "On the

Horizon" section looks at leading-edge data-mining technologies, especially real-time transaction

monitoring, which promises to decrease the infrastructure requirements. The "Endnote" section

explores the long-term role of machine learning versus human-directed data-mining efforts.
