producing a “protein of unknown function”; around 10% of the genome is a complete
mystery. There is therefore considerable interest in using computational methods to
address important problems such as the prediction of protein function and the
identification of possible interactions between proteins.
The field of data mining was originally developed in the 1980s to make the most
of business and marketing data, of the type that is produced every time a credit card
transaction is made, or an electronic order is filled. Data mining is now widely
applied to the investigation of large datasets in many other fields, including
molecular biology. Indeed, it has been stated that “DNA sequencing and data mining have
become almost as central to biology as transcription and translation are to life”
(Yandell and Majoros, 2002).
This review aims to provide the working microbiologist with a guide to what
can be achieved using data mining. It covers the basic principles of data mining,
and describes the data mining life cycle. The algorithms most widely used for data
mining in microbiology are described, together with an indication of the types of
problems to which these algorithms have been applied. However, since data mining
is a very wide field of research, it is not possible in this review to cover all of the
relevant algorithms. Where appropriate, references are provided to more
comprehensive textbooks, but for the most part we have limited citations of the
original literature to recent publications that indicate how data mining is
currently being applied in microbiology.
2 WHAT IS DATA MINING?
A broad definition of data mining is “a set of mechanisms and techniques, realised in
software, to extract hidden information from data” (Coenen, 2011). As Coenen
points out, “hidden” is the key word in this definition; simply retrieving from
GenBank a list of genes meeting a particular criterion is not data mining: the result of
such a query is not information, but data. Data mining involves identifying patterns
within and between data sources, and deducing their meaning. Data mining was
originally based on conventional statistics but, as the field of computational intelligence
(CI) developed over subsequent decades, many CI algorithms were eagerly seized
upon by data miners. Today both statistical and CI approaches are used in a
complementary manner.
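
The distinction between retrieving data and extracting hidden information can be
made concrete with a small sketch. The example below is only an illustration under
stated assumptions: the genome and gene names are invented toy data, the function
names are hypothetical, and counting co-occurring gene pairs is just one very simple
form of pattern discovery, not a method taken from the works cited here.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data (invented for illustration): genome -> genes present.
genomes = {
    "genome_A": {"geneX", "geneY", "geneZ"},
    "genome_B": {"geneX", "geneY"},
    "genome_C": {"geneZ"},
    "genome_D": {"geneX", "geneY", "geneW"},
}


def query_genomes_with(gene):
    """Plain retrieval: return the genomes in which a named gene is recorded.

    The answer was stored explicitly, so the result is data, not new information.
    """
    return sorted(name for name, genes in genomes.items() if gene in genes)


def co_occurring_pairs(min_support):
    """Simple pattern discovery: gene pairs found together in at least
    `min_support` genomes. No single record states this relationship."""
    counts = Counter()
    for genes in genomes.values():
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(genes), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


print(query_genomes_with("geneX"))
print(co_occurring_pairs(min_support=3))
```

With this toy data the query simply lists the genomes that contain geneX, whereas
the second function reports that geneX and geneY occur together in three genomes,
a relationship that appears in no individual record and that would still need
biological interpretation.
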
Data mining algorithms were developed in parallel in several distinct fields, and
consequently there is some confusion in terminology. Some forms of data mining are
also known as Knowledge Discovery in Databases [2] or pattern recognition
(Kennedy, 1997), and there is considerable overlap between the fields of CI,
Machine Learning and Artificial Intelligence.

[2] http://www.igi-global.com/journal/international-journal-knowledge-discovery-bioinformatics/1143