Data mining for microbiologists - Methods in Microbiology

producing a “protein of unknown function”; around 10% of the genome is a complete mystery. There is therefore considerable interest in using computational methods to address important problems such as the prediction of protein function and the identification of possible interactions between proteins.

The field of data mining was originally developed in the 1980s to make the most of business and marketing data, of the type that is produced every time a credit card transaction is made or an electronic order is filled. Data mining is now widely applied to the investigation of large datasets in many other fields, including molecular biology. Indeed, it has been stated that “DNA sequencing and data mining have become almost as central to biology as transcription and translation are to life” (Yandell and Majoros, 2002).

This review aims to provide the working microbiologist with a guide to what can be achieved using data mining. It covers the basic principles of data mining and describes the data mining life cycle. The algorithms most widely used for data mining in microbiology are described, together with an indication of the types of problems to which these algorithms have been applied. However, since data mining is a very wide field of research, it is not possible in this review to cover all of the relevant algorithms. Where appropriate, references are provided to more comprehensive textbooks, but for the most part we have limited citations of the original literature to recent publications that indicate how data mining is currently being applied in microbiology.

2 WHAT IS DATA MINING?

A broad definition of data mining is “a set of mechanisms and techniques, realised in software, to extract hidden information from data” (Coenen, 2011). As Coenen points out, “hidden” is the key word in this definition; simply retrieving from GenBank a list of genes meeting a particular criterion is not data mining: the result of such a query is not information, but data. Data mining involves identifying patterns within and between data sources, and deducing their meaning. Data mining was originally based on conventional statistics but, as the field of computational intelligence (CI) developed over subsequent decades, many CI algorithms were eagerly seized upon by data miners. Today both statistical and CI approaches are used in a complementary manner.
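The distinction drawn above between retrieving data and extracting hidden information can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the genome and gene names are invented, the helper functions `query_genomes_with` and `co_occurring_pairs` are hypothetical, and the Jaccard index is just one simple way of scoring co-occurrence, not a method taken from this review. The first function is a plain query whose result is still data; the second looks for a pattern across records that no single record states.

```python
from itertools import combinations

# Hypothetical presence/absence data: which toy genomes carry which genes.
# All names and values are invented for illustration.
GENOMES = {
    "genome_A": {"geneX", "geneY", "geneZ"},
    "genome_B": {"geneX", "geneY"},
    "genome_C": {"geneZ", "geneW"},
    "genome_D": {"geneX", "geneY", "geneW"},
}


def query_genomes_with(gene):
    """Plain retrieval: list the genomes that carry a given gene."""
    return sorted(name for name, genes in GENOMES.items() if gene in genes)


def co_occurring_pairs(min_jaccard=1.0):
    """Toy pattern search: gene pairs whose presence profiles overlap strongly.

    The Jaccard index is the number of genomes carrying both genes divided by
    the number of genomes carrying at least one of them.
    """
    all_genes = sorted(set().union(*GENOMES.values()))
    # Presence profile of each gene: the set of genomes in which it occurs.
    profiles = {
        g: {name for name, genes in GENOMES.items() if g in genes}
        for g in all_genes
    }
    pairs = []
    for a, b in combinations(all_genes, 2):
        union = profiles[a] | profiles[b]
        score = len(profiles[a] & profiles[b]) / len(union)
        if score >= min_jaccard:
            pairs.append((a, b, score))
    return pairs


print(query_genomes_with("geneX"))  # retrieval: the answer is still data
print(co_occurring_pairs())         # pattern: genes that always occur together
```

In this toy dataset the query simply lists genomes, whereas the pattern search reports that `geneX` and `geneY` always occur together, which is the kind of relationship that might prompt a hypothesis about linked function. Real analyses would need far larger datasets and appropriate statistical controls.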

Data mining algorithms were developed in parallel in several distinct fields, and consequently there is some confusion in terminology. Some forms of data mining are also known as Knowledge Discovery in Databases or pattern recognition (Kennedy, 1997), and there is considerable overlap between the fields of CI, Machine Learning and Artificial Intelligence.
