Data mining for microbiologists - Methods in Microbiology


Unlike most areas of microbiology research, data mining is not necessarily hypothesis driven, although hypothesis-driven algorithms, notably statistical approaches, are used. The aim of the data mining process is to find existing, but previously unidentified, patterns and interactions within large datasets, a process sometimes referred to as a “fishing expedition” or, more formally, data-driven research.

Data-driven research must be undertaken with care. In large datasets, apparently valid correlations that appear to make biological sense can arise by chance, particularly when the datasets are analysed repeatedly. A p-value of 0.001 is often regarded as indicating statistical significance. However, all that this number means is that, if no real effect exists, a result at least as extreme as the one observed would be expected to arise by chance about once in 1000 trials. Data mining often involves thousands of tests performed on large, noisy datasets, so a handful of such chance results is to be expected in any one analysis. Careful human scrutiny and analysis of the results is essential.
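The arithmetic behind this warning can be sketched in a few lines of Python. The example below is illustrative only: the function names, the figure of 10,000 tests and the use of a Bonferroni correction are assumptions made for the sketch, not methods prescribed by this review. It relies on the fact that, when no real effect exists, p-values are uniformly distributed between 0 and 1.

```python
import random


def expected_false_positives(n_tests: int, alpha: float) -> float:
    """Expected number of tests that pass the threshold when no real effect exists."""
    return n_tests * alpha


def bonferroni_threshold(n_tests: int, alpha: float) -> float:
    """Per-test threshold that keeps the family-wise error rate at or below alpha."""
    return alpha / n_tests


def simulate_null_hits(n_tests: int, alpha: float, seed: int = 0) -> int:
    """Count chance 'discoveries' among n_tests tests run on pure noise.

    Each simulated p-value is drawn uniformly from [0, 1), which is how
    p-values behave when the null hypothesis is true.
    """
    rng = random.Random(seed)
    return sum(rng.random() < alpha for _ in range(n_tests))


if __name__ == "__main__":
    n_tests, alpha = 10_000, 0.001  # hypothetical screen of 10,000 features
    print("expected chance hits:", expected_false_positives(n_tests, alpha))
    print("simulated chance hits:", simulate_null_hits(n_tests, alpha))
    print("Bonferroni per-test threshold:", bonferroni_threshold(n_tests, alpha))
```

With 10,000 tests at a threshold of 0.001, roughly ten "significant" results are expected from noise alone. A Bonferroni correction is one conservative response; less stringent procedures that control the false discovery rate are a common alternative when many tests are performed.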

There are literally hundreds of algorithms that can be, and have been, used for data mining in thousands of studies; a 2009 bibliometric study of the Web of Knowledge database identified nearly 10,000 journal articles on data mining published between 1962 and 2008 (Shuang et al., 2009). Amazon³ currently lists over 20,000 topics about data mining (not all of which will be relevant!). In the interest of not adding to this number, this review is limited to describing the algorithms most commonly used with microbiological data, with an indication of the ways in which the algorithms have been applied. Where possible, pointers are provided to freely available software implementing the various algorithms. Although there are also numerous commercial data mining products available, it can be valuable for interested researchers to have access to software that can be installed and investigated without a significant impact upon the budget of the relevant grant.

Algorithms suitable for data mining can be organised into a conceptual hierarchy (Figure 2.1).

Although, in the following sections, each algorithm is discussed individually, it is worth noting that many projects involve the use of multiple data mining algorithms, either sequentially or in parallel. The sequential application of algorithms results in the generation of workflows in which the outputs of one analysis become the inputs to the next. For example, data integration can be used to construct a network of interactions between proteins; the network can then be subjected to a clustering algorithm in order to identify functional modules; a Gene Ontology (GO; Ashburner et al., 2000) over-representation analysis can be performed for each cluster; and a protein function inference algorithm can then be applied to the cluster members to predict the potential functions of un-annotated proteins. The application of different algorithms in parallel to the same data, each designed to perform the same task, can provide additional insights into datasets. A Stepwise Linear Discriminant Analysis, for example, produces a classifier and a ranked list of the most informative features contributing to the classifier. A Decision Tree does the same. Application of both algorithms to the same dataset, and comparison of the results, can confirm (or not!) the importance of specific features.
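One simple way to compare the feature rankings produced by two such algorithms is to look at how far their top-ranked features overlap. The Python sketch below assumes each algorithm has already produced a list of feature names ordered from most to least informative; the function names and the gene identifiers are hypothetical and serve only as an illustration.

```python
def shared_top_features(ranking_a, ranking_b, k):
    """Return the features found in the top k of both rankings, in ranking_a's order."""
    top_b = set(ranking_b[:k])
    return [feature for feature in ranking_a[:k] if feature in top_b]


def jaccard_top_k(ranking_a, ranking_b, k):
    """Jaccard similarity (intersection over union) of the two top-k feature sets."""
    top_a, top_b = set(ranking_a[:k]), set(ranking_b[:k])
    union = top_a | top_b
    return len(top_a & top_b) / len(union) if union else 0.0


if __name__ == "__main__":
    # Hypothetical rankings, most informative feature first.
    lda_ranking = ["gene_A", "gene_C", "gene_B", "gene_F", "gene_D"]
    tree_ranking = ["gene_C", "gene_A", "gene_E", "gene_B", "gene_G"]

    print(shared_top_features(lda_ranking, tree_ranking, k=3))  # ['gene_A', 'gene_C']
    print(jaccard_top_k(lda_ranking, tree_ranking, k=3))        # 0.5
```

Features that rank highly under both algorithms are stronger candidates for follow-up than those favoured by only one, although agreement between two methods is supporting evidence rather than proof of biological importance.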

³ http://www.amazon.co.uk/
