Unlike most areas of microbiology research, data mining is not necessarily
hypothesis driven, although hypothesis-driven algorithms, notably statistical
approaches, are used. The aim of the data mining process is to find existing, but
previously unidentified, patterns and interactions within large datasets, a process
sometimes referred to as a “fishing expedition”, or, more formally, data-driven research.
Data-driven research must be undertaken with care. In large datasets apparently valid
correlations, which appear to make biological sense, can arise by chance, particularly
when large datasets are repeatedly analysed. A p-value of 0.001 is often regarded as
indicating statistical significance. However, all that this number means is that, if no
real effect exists, a result at least as extreme as the one observed would be expected
to arise by chance about once in 1000 tests. Data mining often involves thousands of
tests performed on large, noisy datasets, so some apparently significant results are to
be expected by chance alone. Careful human scrutiny and analysis of the results is
essential.
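The arithmetic behind this warning can be sketched in a few lines. The example below is illustrative only: the function names are hypothetical, and it assumes that the tests are independent and that no real effect exists in any of them, which is rarely exactly true of real datasets. The Bonferroni threshold shown is one common, conservative correction, not one prescribed by this review.

```python
# Illustrative sketch: assumes independent tests and that every null
# hypothesis is true (i.e. there is no real effect anywhere in the data).

def expected_false_positives(n_tests: int, alpha: float) -> float:
    """Expected number of tests reaching p <= alpha purely by chance."""
    return n_tests * alpha


def prob_any_false_positive(n_tests: int, alpha: float) -> float:
    """Probability that at least one of n_tests reaches p <= alpha by chance."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(n_tests: int, family_alpha: float) -> float:
    """Per-test threshold that keeps the family-wise error rate at or below family_alpha."""
    return family_alpha / n_tests


if __name__ == "__main__":
    alpha = 0.001
    for n_tests in (1, 1000, 10000):
        print(
            n_tests,
            expected_false_positives(n_tests, alpha),
            round(prob_any_false_positive(n_tests, alpha), 3),
            bonferroni_threshold(n_tests, 0.05),
        )
```

Under these assumptions, a threshold of 0.001 applied to 10,000 tests would be expected to yield about ten chance "hits", which is why uncorrected p-values from a mining run need to be treated with caution.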
There are literally hundreds of algorithms which can and have been used for data
mining in thousands of studies; a 2009 bibliometric study of the Web of Knowledge
database identified nearly 10,000 journal articles on data mining, published between
1962 and 2008 (Shuang et al., 2009). Amazon 3 currently lists over 20,000 topics
about data mining (not all of which will be relevant!). In the interest of not adding
to this number, this review is limited to describing the algorithms most commonly
used with microbiological data, with an indication of the ways in which the
algorithms have been applied. Where possible, pointers are provided to freely available
software implementing the various algorithms. Although there are also numerous
commercial data mining products available, it can be valuable for interested
researchers to have access to software that can be installed and investigated without
a significant impact upon the budget of the relevant grant.
Algorithms suitable for data mining can be organised into a conceptual hierarchy
(Figure 2.1).
Although, in the following sections, each algorithm is discussed individually, it is
worth noting that many projects involve the use of multiple data mining algorithms,
either sequentially or in parallel. The sequential application of algorithms results in
the generation of workflows in which the outputs of one analysis become the inputs
to the next. For example, data integration can be used to construct a network of
interactions between proteins; the network can then be subjected to a clustering algorithm
in order to identify functional modules; a Gene Ontology (GO; Ashburner et al., 2000)
over-representation analysis can be performed for each cluster; and a protein function
inference algorithm can then be applied to the cluster members to predict the potential
functions of un-annotated proteins. The application of different algorithms, in parallel
to the same data and designed to perform the same task, can provide additional insights
into datasets. A Stepwise Linear Discriminant Analysis, for example, produces a classifier
and a ranked list of the most informative features contributing to the classifier. A Decision
Tree does the same. Application of both algorithms to the same dataset, and comparison
of the results, can confirm (or not!) the importance of specific features.
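As an illustration of one step in the workflow example above, the over-representation analysis performed for each cluster is commonly framed as a hypergeometric test: given how many proteins in the whole network carry a GO term, how surprising is the number that carry it within one cluster? The sketch below is a minimal version under that assumption; the function name and the numbers in the usage lines are hypothetical, and the text does not specify which test or software any particular study used.

```python
from math import comb


def over_representation_p(population: int, annotated: int,
                          cluster_size: int, hits: int) -> float:
    """Upper-tail hypergeometric probability P(X >= hits).

    population   -- number of proteins in the whole network
    annotated    -- number of those proteins carrying the GO term
    cluster_size -- number of proteins in the cluster being tested
    hits         -- number of cluster members carrying the GO term
    """
    total = comb(population, cluster_size)
    max_hits = min(annotated, cluster_size)
    favourable = sum(
        comb(annotated, k) * comb(population - annotated, cluster_size - k)
        for k in range(hits, max_hits + 1)
    )
    return favourable / total


if __name__ == "__main__":
    # Hypothetical numbers: 4000 proteins, 80 carry the term,
    # and a cluster of 50 proteins contains 6 of them.
    p = over_representation_p(4000, 80, 50, 6)
    print(f"p = {p:.3g}")
```

Because such a test is usually repeated for every cluster and every GO term, the resulting p-values would normally need a multiple-testing correction before any cluster is reported as enriched.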
3 http://www.amazon.co.uk/