Data mining for microbiologists - Methods in Microbiology


Once the time-consuming data pre-processing step is complete, the researcher can turn to the more productive task of carrying out the analysis. The choice of algorithms to be used depends upon both the nature of the data and the question to be answered. It is often valuable to use several different algorithms, since individual algorithms have different strengths and weaknesses, and together can provide complementary information.

4 INFERENTIAL TECHNIQUES

4.1 Basic statistical analysis

Basic statistical approaches play two roles in the data mining process. As discussed above, the calculation of basic statistics such as the mean, median and range for each variable provides a good initial understanding of the data, and helps to identify outliers and other unexpected values. In addition, where established statistical techniques are applicable to the analysis of a particular dataset, they should be tried before applying more heuristically based approaches. Statistical approaches are based on long-established and well-accepted principles, and provide measures of the probability that a particular observation would arise by chance. However, standard statistical approaches are not always applicable to large, noisy biological datasets. Many techniques assume that the data have a particular distribution (often a normal, or Gaussian, distribution) and may fail if the data are not distributed in this way. Some techniques have problems with missing or incomplete data. Finally, it is sometimes simply not feasible to perform a statistical test on a very large dataset, and heuristic approaches are then necessary.
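The initial look at each variable described above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a prescribed method: the function names `summarise` and `flag_outliers`, the example values, the use of `None` to mark a missing measurement, and the choice of Tukey's 1.5 x IQR rule for flagging outliers are all assumptions made for the example.

```python
import statistics


def summarise(values):
    """Return basic statistics for one variable, skipping missing values (None)."""
    present = [v for v in values if v is not None]
    return {
        "n": len(present),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "min": min(present),
        "max": max(present),
        "range": max(present) - min(present),
    }


def flag_outliers(values, k=1.5):
    """Flag values more than k * IQR beyond the quartiles.

    Tukey's rule is an assumed choice here; it does not require the data
    to be normally distributed, but other rules are equally reasonable.
    """
    present = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in present if v < low or v > high]


# Hypothetical log ratios for one gene, with one missing measurement.
log_ratios = [0.1, -0.2, 0.3, 0.0, None, 0.2, -0.1, 5.0]

summary = summarise(log_ratios)
outliers = flag_outliers(log_ratios)

print(summary)
print("possible outliers:", outliers)
```

A flagged value is only a candidate for closer inspection. Whether it reflects a measurement error or a genuine biological effect has to be decided from knowledge of the experiment.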

In this section we briefly discuss some of the most widely applicable basic statistical approaches to data mining, without going into technical details. There are many excellent textbooks on statistics for the biological sciences (Paulson, 2008).

In addition to assumptions about the distribution of the data, a major consideration when using statistical analysis is the type of data input and output by different techniques. Data may be:

Continuous: real-valued numbers, such as the ratios produced by DNA microarrays;

Categorical: each number represents a different category. For example, the localisation of a protein may be coded as 0, nuclear; 1, cytoplasmic; 2, secreted. Categorical data may be recoded as a set of dichotomous variables (variables that take on one of only two possible values, also known as "dummies") for input to some algorithms. The protein localisation data could be recoded as three variables, each of which takes the value 0 (no) or 1 (yes): nuclear, cytoplasmic and secreted;

Ordinal: each number represents a category, but the categories have an inherent order. For example, the days on which data are collected for a time-series analysis may be coded as 0, Monday; 1, Tuesday; 2, Wednesday.
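The recoding of categorical data into dummy variables, and the difference between categorical and ordinal codes, can be illustrated with a short Python sketch. The helper `dummy_code`, the dictionary names and the example proteins are hypothetical. Only the category codes themselves come from the examples above.

```python
# Category codes taken from the protein localisation example.
LOCALISATION_CODES = {0: "nuclear", 1: "cytoplasmic", 2: "secreted"}

# Ordinal codes from the time-series example: the numeric order is meaningful.
DAY_CODES = {"Monday": 0, "Tuesday": 1, "Wednesday": 2}


def dummy_code(codes, categories):
    """Recode integer category codes as one 0/1 variable per category."""
    return [
        {name: int(code == key) for key, name in categories.items()}
        for code in codes
    ]


# Hypothetical localisation codes for four proteins.
proteins = [0, 2, 1, 0]
dummies = dummy_code(proteins, LOCALISATION_CODES)
for row in dummies:
    print(row)

# For ordinal data, sorting by code recovers the inherent order.
# Sorting categorical codes in the same way would carry no meaning.
samples = ["Wednesday", "Monday", "Tuesday"]
ordered = sorted(samples, key=DAY_CODES.get)
print(ordered)
```

Each protein has exactly one dummy variable set to 1, so the three columns together carry the same information as the original single code without implying that "secreted" is in any sense greater than "nuclear".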

Different statistical techniques consume and produce different types of data (Table 2.1).
