Once the time-consuming data pre-processing step is complete, the researcher can
turn to the more productive task of carrying out the analysis. The choice of
algorithms to be used depends upon both the nature of the data and the question to
be answered. It is often valuable to use several different algorithms, since
individual algorithms have different strengths and weaknesses, and together can
provide complementary information.
4 INFERENTIAL TECHNIQUES
4.1 Basic statistical analysis
Basic statistical approaches play two roles in the data mining process. As discussed
above, calculating basic statistics such as the mean, median and range for each
variable provides a good initial understanding of the data and helps to identify
outliers and other unexpected values. In addition, where established statistical
techniques are applicable to the analysis of a particular dataset, they should be
tried before applying more heuristically based approaches. Statistical approaches
are based on long-established and well-accepted principles, and provide measures of
the probability that a particular observation would arise by chance. However,
standard statistical approaches are not always applicable to large, noisy biological
datasets. Many techniques assume that the data has a particular distribution (often
a normal, or Gaussian, distribution) and may fail if this assumption does not hold.
Some techniques have problems with missing or incomplete data. And sometimes it
is simply not feasible to perform a statistical test on a very large dataset, so
heuristic approaches are sometimes necessary.
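The first role described above, summarising each variable and spotting unexpected values, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `summarise` and `flag_outliers` and the sample values are hypothetical, missing values are assumed to be represented as `None`, and Tukey's 1.5 x IQR rule is one common choice for flagging outliers rather than a method prescribed by the text.

```python
import statistics

def summarise(values):
    """Return basic statistics for one variable, skipping missing values (None)."""
    present = [v for v in values if v is not None]
    return {
        "n": len(present),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "range": max(present) - min(present),
    }

def flag_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR].

    Tukey's rule is an assumption made for this sketch; other
    definitions of an outlier are equally reasonable.
    """
    present = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(present, n=4)  # quartiles
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in present if v < low or v > high]

# Hypothetical log-ratios for one variable, with one missing value.
log_ratios = [0.1, -0.2, 0.05, None, 0.3, -0.1, 4.8, 0.0]

summary = summarise(log_ratios)
outliers = flag_outliers(log_ratios)
print(summary)
print("possible outliers:", outliers)
```

Note how the single extreme value pulls the mean well away from the median; comparing the two is a quick, informal hint that the data may not be normally distributed.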
In this section we briefly discuss some of the most widely applicable basic
statistical approaches to data mining, without going into technical details. There
are many excellent textbooks on statistics for the biological sciences (Paulson, 2008).
In addition to assumptions about the distribution of the data, a major
consideration when using statistical analysis is the type of data input and output
by different techniques. Data may be:
• Continuous: real-valued numbers, such as the ratios produced by DNA
microarrays;
• Categorical: each number represents a different category. For example, a protein
may be coded as 0, nuclear; 1, cytoplasmic; 2, secreted. Categorical data may be
recoded as a set of dichotomous variables, or “dummies” (variables that take on
one of only two possible values), for input to some algorithms. The protein
localisation data could be recoded as three variables, each of which takes the
value 0 (no) or 1 (yes): nuclear, cytoplasmic and secreted;
• Ordinal: each number represents a category, but the categories have an inherent
order. For example, the days on which data is collected for a time-series
analysis may be coded as 0, Monday; 1, Tuesday; 2, Wednesday.
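The recoding of categorical data into dummy variables described in the list above can be sketched as follows. The mapping reuses the protein localisation codes from the text; the function name `to_dummies` and the sample list of proteins are hypothetical and only serve the illustration.

```python
# Category codes taken from the protein localisation example in the text.
LOCALISATION = {0: "nuclear", 1: "cytoplasmic", 2: "secreted"}

def to_dummies(codes, categories=LOCALISATION):
    """Recode integer category codes as one 0/1 variable per category."""
    return [
        {name: int(code == key) for key, name in categories.items()}
        for code in codes
    ]

# Hypothetical localisation codes for four proteins.
proteins = [0, 2, 1, 0]
dummies = to_dummies(proteins)

for code, row in zip(proteins, dummies):
    print(code, row)
```

Each protein ends up with exactly one variable set to 1. The same recoding is usually not wanted for ordinal data, because it discards the ordering that the integer codes carry.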
Different statistical techniques consume and produce different types of data (Table 2.1).