Similarity measures are first computed between observations, and between clusters once observations begin to be grouped into clusters. Several metrics, such as Euclidean and Manhattan distance, correlation, or mutual information, can be used to compute similarity. Additionally, several merging strategies that lead to different clustering patterns are possible. Clustering results are therefore somewhat subjective, as they greatly depend on the user's choices. Traditional cluster analysis is usually performed to group either observations or variables separately, but simultaneous co-clustering (or biclustering) of the rows and the columns of the data matrix also constitutes a suitable alternative in the search for biomarkers.77

As it uses a hierarchical configuration, a tree called a dendrogram, to structure the data, hierarchical cluster analysis (HCA) is an intuitive way to perform data clustering when the number of clusters is unknown a priori. Each leaf corresponds to an observation and the branching reflects the relations between clusters. Two distinct algorithms can be applied, agglomerative (grouping observations) or divisive (dividing the data set), but in practice the agglomerative approach is more widely used. In this case, a linkage function defines the criteria for evaluating distances between observations and clusters. At each iteration, the closest objects are grouped to form a new cluster.

Alternatives to HCA often necessitate defining the number of clusters a priori. The K-means algorithm (or K-medoids, depending on the statistic applied) is an iterative method that starts with k randomly chosen cluster centers. All observations are then assigned to the closest cluster center, and new centers are computed as the mean of the observations of a given cluster. The observations are regrouped with respect to the new centers iteratively until convergence; that is, until no difference occurs in the next iteration.78 The fuzzy c-means algorithm was introduced to allow the association of an observation with more than one cluster, with a probability of belonging to each cluster.79

Regression and Classification with Supervised Methods

Unlike the aforementioned approaches, supervised learning takes advantage of prior information for the analysis of a set of observations. An outcome, the response, can be observed or measured, and the modeling process aims at its prediction. This response can be quantitative in the case of regression or qualitative in the context of classification. A training set is used to build a model, encapsulating general hypotheses, that depicts the relations between a set of measured independent variables X and one or more dependent responses Y. Several techniques have been developed for that purpose, originating from statistical, chemometric, or machine learning backgrounds. Outputs of some classical unsupervised and supervised modeling methods are shown in Figure 5.
FIGURE 5 Typical data modeling outputs.
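The similarity metrics and the agglomerative procedure described above can be sketched in plain Python. This is a minimal illustration under our own assumptions (single linkage, small 2-D points, and the function names `euclidean`, `manhattan`, and `agglomerate` are ours), not a reference implementation; in practice a library routine would be used.

```python
def euclidean(p, q):
    # Euclidean distance between two points given as coordinate tuples.
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def manhattan(p, q):
    # Manhattan (city-block) distance: sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(p, q))

def single_linkage(c1, c2, metric):
    # One possible linkage function: distance between the two closest members.
    return min(metric(p, q) for p in c1 for q in c2)

def agglomerate(points, k, metric=euclidean, linkage=single_linkage):
    """Agglomerative clustering sketch: start with one cluster per
    observation and repeatedly merge the two closest clusters
    (according to the linkage function) until only k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]], metric),
        )
        clusters[i] += clusters.pop(j)
    return clusters
```

Swapping `metric` or `linkage` changes the merging strategy and hence the clustering pattern, which is the source of the subjectivity noted above. For example, `agglomerate([(0, 0), (0, 1), (5, 5), (5, 6)], 2)` merges the two nearby pairs into two clusters.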
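The K-means iteration described above (assign to the closest center, recompute centers as cluster means, stop when nothing changes) can likewise be sketched in a few lines. This is a hedged, one-dimensional toy version: the text specifies randomly chosen initial centers, but here they are passed in explicitly so the example is reproducible, and the function name `kmeans` is our own.

```python
def kmeans(points, centers, max_iter=100):
    """Minimal 1-D K-means sketch: assign each observation to the
    closest center, recompute each center as the mean of its cluster,
    and repeat until the assignments no longer change (convergence)."""
    centers = list(centers)  # work on a copy of the initial centers
    assignment = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest center (squared distance).
        new_assignment = [
            min(range(len(centers)), key=lambda j: (p - centers[j]) ** 2)
            for p in points
        ]
        if new_assignment == assignment:
            break  # no difference in the next iteration: converged
        assignment = new_assignment
        # Update step: each center becomes the mean of its cluster.
        for j in range(len(centers)):
            cluster = [p for p, a in zip(points, assignment) if a == j]
            if cluster:
                centers[j] = sum(cluster) / len(cluster)
    return assignment, centers
```

On two well-separated groups, e.g. `kmeans([1.0, 1.2, 0.8, 5.0, 5.2, 4.8], [0.0, 8.0])`, the assignments stabilize after one update and the centers converge to the cluster means (about 1.0 and 5.0). Fuzzy c-means differs only in replacing the hard assignment step with a membership probability per cluster.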
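To make the supervised setting concrete, a classifier can be fit on a training set of independent variables X with a qualitative response y, then used to predict the response for new observations. The section names no particular method, so the nearest-centroid rule below is purely our own illustrative choice (as are the names `train_nearest_centroid` and `predict` and the "control"/"case" labels).

```python
def train_nearest_centroid(X, y):
    """Training step: compute one centroid per class label from the
    training set (X holds the observations, y the known responses)."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = tuple(sum(col) / len(rows) for col in zip(*rows))
    return centroids

def predict(centroids, x):
    """Prediction step: the qualitative response is the class whose
    centroid is closest (squared Euclidean distance) to observation x."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda lab: dist(centroids[lab]))
```

For a regression task, the same train/predict split applies but the response is quantitative, e.g. predicted as a weighted average instead of a class label.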