in multivariate regression analysis. To determine the best model, models are fitted
with all possible combinations of the explanatory variables. For each fitted model,
the BIC is calculated as:
\[
\mathrm{BIC} = n \ln\!\left(\frac{\mathrm{RSS}}{n}\right) + k \ln(n) \tag{5.27}
\]
where RSS is the residual sum of squares of the fitted model, n is the number of
observations in the training dataset, and k is the number of explanatory variables
in the model. The model with the smallest BIC is selected as the best model.
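The selection procedure described above can be sketched in a few lines of
Python. This is only a minimal illustration under stated assumptions: the
synthetic data, the variable names x1, x2, x3, and the helper functions
(solve_linear_system, residual_sum_of_squares, bic, best_subset_by_bic) are
hypothetical and do not come from the text. An intercept is fitted in every
model but is not counted in k, matching the definition of k as the number of
explanatory variables.

```python
import itertools
import math
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations update both sides together.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def residual_sum_of_squares(columns, y):
    """Least-squares fit of y on an intercept plus `columns`; return the RSS."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(design[0])
    # Normal equations: (X^T X) beta = X^T y.
    xtx = [[sum(design[i][r] * design[i][c] for i in range(n))
            for c in range(p)] for r in range(p)]
    xty = [sum(design[i][r] * y[i] for i in range(n)) for r in range(p)]
    beta = solve_linear_system(xtx, xty)
    return sum((y[i] - sum(b * v for b, v in zip(beta, design[i]))) ** 2
               for i in range(n))


def bic(rss, n, k):
    """BIC as in Eq. 5.27: n ln(RSS / n) + k ln(n)."""
    return n * math.log(rss / n) + k * math.log(n)


def best_subset_by_bic(x_columns, y):
    """Fit every combination of explanatory variables; keep the smallest BIC."""
    n = len(y)
    best = None
    for k in range(len(x_columns) + 1):
        for subset in itertools.combinations(x_columns, k):
            rss = residual_sum_of_squares([x_columns[name] for name in subset], y)
            score = bic(rss, n, k)
            if best is None or score < best[0]:
                best = (score, subset)
    return best


# Synthetic data (an assumption for illustration): y depends on x1 and x2 only.
rng = random.Random(0)
n_obs = 200
x_columns = {name: [rng.gauss(0, 1) for _ in range(n_obs)]
             for name in ("x1", "x2", "x3")}
y = [2 + 3 * a - 2 * b + rng.gauss(0, 0.5)
     for a, b in zip(x_columns["x1"], x_columns["x2"])]

best_score, best_subset = best_subset_by_bic(x_columns, y)
print("selected variables:", best_subset, "BIC:", round(best_score, 2))
```

With p candidate variables the search fits 2^p models, so an exhaustive sketch
like this one is practical only when p is small.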
The model-based method for determining the number of clusters resembles the BIC
method for choosing the best model in three respects. First, Eq. 5.25 has a form
very similar to that of Eq. 5.27. Second, the second component on the right-hand
side of Eq. 5.25 is a penalty that discourages too many clusters; likewise, in
Eq. 5.27, k ln ( n ) is a penalty that discourages too many explanatory
variables, which would lead to overfitting the data. Third, both methods select
by optimizing a single criterion, although in opposite directions: with BIC, we
choose the model with the smallest BIC, whereas in the model-based method we
choose the number of clusters corresponding to the largest adjusted
log-likelihood value.
The disadvantage of the model-based method is that one has to assume the
form of the underlying probability density function before applying the method
to determine the number of clusters. Usually, users assume multivariate normal
distributions, which implies that all clusters are convex. The convexity of
clusters is difficult to justify, especially when the dimension of the dataset
is high. Figure 5.9 illustrates this disadvantage with a two-dimensional dataset
that clearly has three clusters, one of which is non-convex. The model-based
method, under the assumption of bivariate normal distributions, detects 6
clusters instead of the true value of 3.
5.4.2. Scale-based Method
The determination of the number of clusters depends not only on the definition
of clusters, as stated in the introduction of this section, but also on the
resolution level at which we view the clusters. Figure 5.10 shows a case where
different resolutions may lead to different numbers of clusters. In Fig. 5.10,
if we use a high resolution, we can conclude that there are 3 clusters, as in
Fig. 5.10(a). In contrast, if we use a low resolution, we can conclude that
there are 2 clusters; see Fig. 5.10(b).
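The effect of resolution on the cluster count can be illustrated with a toy
example. The sketch below is not the scale-based method itself, whose details
are not given at this point in the text. It is a simple stand-in that links any
two points closer than a chosen distance (the hypothetical parameter scale) and
counts the resulting connected groups. The points and the function name
count_clusters are assumptions made for illustration. A small scale plays the
role of a high resolution, and a large scale that of a low resolution.

```python
import math


def count_clusters(points, scale):
    """Count groups formed by linking every pair of points within `scale`."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= scale:
                parent[find(i)] = find(j)  # merge the two groups

    return len({find(i) for i in range(len(points))})


# Three tight groups; the first two lie much closer to each other than to the third.
points = [
    (0.0, 0.0), (0.5, 0.2), (0.3, 0.6),
    (3.0, 0.0), (3.4, 0.3), (3.2, 0.7),
    (10.0, 0.0), (10.5, 0.4), (10.2, 0.8),
]

print(count_clusters(points, scale=1.0))  # fine scale (high resolution)
print(count_clusters(points, scale=3.0))  # coarse scale (low resolution)
```

At the fine scale the three groups stay separate, while at the coarse scale the
two neighbouring groups merge and only two clusters remain.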
Readers should note that the difference in the number of clusters obtained at
different resolutions is also subject to the definition of clusters. As shown in
Fig. 5.10(b), if we use a low resolution, we get two clusters. Points in cluster 1