Statistical Clustering Analysis: An Introduction - Clustering Challenges in Biological Network


Fig. 5.7. Output of SOM where distances between neurons are mapped to (a) gray scale; (b) color map (Kohonen [19]).

For clustering analysis, what remains is to assign a cluster label to each observation in the matrix X. After visualizing the output of SOM, one can label the regions of the map that are considered to be clusters with different numbers. For instance, in Fig. 5.7(b), we can label the top region, the bottom-left region, and the bottom-right region as 1, 2, and 3, respectively. Then, for each observation, the neuron on the map with the highest similarity (or lowest dissimilarity) to the observation is identified. The observation is assigned the label of the region in which the identified neuron falls. If the identified neuron falls in an area separating the cluster regions, the observation is identified as an outlier.
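The labeling step described above can be sketched in Python with NumPy. This is only a minimal sketch under stated assumptions: the trained weight vectors are held in an array `weights` of shape (rows, cols, dim), the analyst has drawn the regions by hand into an integer grid `region_labels` in which 0 marks the areas separating clusters, and Euclidean distance serves as the dissimilarity. The names `assign_cluster_labels`, `region_labels`, and `OUTLIER` are hypothetical and do not come from the text.

```python
import numpy as np

OUTLIER = -1  # hypothetical marker for observations whose closest neuron lies between regions


def assign_cluster_labels(X, weights, region_labels):
    """Label each row of X by the map region containing its closest neuron.

    X             : (n, dim) observations
    weights       : (rows, cols, dim) trained SOM weight vectors
    region_labels : (rows, cols) integers drawn by the analyst; 0 = separating area
    """
    rows, cols, dim = weights.shape
    flat_w = weights.reshape(rows * cols, dim)
    # Squared Euclidean distance from every observation to every neuron.
    d2 = ((X[:, None, :] - flat_w[None, :, :]) ** 2).sum(axis=2)
    closest = d2.argmin(axis=1)                   # lowest-dissimilarity neuron per observation
    labels = region_labels.reshape(-1)[closest]   # region in which that neuron falls
    return np.where(labels == 0, OUTLIER, labels)


# Toy 1 x 3 map: the middle neuron sits in the area separating regions 1 and 2.
toy_weights = np.array([[[0.0], [5.0], [10.0]]])
toy_regions = np.array([[1, 0, 2]])
toy_X = np.array([[0.2], [4.9], [9.5]])
print(assign_cluster_labels(toy_X, toy_weights, toy_regions))  # prints [ 1 -1  2]
```

In the toy example the second observation is closest to the neuron in the separating area, so it is flagged as an outlier, while the other two receive the labels of regions 1 and 2.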

SOM has four main advantages. First, it does not require the number of clusters as an input; it clusters the observations and identifies the number of clusters at the same time. Second, the observations enter the algorithm sequentially, so the whole dataset does not have to be loaded into memory for clustering analysis. This is very helpful when the dataset is too large to fit in memory at once, and also when the whole dataset is not available in advance and the observations arrive sequentially. Third, SOM is a distance-preserving data visualization method. It maps a high-dimensional dataset onto a 2-D map, where the distance between observations is preserved in the distance between the weight vectors associated with the neurons, and this distance is visualized by gray scale or color maps, as in Fig. 5.7. Fourth, statistically, SOM simulates the density distribution of the dataset. For example, in Fig. 5.7(b), we can see that the whole dataset has three areas of high density, and that the different dense regions have different density distributions.
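The second and third advantages can be illustrated with a short Python sketch. It assumes a rectangular grid of neurons, a Gaussian neighborhood function, Euclidean distance, and fixed values of the learning rate `lr` and neighborhood width `sigma` (in practice both are usually decreased over time). The function names `som_online_update` and `u_matrix` are hypothetical, and the code is not taken from the text or from Kohonen [19]. The first function consumes one observation at a time, so the dataset never has to be held in memory as a whole; the second computes the neuron-to-neighbor distances that a gray-scale or color map would display.

```python
import numpy as np


def som_online_update(weights, x, lr=0.1, sigma=1.0):
    """Update a float array of weights in place with one observation x.

    weights : (rows, cols, dim) weight vectors, modified in place
    x       : (dim,) a single observation
    Returns the grid position of the closest (best-matching) neuron.
    """
    rows, cols, _ = weights.shape
    d2 = ((weights - x) ** 2).sum(axis=2)
    bi, bj = np.unravel_index(d2.argmin(), d2.shape)
    # Gaussian neighborhood centered on the best-matching neuron (an assumed choice).
    ii, jj = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    h = np.exp(-((ii - bi) ** 2 + (jj - bj) ** 2) / (2.0 * sigma ** 2))
    # Pull each neuron toward x, more strongly the closer it is to the winner on the grid.
    weights += lr * h[:, :, None] * (x - weights)
    return int(bi), int(bj)


def u_matrix(weights):
    """Mean distance from each neuron's weight vector to its 4-connected grid neighbors."""
    rows, cols, _ = weights.shape
    total = np.zeros((rows, cols))
    count = np.zeros((rows, cols))
    dv = np.linalg.norm(weights[1:] - weights[:-1], axis=2)        # vertical neighbors
    total[1:] += dv
    total[:-1] += dv
    count[1:] += 1
    count[:-1] += 1
    dh = np.linalg.norm(weights[:, 1:] - weights[:, :-1], axis=2)  # horizontal neighbors
    total[:, 1:] += dh
    total[:, :-1] += dh
    count[:, 1:] += 1
    count[:, :-1] += 1
    return total / np.maximum(count, 1)  # guard against a 1 x 1 map


# Observations arrive one at a time, e.g. from a generator reading a large file.
rng = np.random.default_rng(0)
som = rng.normal(size=(5, 5, 3))
for _ in range(200):
    observation = rng.normal(size=3)   # stand-in for the next streamed observation
    som_online_update(som, observation, lr=0.1, sigma=1.0)
distances = u_matrix(som)              # large values suggest borders between dense regions
```

Plotting `distances` as an image with a gray or color scale would give a picture in the spirit of Fig. 5.7, although the exact rendering used for that figure is not specified here.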

The biggest problem of SOM is that it is subjective. Although SOM identifies the number of clusters and clusters the objects at the same time, the number of clusters still rests on subjective human judgment. For instance, in
