Biology Reference
In-Depth Information
distance between two vector entries v and w will be the one-dimensional Euclidean
distance
. Top-down clustering algorithms start from the entire data set and
iteratively split it until either the degree of similarity reaches a certain threshold or
every group consists of one object only. For the purpose of data analysis, it is imprac-
tical to let the clustering algorithm produce clusters containing only one real value.
The iteration at which the algorithm is terminated is crucial since it determines the
degree of the discretization. SLC with the Euclidean distance function has one major
advantage: very little starting information is needed (only distances between points).
In addition, being a hierarchical clustering procedure it lends itself to adjustment in
case that clusters need to be split or merged. It may result, however, in a discretiza-
tion where most of the points are clustered into a single partition if they happen to be
relatively close to one another. Another problemwith SLC is that its direct implemen-
tation takes D , the desired number of discrete states, as an input. However, we would
like to choose D as small as possible, without losing information about the system
dynamics and the correlation between the variables, so that an essentially arbitrary
choice is unsatisfactory.
In [ 44 ], a hybrid method for discretization of short time series of experimental data
into a finite number of states was introduced. It is a modification of the SLC algorithm:
it begins by discretizing a vector in the same way as SLC does but instead of providing
D as part of the input, the algorithm contains termination criteria which determine
the appropriate value of D . After that each discrete state is checked for information
content and if it is determined that this content can be considerably increased by
further discretization, then the state is separated into two states in a way that may not
be consistent with SLC.
If more than one vector is to be discretized, the algorithm discretizes each vector
independently. (For details on multiple vector discretization, see [ 44 ]). If the vector
contains m distinct entries, a complete weighted graph on m vertices is constructed,
where a vertex represents an entry and an edge weight is the Euclidean distance
between its endpoints. The discretization process starts by deleting the edge(s) of
highest weight until the graph gets disconnected. If there is more than one edge
labeled with the current highest weight, then all of the edges with this weight are
deleted. The order in which the edges are removed leads to components, in which
the distance between any two vertices is smaller than the distance between any two
components, a requirement of SLC. We define the distance between two components
G and H to be dist
| v w |
.The output of the algorithm
is a discretization of the vector, in which each cluster corresponds to a discrete state
and the vector entries that belong to one component are discretized into the same
state.
Example 3.12.
(
G
,
H
) =
min
{|
g
h
||
g
G
,
h
H
}
is to be discretized.
The diagram obtained by the SLC algorithm is given in Figure 3.3 . The complete
weighted graph in Figure 3.4 corresponds to iteration 0 of the diagram.
Having disconnected the graph, the next task is to determine if the obtained degree
of discretization is sufficient; if not, the components need to be further disconnected in
Suppose that vector v
= (
1
,
2
,
7
,
9
,
10
,
11
)
 
Search WWH ::




Custom Search