Interval discretizations divide the interval $[v_1, v_N]$ into k equally sized bins, where k is user-defined. Another simple method is quantile discretization, which places N/k (possibly duplicated) values in each bin [38]. Any method based on those two
approaches would suffer from problems that make it inappropriate for some data
sets. Interval discretizations are very sensitive to outliers and may produce a strongly skewed range [39]. In addition, some discretization levels may not be represented at all, which may cause difficulties with their interpretation as part of the state space of a discrete model.
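As a concrete illustration, both methods can be sketched in a few lines of Python; the function names and the use of NumPy are illustrative assumptions, not part of the cited works:

```python
import numpy as np

def interval_discretize(v, k):
    """Equal-width binning: split [min(v), max(v)] into k equal intervals."""
    v = np.asarray(v, dtype=float)
    edges = np.linspace(v.min(), v.max(), k + 1)
    return np.clip(np.digitize(v, edges[1:-1]), 0, k - 1)

def quantile_discretize(v, k):
    """Equal-frequency binning: roughly len(v)/k values per bin."""
    v = np.asarray(v, dtype=float)
    edges = np.quantile(v, np.linspace(0, 1, k + 1))
    return np.clip(np.digitize(v, edges[1:-1]), 0, k - 1)

# The outlier sensitivity of interval discretization is easy to see:
v = [1.0, 2.0, 3.0, 100.0]
print(interval_discretize(v, 2))  # [0 0 0 1] -- the outlier dominates the range
print(quantile_discretize(v, 2))  # [0 0 1 1] -- depends only on the ordering
```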
On the other hand, quantile discretizations depend only on the ordering of the observed values of v and not on their relative spacing. Since the distances between data points are often the only information that comes with a short time series, losing them is very undesirable. A shortcoming common to both interval and quantile discretization, as well as to most other discretization methods, is that they require the number of discrete states, k, to be user-provided.
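A small example of this loss of spacing information (assumed data, chosen only for illustration): two series with identical ordering but very different spacing receive the same quantile discretization.

```python
import numpy as np

a = np.array([0.10, 0.11, 0.12, 0.90])  # first three points nearly coincide
b = np.array([0.10, 0.40, 0.70, 0.90])  # evenly spread points
for v in (a, b):
    print((v > np.median(v)).astype(int))  # median split: [0 0 1 1] both times
```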
A number of entropy-based discretization methods deserve attention. An example is Hartemink's Information-Preserving Discretization (IPD) [35]. It relies on minimizing the loss of pairwise mutual information between each pair of real-valued
vectors (variables). The mutual information between two random variables X and Y
with joint distribution $p(x, y)$ and marginal distributions $p(x)$ and $p(y)$ is defined as

$$I(X; Y) = \sum_{x} \sum_{y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \, .$$
Note that if X and Y are independent, then by the definition of independence $p(x, y) = p(x)\,p(y)$, so $I(X; Y) = 0$. Unfortunately, when modeling regulatory networks and
having as variables, for instance, mRNA, protein, and metabolite concentrations, the
joint distribution function is rarely known and it is often hard to determine whether
two variables are independent.
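In practice, the mutual information is therefore often estimated from the empirical joint distribution of paired observations. The sketch below is such a plug-in estimator (in bits, i.e., using base-2 logarithms); the function name is an illustrative assumption, and this is not Hartemink's IPD procedure itself:

```python
import numpy as np

def mutual_information(x, y):
    """Plug-in estimate of I(X;Y) in bits from two equal-length
    discrete sequences of paired observations."""
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    px = {v: np.mean(x == v) for v in set(x.tolist())}   # marginal of X
    py = {v: np.mean(y == v) for v in set(y.tolist())}   # marginal of Y
    joint = {}
    for pair in zip(x.tolist(), y.tolist()):             # joint counts
        joint[pair] = joint.get(pair, 0) + 1
    return sum((c / n) * np.log2((c / n) / (px[xi] * py[yi]))
               for (xi, yi), c in joint.items())

x = [0, 0, 1, 1, 0, 1, 0, 1]
print(mutual_information(x, x))                   # = H(X) = 1.0 bit here
print(mutual_information(x, [0, 1, 0, 1, 0, 1, 1, 0]))  # exactly 0: joint = product
```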
Another family of discretization techniques is based on clustering [40]. One of the most often used clustering algorithms is k-means [41]. It is a non-hierarchical clustering procedure whose goal is to minimize the dissimilarity of elements within each cluster while maximizing it between elements in different clusters. Many applications of k-means clustering, such as the MultiExperiment Viewer [42], start by taking a random partition of the elements into k clusters and computing their centroids. As a consequence, a different clustering may be obtained every time the algorithm is run. Another inconvenience is that the number k of clusters to be formed has to be specified in advance. Although there are methods for choosing “the best k,” such as the one described in [43], they rely on some knowledge of the data properties that may not be available.
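A minimal sketch of the procedure, assuming a one-dimensional vector and Lloyd's algorithm with randomly chosen initial centroids (the function name and stopping rule are illustrative). Running it with different seeds can produce different partitions, which is precisely the reproducibility issue noted above.

```python
import numpy as np

def kmeans_discretize(v, k, seed=None):
    """1-D k-means with random initialization; the result may
    differ from run to run."""
    rng = np.random.default_rng(seed)
    v = np.asarray(v, dtype=float)
    centroids = rng.choice(v, size=k, replace=False)  # random starting centroids
    for _ in range(100):
        # Assign each value to its nearest centroid
        labels = np.argmin(np.abs(v[:, None] - centroids[None, :]), axis=1)
        # Recompute centroids as cluster means (keep empty clusters in place)
        new = np.array([v[labels == j].mean() if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels

v = [0.1, 0.15, 0.5, 0.55, 0.9, 0.95]
print(kmeans_discretize(v, 3, seed=0))  # e.g. [0 0 1 1 2 2] up to label order
```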
Another method is single-link clustering (SLC) with the Euclidean distance function. SLC is an agglomerative (bottom-up) hierarchical clustering that defines the distance between two clusters as the minimal distance between any two objects belonging to different clusters [40]. In the context of discretization, these objects are the real-valued entries of the vector to be discretized, and the distance function that measures their dissimilarity is the Euclidean distance.
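In one dimension this takes a particularly simple form: with the Euclidean distance, the k clusters found by single linkage are exactly the groups obtained by cutting the k − 1 largest gaps between consecutive sorted values. A minimal sketch under that assumption (the function name is illustrative):

```python
import numpy as np

def slc_discretize(v, k):
    """Single-link clustering of a 1-D vector into k clusters by
    cutting the k-1 largest gaps between consecutive sorted values."""
    v = np.asarray(v, dtype=float)
    labels = np.zeros(len(v), dtype=int)
    if k <= 1:
        return labels
    order = np.argsort(v)
    gaps = np.diff(v[order])                     # distances between neighbors
    cuts = np.sort(np.argsort(gaps)[-(k - 1):])  # indices of the widest gaps
    # Sorted position p falls in the cluster equal to the number of cuts before it
    labels[order] = np.searchsorted(cuts, np.arange(len(v)), side='left')
    return labels

print(slc_discretize([0.1, 0.12, 0.5, 0.52, 0.9], 3))  # -> [0 0 1 1 2]
```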