Interval discretizations divide the interval $[v_1, v_N]$ into k equally sized bins, where k is user-defined. Another simple method is quantile discretization, which places N/k (possibly duplicated) values in each bin [38]. Any method based on those two
approaches would suffer from problems that make it inappropriate for some data
sets. Interval discretizations are very sensitive to outliers and may produce a strongly skewed range [39]. In addition, some discretization levels may not be represented at all, which may cause difficulties with their interpretation as part of the state space of a discrete model.
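As a concrete illustration, both methods can be sketched in a few lines of Python; the function names and the use of NumPy are illustrative assumptions, not part of the cited works:

```python
import numpy as np

def interval_discretize(v, k):
    """Equal-width binning: split [min(v), max(v)] into k equal intervals."""
    v = np.asarray(v, dtype=float)
    edges = np.linspace(v.min(), v.max(), k + 1)
    return np.clip(np.digitize(v, edges[1:-1]), 0, k - 1)

def quantile_discretize(v, k):
    """Equal-frequency binning: roughly len(v)/k values per bin."""
    v = np.asarray(v, dtype=float)
    edges = np.quantile(v, np.linspace(0, 1, k + 1))
    return np.clip(np.digitize(v, edges[1:-1]), 0, k - 1)

# The outlier sensitivity of interval discretization is easy to see:
v = [1.0, 2.0, 3.0, 100.0]
print(interval_discretize(v, 2))  # [0 0 0 1] -- the outlier dominates the range
print(quantile_discretize(v, 2))  # [0 0 1 1] -- depends only on the ordering
```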
On the other hand, quantile discretizations depend only on the ordering of the observed values of v and not on their relative spacing. Since the distances between data points are often the only information that comes with a short time series, losing them is very undesirable. A shortcoming common to both interval and quantile discretization, as well as to most other discretization methods, is that they require the number of discrete states, k, to be user-provided.
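A small example of this loss of spacing information (assumed data, chosen only for illustration): two series with identical ordering but very different spacing receive the same quantile discretization.

```python
import numpy as np

a = np.array([0.10, 0.11, 0.12, 0.90])  # first three points nearly coincide
b = np.array([0.10, 0.40, 0.70, 0.90])  # evenly spread points
for v in (a, b):
    print((v > np.median(v)).astype(int))  # median split: [0 0 1 1] both times
```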
A number of entropy-based discretization methods deserve attention. An example is Hartemink's Information-Preserving Discretization (IPD) [35]. It relies on minimizing the loss of pairwise mutual information between each pair of real-valued
vectors (variables). The mutual information between two random variables X and Y
with joint distribution $p(x, y)$ and marginal distributions $p(x)$ and $p(y)$ is defined as

$$I(X; Y) = \sum_{x} \sum_{y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \, .$$
Note that if X and Y are independent, then by the definition of independence $p(x, y) = p(x)\,p(y)$, so $I(X; Y) = 0$. Unfortunately, when modeling regulatory networks and
having as variables, for instance, mRNA, protein, and metabolite concentrations, the
joint distribution function is rarely known and it is often hard to determine whether
two variables are independent.
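In practice, the mutual information is therefore often estimated from the empirical joint distribution of paired observations. The sketch below is such a plug-in estimator (in bits, i.e., using base-2 logarithms); the function name is an illustrative assumption, and this is not Hartemink's IPD procedure itself:

```python
import numpy as np

def mutual_information(x, y):
    """Plug-in estimate of I(X;Y) in bits from two equal-length
    discrete sequences of paired observations."""
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    px = {v: np.mean(x == v) for v in set(x.tolist())}   # marginal of X
    py = {v: np.mean(y == v) for v in set(y.tolist())}   # marginal of Y
    joint = {}
    for pair in zip(x.tolist(), y.tolist()):             # joint counts
        joint[pair] = joint.get(pair, 0) + 1
    return sum((c / n) * np.log2((c / n) / (px[xi] * py[yi]))
               for (xi, yi), c in joint.items())

x = [0, 0, 1, 1, 0, 1, 0, 1]
print(mutual_information(x, x))                   # = H(X) = 1.0 bit here
print(mutual_information(x, [0, 1, 0, 1, 0, 1, 1, 0]))  # exactly 0: joint = product
```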
Another family of discretization techniques is based on clustering [40]. One of the most often used clustering algorithms is k-means [41]. It is a non-hierarchical clustering procedure whose goal is to minimize the dissimilarity of elements within each cluster while maximizing it between elements in different clusters. Many applications of k-means clustering, such as the MultiExperiment Viewer [42], start by taking a random partition of the elements into k clusters and computing their centroids. As a consequence, a different clustering may be obtained every time the algorithm is run. Another inconvenience is that the number k of clusters to be formed has to be specified in advance. Although there are methods for choosing “the best k,” such as the one described in [43], they rely on some knowledge of the data properties that may not be available.
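A minimal sketch of the procedure, assuming a one-dimensional vector and Lloyd's algorithm with randomly chosen initial centroids (the function name and stopping rule are illustrative). Running it with different seeds can produce different partitions, which is precisely the reproducibility issue noted above.

```python
import numpy as np

def kmeans_discretize(v, k, seed=None):
    """1-D k-means with random initialization; the result may
    differ from run to run."""
    rng = np.random.default_rng(seed)
    v = np.asarray(v, dtype=float)
    centroids = rng.choice(v, size=k, replace=False)  # random starting centroids
    for _ in range(100):
        # Assign each value to its nearest centroid
        labels = np.argmin(np.abs(v[:, None] - centroids[None, :]), axis=1)
        # Recompute centroids as cluster means (keep empty clusters in place)
        new = np.array([v[labels == j].mean() if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels

v = [0.1, 0.15, 0.5, 0.55, 0.9, 0.95]
print(kmeans_discretize(v, 3, seed=0))  # e.g. [0 0 1 1 2 2] up to label order
```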
Another method is single-link clustering (SLC) with the Euclidean distance function. SLC is an agglomerative (bottom-up) hierarchical clustering that defines the distance between two clusters as the minimal distance between any two objects belonging to different clusters [40]. In the context of discretization, these objects are the real-valued entries of the vector to be discretized, and the distance function that measures their dissimilarity is the Euclidean distance.
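In one dimension this takes a particularly simple form: with the Euclidean distance, the k clusters found by single linkage are exactly the groups obtained by cutting the k − 1 largest gaps between consecutive sorted values. A minimal sketch under that assumption (the function name is illustrative):

```python
import numpy as np

def slc_discretize(v, k):
    """Single-link clustering of a 1-D vector into k clusters by
    cutting the k-1 largest gaps between consecutive sorted values."""
    v = np.asarray(v, dtype=float)
    labels = np.zeros(len(v), dtype=int)
    if k <= 1:
        return labels
    order = np.argsort(v)
    gaps = np.diff(v[order])                     # distances between neighbors
    cuts = np.sort(np.argsort(gaps)[-(k - 1):])  # indices of the widest gaps
    # Sorted position p falls in the cluster equal to the number of cuts before it
    labels[order] = np.searchsorted(cuts, np.arange(len(v)), side='left')
    return labels

print(slc_discretize([0.1, 0.12, 0.5, 0.52, 0.9], 3))  # -> [0 0 1 1 2]
```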