The Structural Network Properties of Biological Systems - Biological Networks


throughput methodologies followed by a detailed computational analysis to (i) reduce
the noise, (ii) normalize and finally (iii) extract statistically meaningful biological
information from such data. One such type of information is the TRN. The DNA sequence
elements representing binding sites for TFs are responsible for the coordinated
expression of genes possessing a given TF binding site, i.e. TRNs can be deciphered by
finding all true TF binding sites in a given genome. These regulatory sites are,
however, short, degenerate and embedded in large regions of non-coding DNA. The
shortness of TF binding sites translates into a high number of false positives, which
is further increased by the well-known degeneracy of regulatory motifs. A relatively
novel approach to improve the computational identification of TF binding sites takes
advantage of the identification of clusters of genes that are co-expressed over a large
number of different conditions. The upstream regions of co-expressed genes can then be
analyzed for the presence of shared sequence motifs which might explain the observed
co-regulation.
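The search for shared motifs in upstream regions can be illustrated with a minimal sketch. It assumes the upstream regions are available as plain DNA strings, and it simply ranks n-mers by how much more frequent they are in the co-expressed cluster than in a set of unrelated upstream regions. The function names, the default n = 6 and the pseudocount are illustrative assumptions, not part of any published tool.

```python
from collections import Counter

VALID = set("ACGT")

def count_nmers(sequences, n):
    """Count every overlapping n-mer (A/C/G/T only) across a list of DNA strings."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - n + 1):
            nmer = seq[i:i + n]
            if set(nmer) <= VALID:  # skip windows containing N or other symbols
                counts[nmer] += 1
    return counts

def enriched_nmers(cluster_seqs, background_seqs, n=6, pseudocount=1.0):
    """Rank n-mers by (frequency in cluster) / (frequency in background).

    The pseudocount is an assumption of this sketch; it avoids division
    by zero for n-mers that never occur in the background set.
    """
    fg = count_nmers(cluster_seqs, n)
    bg = count_nmers(background_seqs, n)
    fg_total = sum(fg.values())
    bg_total = sum(bg.values())
    vocab = 4 ** n  # number of possible n-mers
    ratios = {}
    for nmer, c in fg.items():
        fg_freq = (c + pseudocount) / (fg_total + pseudocount * vocab)
        bg_freq = (bg.get(nmer, 0) + pseudocount) / (bg_total + pseudocount * vocab)
        ratios[nmer] = fg_freq / bg_freq
    # Highest enrichment first
    return sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)

# Toy data: three "co-expressed" upstream regions sharing TGACTCA
cluster = ["AATGACTCAGG", "CCTGACTCATT", "GGTGACTCACA"]
background = ["ACGTACGTACGT", "GGCCTTAAGGCC", "TTTTAAAACCCC"]
ranked = enriched_nmers(cluster, background, n=6)
```

On real data the ratio would normally be replaced by a proper statistical test, since a raw frequency ratio does not account for sampling noise in short sequence sets.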

Identifying Candidate Motifs: Several computational methods for the discovery of
transcription factor binding sites (TFBSs) have been described [14]. They can be
schematically divided into two main approaches. The first one evaluates the frequencies
of all possible sequences of length n (n-mers) iteratively and at each step it updates
a position weighted probability matrix (PWM) corresponding to the candidate motif. A
background distribution of n-mers is calculated for a set of non-coregulated genes so
that n-mers that are more abundant than expected can be identified [15–18]. The second
approach specifies the n-mer as a PWM and utilizes an iterative algorithm, typically
represented as an expectation maximization [19] or Gibbs sampling procedure [20–22].
These iterative methods can extend the window size to lengths n > 8 nucleotides, where
n = 8 is an upper limit for enumerative techniques; on the other hand, these methods
might be trapped at local optima. Improved models of the background sequence
distribution, which use higher-order Markov models [23] to represent the null
distribution, increase the accuracy. A score reflecting how well a DNA string matches
each candidate motif can be calculated by taking into account its PWM, the background
model and the number of motif occurrences for each promoter sequence. One of the most
commonly used scoring functions is the
following: let $\mu$ be a motif of length $w$, with occurrences $\mu_i$ in upstream
sequence $g$; the corresponding PWM is $M_\mu$, of size $w \times n$, where $n = 4$,
one column for each possible nucleotide. We calculate the probability of each
occurrence under the motif model, $P(\mu_i \mid M_\mu)$, and under the background
model, $P(\mu_i \mid M_{BKG})$. Then, summing over all occurrences $\mu_i$ in $g$, the
scoring function is:

$$S_{\mu,g} = \log_2 \left[ \sum_{i} \frac{P(\mu_i \mid M_\mu)}{P(\mu_i \mid M_{BKG})} \right].$$
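A minimal sketch of this score, under stated assumptions, may help make the definition concrete. It treats every window of width w in the upstream sequence as a candidate occurrence, stores the PWM as a list of w rows with one probability per nucleotide in the order A, C, G, T, and uses a zero-order background (independent base frequencies) instead of the higher-order Markov models mentioned above. All function names are hypothetical and do not correspond to the API of any existing tool.

```python
import math

NUCLEOTIDES = "ACGT"

def site_probability(site, pwm):
    """P(site | PWM): product of the per-position nucleotide probabilities."""
    p = 1.0
    for pos, base in enumerate(site):
        p *= pwm[pos][NUCLEOTIDES.index(base)]
    return p

def background_probability(site, bg_freqs):
    """P(site | background) under a zero-order model (assumption of this sketch)."""
    p = 1.0
    for base in site:
        p *= bg_freqs[base]
    return p

def motif_score(upstream, pwm, bg_freqs):
    """log2 of the summed likelihood ratios over all windows of width w."""
    w = len(pwm)
    upstream = upstream.upper()
    total = 0.0
    for i in range(len(upstream) - w + 1):
        site = upstream[i:i + w]
        if any(b not in NUCLEOTIDES for b in site):
            continue  # skip windows with ambiguous symbols such as N
        total += site_probability(site, pwm) / background_probability(site, bg_freqs)
    # A sequence with no window of non-zero probability gets -infinity
    return math.log2(total) if total > 0 else float("-inf")

# Toy example: a deterministic motif "AC" and a uniform background
pwm = [[1.0, 0.0, 0.0, 0.0],   # position 1: always A
       [0.0, 1.0, 0.0, 0.0]]   # position 2: always C
bg = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
score = motif_score("ACGT", pwm, bg)
```

Because the ratios are summed before the logarithm is taken, several weak matches in one promoter can contribute as much as a single strong one, which is how the score reflects the number of motif occurrences per sequence.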

This formula is widely used, e.g. in MotifRegressor [24] and MotifScorer [25]. Another
possible scoring function is the one proposed by [26], which does not require
