throughput methodologies followed by a detailed computational analysis to (i) reduce
the noise, (ii) normalize and finally (iii) extract statistically meaningful
biological information from such data. One such type of information is the TRN.
The DNA sequence elements representing binding sites for TFs are responsible for
the coordinated expression of genes possessing a given TF binding site, i.e. TRNs
can be deciphered by finding all true TF binding sites in a given genome. These
regulatory sites are, however, short, degenerate and embedded in large regions
of non-coding DNA. The shortness of TF binding sites translates into a high number
of false positives, which is further increased by the well-known degeneracy
of regulatory motifs. A relatively novel approach to improve the computational
identification of TF binding sites takes advantage of the identification of clusters of
co-expressed genes over a large number of different conditions. The upstream regions
of co-expressed genes can then be analyzed for the presence of shared sequence
motifs which might explain the observed co-regulation.
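
As a rough illustration of how the upstream regions of a co-expressed cluster might be screened for shared sequence motifs, the Python sketch below counts in how many sequences each n-mer occurs and compares that fraction with a set of non-coregulated sequences. The function names, the pseudocount and the `min_ratio` threshold are assumptions chosen for this example; they are not taken from any of the cited methods.

```python
from collections import Counter


def count_nmers(sequences, n):
    """For each n-mer, count the number of sequences that contain it at least once."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        present = {seq[i:i + n] for i in range(len(seq) - n + 1)}
        counts.update(present)
    return counts


def enriched_nmers(cluster_seqs, background_seqs, n, min_ratio=2.0, pseudocount=1.0):
    """Return (n-mer, ratio) pairs whose cluster frequency exceeds the background.

    The ratio compares the smoothed fraction of cluster sequences containing
    the n-mer with the smoothed fraction of background sequences containing it.
    The pseudocount (an assumption of this sketch) avoids division by zero.
    """
    cluster = count_nmers(cluster_seqs, n)
    background = count_nmers(background_seqs, n)
    results = []
    for nmer, c in cluster.items():
        f_cluster = (c + pseudocount) / (len(cluster_seqs) + pseudocount)
        f_background = (background.get(nmer, 0) + pseudocount) / (
            len(background_seqs) + pseudocount
        )
        ratio = f_cluster / f_background
        if ratio >= min_ratio:
            results.append((nmer, ratio))
    # Highest ratio first; ties broken alphabetically for a stable order.
    return sorted(results, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Toy upstream regions sharing the 7-mer TGACTCA (made-up data).
    cluster = ["AATGACTCAGG", "CCTGACTCATT", "GGTGACTCACC"]
    background = ["AAAAAAAAAAA", "CCCCCCCCCCC", "GGGGGGGGGGG", "TTTTTTTTTTT"]
    for nmer, ratio in enriched_nmers(cluster, background, n=7)[:3]:
        print(nmer, round(ratio, 2))
```

Counting each n-mer once per sequence, rather than counting every occurrence, keeps a single repeat-rich promoter from dominating the ranking. A real analysis would replace the simple ratio with a proper statistical test.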
Identifying Candidate Motifs: Several computational methods for the discovery
of transcription factor binding sites (TFBSs) have been described [14]. They can
be schematically divided into two main approaches. The first one evaluates the
frequencies of all possible sequences of length n (n-mers) iteratively and at each
step it updates a position weighted probability matrix (PWM) corresponding to
the candidate motif. A background distribution of n-mers is calculated for a set of
non-coregulated genes so that n-mers that are more abundant than expected can be
identified [15–18]. The second approach specifies the n-mer as a PWM and utilizes
an iterative algorithm, typically represented as an expectation maximization [19] or
Gibbs sampling procedure [20–22]. These iterative methods can extend the window
size to lengths n > 8 nucleotides which is an upper limit for enumerative techniques;
on the other hand, these methods might be trapped at local optima. Improved models
of the background sequence distribution using high-order Markov models [23]
to represent the null distribution increase the accuracy. A score reflecting how well
a DNA string matches each candidate motif can be calculated by taking into account
its PWM, the background model and the number of motif occurrences for
each promoter sequence. One of the most commonly used scoring functions is the
following: let $\mu$ be a motif of length $w$, with occurrences $\mu_i$ in upstream sequence $g$;
the corresponding PWM is $M$, of size $w \times n$, where $n = 4$, one column for each
possible nucleotide. We calculate the motif probability on $M$, $P(\mu_i \mid M)$, and on
the background model, $P(\mu_i \mid M_{BKG})$. Then, with $k$ the number of occurrences of $\mu$ in $g$, the scoring function is:

$$S_{\mu,g} = \log_2 \left[ \sum_{i=1}^{k} \frac{P(\mu_i \mid M)}{P(\mu_i \mid M_{BKG})} \right].$$
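
The scoring function above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the MotifRegressor or MotifScorer implementation: every window of width w in the upstream sequence is treated as a candidate occurrence, the background is a zero-order model (independent base frequencies) rather than a high-order Markov model, and all function names are hypothetical.

```python
import math

NUCLEOTIDES = "ACGT"  # column order assumed for the PWM in this sketch


def site_probability(site, pwm):
    """P(site | M): product of the PWM entries for the base seen at each position."""
    p = 1.0
    for position, base in enumerate(site):
        p *= pwm[position][NUCLEOTIDES.index(base)]
    return p


def background_probability(site, base_freqs):
    """P(site | M_BKG) under a zero-order background (an assumption of this sketch)."""
    p = 1.0
    for base in site:
        p *= base_freqs[base]
    return p


def motif_score(upstream, pwm, base_freqs):
    """log2 of the summed likelihood ratios over all width-w windows of `upstream`.

    `upstream` is expected to contain only the upper-case letters A, C, G, T.
    Returns -inf when the sequence is shorter than the motif.
    """
    w = len(pwm)
    total = 0.0
    for start in range(len(upstream) - w + 1):
        site = upstream[start:start + w]
        total += site_probability(site, pwm) / background_probability(site, base_freqs)
    return math.log2(total) if total > 0 else float("-inf")


if __name__ == "__main__":
    # Toy PWM of width 2 favouring the dinucleotide "AT" (made-up numbers).
    pwm = [
        [0.7, 0.1, 0.1, 0.1],  # position 1: A, C, G, T
        [0.1, 0.1, 0.1, 0.7],  # position 2: A, C, G, T
    ]
    uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    print(motif_score("GGATCC", pwm, uniform))
    print(motif_score("GGCCGG", pwm, uniform))
```

Summing the ratios before taking the logarithm means that a promoter with several moderate matches can score as high as one with a single strong match, which is how the number of occurrences enters the score.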
This formula is widely used, e.g. in MotifRegressor [24] and MotifScorer [25]. Another
possible scoring function is the one proposed by [26], which does not require