great potential in uncovering new insight in a variety of biological systems.
A problem of great interest is transcription factor binding site identification,
which searches non-coding regions of DNA for sequence motifs. Transcription
factors are cellular proteins that take part in regulating gene expression, serving
as activators or inhibitors of transcription. Uncovering their binding targets is
a crucial step towards understanding the mechanisms of transcriptional activity in
the cell. Experimental approaches to identifying DNA binding sites, which include
both in-vivo [20] and in-vitro [31] designs, are still time-consuming, costly,
sensitive to perturbation and often imprecise in pinpointing the exact locations
of binding sites. Computational methods are therefore needed to address this
problem.
Computational discovery of transcription factor binding sites is typically cast
as the problem of finding mutually similar substrings in unaligned sequence data.
We refer to this task as the motif finding or motif discovery problem. Motif finding
algorithms operate on sets of sequences that are presumed to possess a common
motif. Two orthogonal approaches exist: one attempts to identify binding sites
among a set of regulatory regions of orthologous genes across genomes of varying
phylogenetic distance [3, 7, 16, 29, 30], referred to as phylogenetic footprinting,
and the other analyzes regulatory sequences for sets of genes from a single
genome assumed to be controlled by a common transcription factor. While in
the first case data are collected via gene orthology determination, in the second
case data are made available through DNA microarray studies [40, 43], chromatin
immunoprecipitation (ChIP-chip) experiments [20] and protein binding microar-
rays [31]. In DNA microarray studies normalized gene expression levels of many
genes are analyzed to reveal similar patterns in expression. Under the assumption
that co-expressed genes are likely co-regulated, the regulatory regions of such
co-expressed genes can be subjected to motif finding. In ChIP-chip and protein
binding microarray experiments, the binding of a regulatory protein to DNA is
recognized directly via molecular methods. The group of DNA sequences to which
the protein was bound can then be input to a motif finding algorithm to identify
the binding sites precisely.
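
To make the phrase "mutually similar substrings in unaligned sequence data"
concrete, the sketch below uses one common textbook formulation, assumed here
purely for illustration and not taken from the text: exhaustively search for the
word of length k whose best match in every input sequence has the smallest total
Hamming distance. The function names, the toy sequences and the choice of k are
all hypothetical.

```python
from itertools import product


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def best_match(word, seq):
    """Return (distance, offset) of the substring of seq closest to word."""
    k = len(word)
    return min((hamming(word, seq[i:i + k]), i)
               for i in range(len(seq) - k + 1))


def median_string(seqs, k):
    """Brute-force search over all 4**k DNA words of length k.

    Returns (total_distance, word, offsets), where offsets[i] is the
    position of the best match of word in seqs[i]. Only practical for
    small k; shown to illustrate the problem, not as an efficient solver.
    """
    best = None
    for letters in product("ACGT", repeat=k):
        word = "".join(letters)
        matches = [best_match(word, s) for s in seqs]
        total = sum(d for d, _ in matches)
        if best is None or total < best[0]:
            best = (total, word, [off for _, off in matches])
    return best


# Toy input: three unaligned sequences, each carrying a copy of TTGACA,
# with one mismatch (TTGCCA) in the last sequence.
toy_seqs = ["ACGTTTGACAGG", "TTGACACCGTAA", "GGCATTGCCATC"]
result = median_string(toy_seqs, 6)
```

On this toy input the search recovers the planted word together with the offset
of its closest occurrence in each sequence. The running time grows as 4^k, so a
brute-force search of this kind is only feasible for short motifs.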
There are two broad categories of motif finding algorithms, and they are fun-
damentally linked to the choice of the underlying motif representation [33, 41].
One class of algorithms is based on the consensus motif model, and the search
strategies focus on finding word-based patterns via various approaches including
enumerative [12, 15, 27, 34, 39, 44] and clustering [5, 35] methods. The other
class, the probabilistic algorithms, is based on the position-specific scoring
matrix (PSSM) model and uses greedy strategies [11] or parameter estimation
techniques, such as Expectation Maximization or Gibbs Sampling [14, 18, 19, 24, 38],
to maximize the information content of the sought motif.
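
The two motif representations can be contrasted on a small set of already
aligned binding sites. The sketch below is a minimal illustration under stated
assumptions: the matrix holds per-position base frequencies, the consensus is
the most frequent base at each position, and information content is computed
as relative entropy against a uniform background (a common convention, assumed
here because the text does not give a formula). The function names and the
example sites are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def position_frequencies(sites, pseudocount=0.0):
    """Per-position base frequencies for equal-length aligned sites."""
    width = len(sites[0])
    total = len(sites) + pseudocount * len(ALPHABET)
    matrix = []
    for j in range(width):
        counts = Counter(site[j] for site in sites)
        matrix.append({b: (counts[b] + pseudocount) / total
                       for b in ALPHABET})
    return matrix


def consensus(matrix):
    """Consensus word: the most frequent base in each column."""
    return "".join(max(ALPHABET, key=lambda b: col[b]) for col in matrix)


def information_content(matrix, background=None):
    """Sum over columns of relative entropy (in bits) vs. the background.

    The uniform background (0.25 per base) is an assumption of this sketch.
    """
    background = background or {b: 0.25 for b in ALPHABET}
    total = 0.0
    for col in matrix:
        for b in ALPHABET:
            if col[b] > 0:
                total += col[b] * math.log2(col[b] / background[b])
    return total


# Four aligned sites; the third differs from the others at one position.
sites = ["TTGACA", "TTGACA", "TTGCCA", "TTGACA"]
pssm = position_frequencies(sites)
word = consensus(pssm)
bits = information_content(pssm)
```

A fully conserved column contributes 2 bits under a uniform background, while
the variable fourth column contributes less. The consensus word discards that
variability, whereas the matrix retains it, which is the practical difference
between the two models.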