Graph-based Approaches for Motif Discovery - Clustering Challenges in Biological Network

great potential in uncovering new insight in a variety of biological systems.

A problem of great interest in the study of sequence motifs in non-coding regions of DNA is the identification of transcription factor binding sites. Transcription factors are cellular proteins that take part in regulating gene expression, serving as activators or inhibitors of transcription. Uncovering their binding targets is a crucial step towards understanding the mechanisms of transcriptional activity in the cell. Biological approaches to identifying DNA binding sites, which include both in vivo [20] and in vitro [31] experimental designs, are still time-consuming, costly, sensitive to perturbation and often imprecise in pinpointing the exact locations of binding sites. Computational methods are therefore needed to address this problem.

Computational discovery of transcription factor binding sites is typically cast as the problem of finding mutually similar substrings in unaligned sequence data. We refer to this task as the motif finding or motif discovery problem. Motif finding algorithms operate on sets of sequences that are presumed to possess a common motif. Two orthogonal approaches exist: one, referred to as phylogenetic footprinting, attempts to identify binding sites among a set of regulatory regions of orthologous genes across genomes of varying phylogenetic distance [3, 7, 16, 29, 30]; the other analyzes regulatory sequences for sets of genes from a single genome assumed to be controlled by a common transcription factor. While in the first case data are collected via gene orthology determination, in the second case data are made available through DNA microarray studies [40, 43], chromatin immunoprecipitation (ChIP-chip) experiments [20] and protein binding microarrays [31]. In DNA microarray studies, normalized expression levels of many genes are analyzed to reveal similar patterns in expression. Under the assumption that co-expressed genes are likely co-regulated, the regulatory regions of such co-expressed genes can be subjected to motif finding. In the latter two approaches (ChIP-chip and protein binding microarrays), the binding of a regulatory protein to DNA is recognized directly via molecular methods. The group of DNA sequences to which the protein was bound can then be input to a motif finding algorithm to identify the binding sites precisely.
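To make the phrase "mutually similar substrings in unaligned sequence data" concrete, the following is a minimal sketch of one common way to state the problem, not the method of any cited work. It assumes a fixed motif length k, uses Hamming distance as the similarity measure, and exhaustively tries every DNA word of length k, keeping the word whose closest window in each sequence gives the smallest total distance. The function names and the toy sequences are hypothetical, chosen only for illustration.

```python
from itertools import product


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def best_window(pattern, seq):
    """Return (distance, offset) of the window of seq closest to pattern.

    Ties are broken in favour of the smallest offset.
    """
    k = len(pattern)
    return min((hamming(pattern, seq[i:i + k]), i)
               for i in range(len(seq) - k + 1))


def median_string(seqs, k):
    """Brute-force search over all 4**k DNA words of length k.

    Returns (total_distance, pattern, offsets), where offsets[j] is the
    position of the best-matching window in seqs[j]. Only practical for
    small k, since the search space grows exponentially.
    """
    best = None
    for letters in product("ACGT", repeat=k):
        pattern = "".join(letters)
        hits = [best_window(pattern, s) for s in seqs]
        total = sum(d for d, _ in hits)
        if best is None or total < best[0]:
            best = (total, pattern, [i for _, i in hits])
    return best


# Toy input: three unaligned sequences, each containing a copy of the
# same 6-mer; the third copy carries one substitution.
toy_seqs = ["CCATGACGTAAC", "GGTGACGTTTAG", "ATTGTCGTCCGA"]
total, pattern, offsets = median_string(toy_seqs, 6)
print(total, pattern, offsets)
```

On this toy input the search recovers the planted word and its position in each sequence. Real regulatory regions are far longer and the motif length is usually unknown, which is why practical algorithms rely on smarter search strategies than exhaustive enumeration.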

There are two broad categories of motif finding algorithms, and they are fundamentally linked to the choice of the underlying motif representation [33, 41]. One class of algorithms is based on the consensus motif model, and its search strategies focus on finding word-based patterns via various approaches, including enumerative [12, 15, 27, 34, 39, 44] and clustering [5, 35] methods. The other class, the probabilistic algorithms, is based on the position-specific scoring matrix (PSSM) model and uses greedy strategies [11] or parameter estimation techniques, such as Expectation Maximization or Gibbs Sampling [14, 18, 19, 24, 38], to maximize the information content of the sought motif.
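The two motif representations can be contrasted on a small example. The sketch below assumes the binding sites have already been located and aligned, builds a per-column base frequency matrix from them, reads off a consensus word, and computes information content as the relative entropy of each column against a background distribution, summed over columns. The uniform background of 0.25 per base, the optional pseudocount, and all function names are assumptions made for illustration; the cited algorithms may define and optimize these quantities differently.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def frequency_matrix(sites, pseudocount=0.0):
    """Per-column base frequencies for equal-length, aligned sites."""
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(ALPHABET)
    matrix = []
    for col in range(width):
        counts = Counter(s[col] for s in sites)
        matrix.append({b: (counts[b] + pseudocount) / total
                       for b in ALPHABET})
    return matrix


def consensus(matrix):
    """Most frequent base per column (ties go to the first in ALPHABET)."""
    return "".join(max(ALPHABET, key=lambda b: col[b]) for col in matrix)


def information_content(matrix, background=None):
    """Sum over columns of sum_b f_b * log2(f_b / p_b), in bits."""
    if background is None:
        background = {b: 0.25 for b in ALPHABET}  # assumed uniform
    bits = 0.0
    for col in matrix:
        for b in ALPHABET:
            f = col[b]
            if f > 0:  # a zero frequency contributes nothing
                bits += f * math.log2(f / background[b])
    return bits


# Hypothetical aligned sites: the third position is not fully conserved.
sites = ["TGACGT", "TGACGT", "TGTCGT", "TGACGT"]
pssm = frequency_matrix(sites)
print(consensus(pssm))
print(round(information_content(pssm), 3))
```

Under a uniform background, a fully conserved column contributes 2 bits and a column with no base preference contributes 0 bits, so the score rewards matrices whose columns are strongly biased toward particular bases. The consensus word discards the per-column variation that the matrix retains, which is the essential difference between the two models.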
