2. Motif Discovery in a Nutshell
Knowing the inputs and outputs is central to the understanding of a
computational problem. The data structures involved in motif discovery
are shown in Fig. 1. The input consists of a set of sequences, not
necessarily of fixed length. The output consists of a list of motif instances
and/or a motif description: a consensus sequence, a probability matrix,
or a weight matrix. The motif instances are subsequences of the input
sequences, and each can be defined by a sequence name and a starting
position. For the type of motifs considered here, they are of fixed length.
The motif description and the motif annotation (the set of motif instances)
are intertwined entities: the motif description defines the motif annotation
of the input data set, and the set of motif instances can in turn be used to
derive the motif description.
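The relationship between instances and description can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method of any particular program: the names `MotifInstance`, `extract`, `count_matrix`, and `majority_consensus` are hypothetical, start positions are taken to be 0-based, and the toy sequences are invented for the example.

```python
from collections import Counter
from typing import NamedTuple


class MotifInstance(NamedTuple):
    seq_name: str  # name of the input sequence
    start: int     # 0-based start position (an assumption of this sketch)


def extract(sequences, instance, k):
    """Return the k-letter subsequence that a motif instance refers to."""
    seq = sequences[instance.seq_name]
    return seq[instance.start:instance.start + k]


def count_matrix(sequences, instances, k):
    """Count how often each base occurs at each motif position."""
    counts = [Counter() for _ in range(k)]
    for inst in instances:
        for pos, base in enumerate(extract(sequences, inst, k)):
            counts[pos][base] += 1
    return counts


def majority_consensus(counts):
    """Most frequent base per position; ties are broken alphabetically."""
    return "".join(min(c, key=lambda b: (-c[b], b)) for c in counts)


# Toy input: sequences of different lengths, one instance in each.
sequences = {
    "seq1": "GGTATAAAGC",
    "seq2": "CTATATAG",
    "seq3": "ATTATAAACCGT",
}
instances = [
    MotifInstance("seq1", 2),
    MotifInstance("seq2", 1),
    MotifInstance("seq3", 2),
]
k = 6
words = [extract(sequences, inst, k) for inst in instances]
consensus = majority_consensus(count_matrix(sequences, instances, k))
```

Here the list of `(seq_name, start)` pairs plays the role of the motif annotation, and the per-position counts are the raw material from which a consensus sequence or a matrix description could be derived.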
A consensus sequence is a short sequence (a k-letter word) from the
DNA alphabet or from an extended alphabet containing IUPAC
(International Union of Pure and Applied Chemistry) codes for
incompletely specified bases in nucleotide sequences [11]. A threshold
number of mismatches may be permitted. The consensus sequence,
together with the maximal number of allowed mismatches, defines
the motif in a deterministic and qualitative manner. Specifically,
it defines the subset of all k-letter words which qualify as motif
instances.
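As an illustration of how a consensus sequence plus a mismatch threshold selects a subset of k-letter words, the following sketch scans a sequence for qualifying words. The function names `mismatches` and `find_instances` and the example sequence are hypothetical; the code table lists the standard IUPAC nucleotide codes, and positions are 0-based by assumption.

```python
# IUPAC nucleotide codes mapped to the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def mismatches(consensus, word):
    """Count positions where the word's base is not allowed by the consensus."""
    if len(consensus) != len(word):
        raise ValueError("consensus and word must have the same length")
    return sum(base not in IUPAC[code] for code, base in zip(consensus, word))


def find_instances(consensus, sequence, max_mismatches=0):
    """Return 0-based start positions of all k-letter words that qualify."""
    k = len(consensus)
    return [
        start
        for start in range(len(sequence) - k + 1)
        if mismatches(consensus, sequence[start:start + k]) <= max_mismatches
    ]


sequence = "GGTATAAAGCTATATAG"
hits_exact = find_instances("TATAWA", sequence)
hits_relaxed = find_instances("TATAWA", sequence, max_mismatches=2)
```

Raising `max_mismatches` can only enlarge the set of qualifying words, which is why the definition is deterministic and qualitative: a word either is an instance or is not.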
The position-specific scoring matrix, introduced in its standard form
by Staden [12], is a more flexible representation of a sequence motif. Its
use is motivated by the assumption that not all mismatches to a consensus
sequence are equally detrimental. Therefore, the relative fit of a particular
base to a given motif position is expressed by a number. Matrix
descriptions for sequence motifs come in two forms: base probability
matrices and additive scoring matrices, henceforth called weight matrices.
The former reflect the expected frequencies of each base at each position.
The latter serve to compute a motif score for a particular k-letter
sequence by adding up, for each position, the matrix element corresponding
to the base found there. The parameters of a weight matrix can have
positive or negative values.
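The two matrix forms can be sketched as follows. The text above does not say how weights are obtained from probabilities; the log-odds conversion against a uniform background used here is one common choice and is an assumption of this example, as are the pseudocount and the hypothetical function names `probability_matrix`, `weight_matrix`, and `score`.

```python
import math

BASES = "ACGT"


def probability_matrix(instances, pseudocount=1.0):
    """Per-position base probabilities estimated from aligned motif instances.

    The pseudocount is an assumption of this sketch; it keeps every
    probability above zero so that the logarithm below is defined.
    """
    k = len(instances[0])
    total = len(instances) + pseudocount * len(BASES)
    matrix = []
    for pos in range(k):
        column = [inst[pos] for inst in instances]
        matrix.append(
            {b: (column.count(b) + pseudocount) / total for b in BASES}
        )
    return matrix


def weight_matrix(prob_matrix, background=0.25):
    """Log-odds weights: an assumed conversion, not the only possible one."""
    return [
        {b: math.log2(p / background) for b, p in column.items()}
        for column in prob_matrix
    ]


def score(weights, word):
    """Add up one matrix element per position: the one for the observed base."""
    return sum(column[base] for column, base in zip(weights, word))


instances = ["TATAAA", "TATATA", "TATAAA"]
probs = probability_matrix(instances)
weights = weight_matrix(probs)
good = score(weights, "TATAAA")
poor = score(weights, "GCGCGC")
```

With this conversion, bases that are more frequent than the background receive positive weights and rarer ones receive negative weights, so a word resembling the instances scores higher than an unrelated word.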