Biomedical Engineering Reference
In working with databases, it is important to recognize their inherent deficiencies. As well as errors in the sequencing process itself, there can be transcription errors when the data are transferred from laboratory notebook to publications and databases. For example, when screening 300 human protein sequences in the SWISS-PROT database that had been published separately more than once, Bork (1996) found that 0.3% of the amino acids were different. This is a lower limit, for many corrections will already have been made and in many instances the sequences appearing in two different publications are not independent. Note that only stop codons and frame shifts can be detected unambiguously: point mutations are hard to verify, as natural polymorphisms or strain differences cannot easily be excluded. Sequencing by hybridization may be of great use here. Other database problems include misspelling of gene names, resulting in confusion with genes of similar name or the generation of spurious synonyms; different genes being given the same name; and multiple synonyms for the same gene. Examples of the latter are the E. coli gene hns, which has eight synonyms, and the protein annexin V, which has five synonyms.
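A discrepancy rate of the kind quoted above can be illustrated with a small Python sketch. It assumes the two published versions of each protein are already aligned and of equal length, so only substitutions are counted; insertions, deletions and frame shifts would need a proper alignment step. The function names are hypothetical and this is not a description of the procedure used by Bork (1996).

```python
def count_differences(seq_a: str, seq_b: str) -> int:
    """Count positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length (indels are not handled)")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


def mismatch_fraction(seq_a: str, seq_b: str) -> float:
    """Fraction of positions that differ between two versions of one sequence."""
    if not seq_a:
        raise ValueError("sequences must be non-empty")
    return count_differences(seq_a, seq_b) / len(seq_a)


def pooled_mismatch_fraction(pairs) -> float:
    """Pool differences over many (seq_a, seq_b) pairs, weighting by length."""
    total_differences = 0
    total_positions = 0
    for seq_a, seq_b in pairs:
        total_differences += count_differences(seq_a, seq_b)
        total_positions += len(seq_a)
    if total_positions == 0:
        raise ValueError("no residues to compare")
    return total_differences / total_positions
```

Pooling over all pairs, rather than averaging per-protein fractions, weights long and short proteins by the number of residues they contribute. As the text notes, a difference found this way cannot by itself be classed as an error, since polymorphisms and strain differences produce the same signal.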
Analysing sequence data

Discovering new genes and their functions is a key step in analysing new sequence data. The process is facilitated by special-purpose gene-finding software, by searches in key databases and by programs for finding particular sites relevant to gene expression, e.g. splice sites and promoters. Unfortunately, no one software package contains all the necessary tools. Rather, optimal gene finding is dependent on combining evidence derived from use of multiple software tools (Table 7.2).

Fickett (1996) has described a framework for finding genes which makes use of a number of different software programs. Evidence for the presence or absence of genes in a sequence is gathered from a number of sources. These include sequence similarity to other features, such as repeats, which are unlikely to overlap protein-coding sequences, sequence similarity to other genes, statistical regularity evincing apparent codon bias over a region, and matches to template patterns for functional sites on the DNA, such as consensus sequences for TATA boxes. All the information so gathered is integrated to make as coherent a picture as possible.

When analysing sequences from eukaryotes, it is best to locate and remove interspersed repeats before searching for genes. Not only can repeats confuse other analyses, such as database searches, but they provide important negative information on the location of gene features. For example, such repeats rarely overlap the promoters or coding portions of exons. Once this is done, the next step is to identify open reading frames (ORFs). Despite the availability of sophisticated software search routines, unambiguous assignment of ORFs is not easy. For example, the Haemophilus influenzae genome sequence submitted to the database included 1747 predicted protein-encoding genes (Fleischmann et al. 1995). When Tatusov et al. (1996) reanalysed the sequence data using different algorithms and different discrimination criteria, they identified a new set of 1703 putative protein-encoding genes. In addition to 1572 ORFs, which remained the same, they identified 23 new ORFs, modified 107 others and discarded the balance. Note that gene finding is relatively easy in compact and almost intron-free genomes, such as yeast. In higher plants and animals, the task is much greater, for a 2 kb ORF could be split into 15 exons spread over 30 kb of genomic DNA.
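The identification of open reading frames discussed in this section can be illustrated with a minimal Python sketch. It is an assumption-laden toy, not any of the gene-finding programs referred to here: it treats an ORF as an ATG followed in the same frame by a stop codon, uses the standard stop codons only, ignores introns entirely, and uses a hypothetical `min_codons` threshold as its sole discrimination criterion. All function names are invented for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(_COMPLEMENT)[::-1]


def orfs_in_strand(dna: str, min_codons: int):
    """Yield (start, end, frame) for ATG...stop ORFs on one strand.

    Coordinates are 0-based and end-exclusive; the stop codon is included.
    min_codons counts codons from the ATG up to, but not including, the stop.
    """
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    yield start, i + 3, frame
                start = None  # resume the search after this stop codon


def find_orfs(dna: str, min_codons: int = 100):
    """Scan all six reading frames; '-' coordinates refer to the reverse complement."""
    dna = dna.upper()
    results = [("+", s, e, f) for s, e, f in orfs_in_strand(dna, min_codons)]
    results += [("-", s, e, f)
                for s, e, f in orfs_in_strand(reverse_complement(dna), min_codons)]
    return results
```

Changing `min_codons` changes the set of ORFs reported, which is one simple way to see why two analyses of the same genome sequence can yield different gene counts. Because the sketch requires an uninterrupted reading frame, it would miss a gene whose coding sequence is split into many exons, as described for higher plants and animals.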
Database searches
Searching for a known homologue is the most widely used means of identifying genes in a new sequence. If a putative protein encoded by an uncharacterized ORF shows statistically significant similarity to another protein of known function, this provides strong evidence that the ORF in question is a bona fide new gene and, at the same time, predicts its likely function. Even if the homologue of the new protein has not been characterized, useful information is produced in the form of conserved motifs that may be important for protein function. In this way, Koonin et al. (1994) analysed the information contained in the complete sequence of yeast chromosome III and found that 61% of the probable gene products had significant similarities to sequences in the current databases. As many as 54% of them had