Predicting Protein Function and Structure
Using Bioinformatics Protocols:
A Case Study of the SAND Protein Family
Amanda COTTAGE¹, Lisa J. MULLAN², Miriam B.D. PORTELA¹, Elizabeth HELLEN¹,
Tim J. CARVER³, Sunil PATEL⁴, Tanya VAVOURI¹, Greg ELGAR¹, Yvonne J.K. EDWARDS⁵
¹ MRC Rosalind Franklin Centre for Genomic Research, Genome Campus, Hinxton, Cambridge, CB10 1SB, UK.
² EMBL - European Bioinformatics Institute, Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.
³ Wellcome Trust Sanger Institute, Genome Campus, Hinxton, Cambridge, CB10 1SA, UK.
⁴ Accelrys Inc., 334 Cambridge Science Park, Milton Road, Cambridge, CB4 0WN, UK.
⁵ Comparative Genomics & Bioinformatics, School of Biological and Chemical Sciences, Queen Mary, University of London, Mile End Road, London E1 4NS, UK.
Abstract. In this chapter, bioinformatics techniques are used to gain some insights
into the structure and function of a largely uncharacterised protein family called
SAND. From a phylogenomics analysis, we establish that SAND is a eukaryotic gene
and show that a duplication event gave rise to two SAND genes in vertebrates.
SAND was found to be absent from archaea and bacteria. From a phylogenetic
analysis, we characterise a number of subfamilies. With the use of multiple sequence
alignments, we highlight amino acids and sequence motifs conserved in SAND
proteins, as well as those invariant within subfamilies or taxonomic groups. In addition, we
predict a secondary structure and solvent accessibility profile and carry out protein
fold predictions for the SAND proteins.
Introduction
Predicting protein structure from sequence often involves tailored sequence similarity
searches against specialised databases. Examples include a BLASTP search against
NRL3D (a databank of protein sequences of known structure), a PSI-BLAST search
against a non-redundant protein databank, and a HMMER search against PFAM (Tables 1-3).
Protein structure prediction could also include performing multiple sequence
alignments, secondary structure predictions, solvent accessibility predictions, protein fold
recognition, construction of models at atomic resolution and model validation. Not every
protein structure prediction project uses all of these techniques. The central step
of a typical protein structure prediction is to identify a structural target
from which to extrapolate three-dimensional information for the query sequence. If this
step is in error, the whole prediction will be incorrect, which makes it the most crucial part
of the project.
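
As an illustration of this central step, the sketch below filters sequence-similarity hits and keeps the best-scoring candidate structure. It is a minimal example and not the protocol used in this chapter. It assumes the hits are available in the standard 12-column BLAST tabular layout; the function names, the E-value and percent-identity cutoffs, and the example rows are all hypothetical and chosen only for illustration.

```python
from typing import Iterable, List, NamedTuple, Optional


class Hit(NamedTuple):
    query: str
    subject: str
    identity: float   # percent identity over the alignment
    length: int       # alignment length
    evalue: float
    bitscore: float


def parse_tabular(lines: Iterable[str]) -> List[Hit]:
    """Parse 12-column BLAST tabular output, skipping blank and comment lines."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")
        hits.append(Hit(f[0], f[1], float(f[2]), int(f[3]),
                        float(f[10]), float(f[11])))
    return hits


def pick_target(hits: List[Hit],
                max_evalue: float = 1e-5,      # assumed cutoff, not from the chapter
                min_identity: float = 30.0     # assumed cutoff, not from the chapter
                ) -> Optional[Hit]:
    """Return the highest bit-score hit passing both cutoffs, or None."""
    passing = [h for h in hits
               if h.evalue <= max_evalue and h.identity >= min_identity]
    return max(passing, key=lambda h: h.bitscore) if passing else None


# Hypothetical rows: identifiers and scores are invented for illustration.
EXAMPLE = [
    "query1\tstructA\t42.5\t310\t120\t8\t5\t310\t12\t320\t3e-60\t215.0",
    "query1\tstructB\t27.0\t150\t90\t6\t20\t170\t1\t150\t2e-04\t45.1",
    "query1\tstructC\t35.1\t280\t150\t9\t10\t290\t3\t282\t1e-30\t130.0",
]

best = pick_target(parse_tabular(EXAMPLE))
print(best.subject if best else "no reliable structural target found")
```

Returning None when no hit passes the cutoffs is deliberate: given that an incorrect structural target invalidates everything built on it, reporting that no reliable target was found is safer than silently accepting a weak hit.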