Biology Reference
In-Depth Information
software) as well as sequencing of mapped, unique polymerase chain reaction (PCR)
products from freshly prepared genomic DNA.
The final step required to create a finished single chromosomal sequence was to
determine the number of tandem copies of a 672-base DNA sequence in a repeat
region of unknown length. This was done by amplifying the full tandem repeat
insert by long-range PCR with unique upstream and downstream primers. We then
determined the size of the product (amplified DNA) between the unique sequences.
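The arithmetic implied by this step can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name, the product size, and the flanking length are hypothetical values chosen for the example, and only the 672-base unit length comes from the text.

```python
REPEAT_UNIT_BP = 672  # repeat unit length stated in the text


def estimate_repeat_copies(product_bp, flank_bp, unit_bp=REPEAT_UNIT_BP):
    """Estimate tandem repeat copy number from a PCR product size.

    product_bp: measured size of the long-range PCR product (hypothetical input)
    flank_bp:   total unique sequence in the product outside the repeat array,
                primers included (hypothetical input)
    """
    repeat_span = product_bp - flank_bp
    if repeat_span < 0:
        raise ValueError("product is shorter than the flanking sequence")
    return repeat_span / unit_bp


# Hypothetical example: a 7,220 bp product with 500 bp of unique flank.
copies = estimate_repeat_copies(product_bp=7220, flank_bp=500)
print(round(copies))
```

Because gel-based sizing of long products is approximate, the estimate is rounded to the nearest whole copy in practice.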
Protein Sequence Predictions/Open Reading Frames (ORFs)
Annotation done at Oak Ridge National Laboratory consisted of gene calls using
CRITICA [11], Glimmer [12], and Generation [http://compbio.ornl.gov/].
Annotation at the Virtual Institute for Microbial Stress and Survival (VIMSS)
[http://www.microbesonline.org/] used bidirectional best hits as well as
recruitment to TIGRfam HMMs, as described in Alm et al. [13]. Briefly,
protein-coding predictions derived from NCBI, or identified using CRITICA with
supplemental input from Glimmer, were analyzed for domain identities using the
models deposited in the InterPro, UniProt, PRODOM, Pfam, PRINTS, SMART, PIR
SuperFamily, SUPERFAMILY, and TIGRfam databases [13]. Orthologs were identified
using bidirectional unique best hits with greater than 75% coverage. RPS-BLAST
searches against the NCBI COGs in the CDD database were used to assign proteins
to COG models when the best-hit E-value was <1e-5 and coverage was >60%.
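The two filtering rules above (bidirectional unique best hits with more than 75% coverage, and COG assignment at E-value <1e-5 with more than 60% coverage) can be illustrated with a short sketch. It is not the VIMSS pipeline: the function names, the tuple layout, and the use of bit score to rank hits are assumptions made for the example, and the text does not say whether coverage refers to the query, the subject, or both.

```python
def best_hits(hits, min_coverage):
    """Return {query: subject} for the unique top-scoring hit of each query.

    hits: iterable of (query, subject, bit_score, coverage) tuples, with
    coverage given as a fraction (assumed layout). Queries whose top score
    is tied between different subjects are dropped as non-unique.
    """
    best = {}  # query -> (subject, score, is_unique)
    for query, subject, score, coverage in hits:
        if coverage <= min_coverage:
            continue
        current = best.get(query)
        if current is None or score > current[1]:
            best[query] = (subject, score, True)
        elif score == current[1] and subject != current[0]:
            best[query] = (current[0], current[1], False)
    return {q: s for q, (s, _, unique) in best.items() if unique}


def bidirectional_best_hits(a_vs_b, b_vs_a, min_coverage=0.75):
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b, min_coverage)
    reverse = best_hits(b_vs_a, min_coverage)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


def assign_cog(rps_hits, max_evalue=1e-5, min_coverage=0.60):
    """Return the COG of the best (lowest E-value) hit if it passes both cutoffs.

    rps_hits: list of (cog_id, evalue, coverage) tuples (assumed layout).
    """
    if not rps_hits:
        return None
    cog_id, evalue, coverage = min(rps_hits, key=lambda h: h[1])
    if evalue < max_evalue and coverage > min_coverage:
        return cog_id
    return None


# Hypothetical hit tables for two genomes, A and B.
a_vs_b = [("a1", "b1", 200.0, 0.90), ("a1", "b2", 150.0, 0.95),
          ("a2", "b2", 180.0, 0.70)]  # a2 fails the coverage cutoff
b_vs_a = [("b1", "a1", 198.0, 0.88), ("b2", "a2", 175.0, 0.80)]
print(bidirectional_best_hits(a_vs_b, b_vs_a))
print(assign_cog([("COG0001", 1e-20, 0.80), ("COG0002", 1e-8, 0.90)]))
```

In this toy input only the a1/b1 pair is reciprocal, because a2's hit to b2 is discarded at the coverage step before the reverse check is made.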
Manual Curation
Every predicted protein in the VIMSS database [http://www.microbesonline.org/]
[13] was assessed to compare insights obtained from recruitment to models from
several databases (TIGRfams, COGs, EC, and InterPro). The most definitive
functional assignments were captured in an Excel spreadsheet with data entries
for all proteins predicted in the VIMSS database. Extensive manual curation of
the predicted protein set was carried out using a combination of tools,
including the VIMSS analysis tools, creation and assessment of HMMs, and
phylogenomic analysis, as described. Changes in gene functional predictions and
naming were captured in the Excel spreadsheet, and predictions with strong
phylogenetic evidence of function were posted using the interactive VIMSS
web-based annotation interface.
Phylogenomic Analysis: Flower Power, SCI-PHY, and HMM Scoring
HMMs were generated for a large subset of proteins of interest, as detailed, to
predict functional classification with the highest confidence measures currently available.
The HMMs allowed recruitment of proteins to phylogenetic tree alignments that most
closely reflect evolutionary relatedness across species. The proteins were assembled
within clades of proteins that are aligned along their full length (no missing functional
domains), and that allow high confidence of shared function in each species.
Gene Family Expansion
A clustered set of paralogs was used to search for recent gene duplication events.
After an initial assessment of the VIMSS gene information/homolog data, candidate
 