Biology Reference
In-Depth Information
multicellular eukaryotes (the nematode Caenorhabditis elegans and the fruit fly Drosophila melanogaster)
shared by prokaryotes or archaea. Furthermore, another updated version of the database contained
eukaryotic orthologous groups of proteins, named KOGs, with 59,838 proteins distributed into
4,852 KOGs, drawn from 110,655 analyzed eukaryotic gene products from three animals (C. elegans,
D. melanogaster and Homo sapiens), one plant (Arabidopsis thaliana), two fungi (Saccharomyces
cerevisiae and Schizosaccharomyces pombe) and the intracellular microsporidian parasite Encephalitozoon
cuniculi (Tatusov et al., 2003). Due to these developments, the COG database has become the main
platform for functional annotation of newly sequenced genomes and for drawing conclusions on
genome evolution. The role of the COG database in comparative and functional genomics has been
elaborated by Kaufmann (2006). Functionally, the COGs have been classified into 18 broad categories.
Once the category of a COG is known, it is easier to predict the function of a new protein sequence on
the basis of the known function of existing COGs, because the new sequence very likely
exhibits the same or a closely similar cellular function as the other members of the same
COG. Besides COGNITOR, other tools of the COG database comprise phylogenetic patterns search,
extended phylogenetic patterns search, phylogenetic COG ranking, gene or domain fusion and the gene
context tool. The COG database helps in identifying the core genes or minimal genome (genes
with orthologues), conserved hypothetical proteins, PACE proteins (proteins of archaea conserved
in eukaryotes) and orphan ORFs (ORFans) that do not match any known sequence. When two
genomes are compared, clusters of orthologous groups (COGs) of proteins are identified by
BLASTp analysis against the COG reference data set, and the variable genes (genes without orthologues)
in the two organisms/strains under comparison are separated. The updated versions of the COGs for
unicellular organisms and the eukaryotic KOGs are accessible at http://www.ncbi.nlm.nih.gov/COG/
and via ftp at ftp://ftp.ncbi.nih.gov/pub/COG/, respectively.
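
The two-genome comparison described above can be sketched in a few lines. This is only an illustration, assuming each protein has already been assigned a COG identifier (or none) from its best BLASTp hit against the COG reference set. The function name, the gene labels and the COG identifiers used here are hypothetical placeholders, not real annotations.

```python
from typing import Dict, Optional, Set, Tuple


def split_core_variable(
    cogs_a: Dict[str, Optional[str]],
    cogs_b: Dict[str, Optional[str]],
) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return (shared COG ids, variable genes of A, variable genes of B).

    Each input maps a gene label to its COG id, or to None when the
    gene has no COG assignment.
    """
    present_a = {cog for cog in cogs_a.values() if cog is not None}
    present_b = {cog for cog in cogs_b.values() if cog is not None}
    shared = present_a & present_b

    # A gene is treated as variable when its COG is absent from the
    # other genome, or when it has no COG assignment at all.
    variable_a = {gene for gene, cog in cogs_a.items() if cog not in shared}
    variable_b = {gene for gene, cog in cogs_b.items() if cog not in shared}
    return shared, variable_a, variable_b


# Hypothetical toy assignments, for illustration only.
genome_a = {"a1": "COG0001", "a2": "COG0002", "a3": None}
genome_b = {"b1": "COG0001", "b2": "COG0003"}
shared, variable_a, variable_b = split_core_variable(genome_a, genome_b)
```

Here the shared set approximates the genes with orthologues in both genomes, while the two variable sets hold the genes without an orthologue in the other genome.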
A similarity search of all protein-encoding genes is conducted against databases and the
genes are classified into different categories. These are: (i) protein-coding genes whose function can
be predicted, (ii) protein-coding genes without function prediction, (iii) genes of unknown function with
similarity to known sequences, (iv) genes of unknown function without similarity, (v) protein-coding genes encoding signal
peptides and (vi) protein-coding genes encoding transmembrane proteins. The overall percentage
coding capacity of the genome is calculated based on the percentage of genes that encode proteins
with probable function. The putative protein-encoding genes (or open reading frames, ORFs) are
generally identified by start codons such as ATG, GTG, TTG or ATT, and they are then designated
by a three-letter prefix followed by a serial number. The first of these letters represents the species name, the
second letter specifies the length of the ORF ('l' if longer than 100 codons, 's' if shorter
than 100 codons) and the third letter represents the reading direction on the circular map
('r' if towards the right, 'l' if towards the left). For example, in
Synechocystis sp. strain PCC 6803 the name of a putative protein-encoding gene or ORF, sll0163, indicates that
the Synechocystis (s) gene is longer than 100 codons (l) and is read towards the left (l), and 0163 is the
serial number of that ORF. From the total number of genes so identified, one can calculate
the gene density, that is, the average number of bp per gene in a particular genome.
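
The naming convention and the gene density calculation can be illustrated with a short sketch. It assumes names of exactly the form described above (one species letter, 'l' or 's', 'r' or 'l', then digits). The helper names `parse_orf_name` and `gene_density` are hypothetical, and the genome size and gene count in the usage lines are made-up round numbers, not values for any real organism.

```python
import re
from typing import NamedTuple


class OrfName(NamedTuple):
    species: str                  # first letter, e.g. "s" for Synechocystis
    longer_than_100_codons: bool  # second letter: "l" long, "s" short
    direction: str                # third letter: "right" ("r") or "left" ("l")
    serial: int                   # trailing serial number


_ORF_RE = re.compile(r"^([a-z])([ls])([rl])(\d+)$")


def parse_orf_name(name: str) -> OrfName:
    """Decode an ORF name such as 'sll0163' into its components."""
    match = _ORF_RE.match(name)
    if match is None:
        raise ValueError(f"not a recognised ORF name: {name!r}")
    species, length, direction, serial = match.groups()
    return OrfName(
        species=species,
        longer_than_100_codons=(length == "l"),
        direction="right" if direction == "r" else "left",
        serial=int(serial),
    )


def gene_density(genome_length_bp: int, gene_count: int) -> float:
    """Average number of bp per gene (one gene every N bp)."""
    if gene_count <= 0:
        raise ValueError("gene_count must be positive")
    return genome_length_bp / gene_count


orf = parse_orf_name("sll0163")
# Illustrative numbers only: 1,000,000 bp and 1,000 genes.
density = gene_density(1_000_000, 1_000)
```

For sll0163 the parser reports a long ORF read towards the left with serial number 163, matching the worked example in the text.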
Specific gene locations on the chromosome are designated by giving the bp coordinates
at which the gene or ORF is located. For example, the four copies of the rRNA gene cluster in the genome
of Anabaena sp. strain PCC 7120 occur in the order 16S-23S-5S at coordinates 2,375,734-2,302,211;
2,500,525-2,505,531; 4,919,771-4,914,765 and 5,947,188-5,942,409, respectively. A convenient method
to identify the sites of origin and termination of replication is to find a shift in GC content known
as GC skew. The leading strand is generally found to contain more guanine than cytosine residues.
This fact is used to predict the origin and terminus locations. It is represented by a sum of (G-C)/
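
A minimal sketch of a cumulative GC skew calculation follows. The formula in the text is cut off, so the sketch assumes the commonly used per-window definition (G-C)/(G+C), summed over consecutive non-overlapping windows. The function names, the window size and the toy sequence are illustrative assumptions, not taken from the source.

```python
from typing import List


def gc_skew(window: str) -> float:
    """(G - C) / (G + C) for one window; 0.0 if the window has no G or C."""
    window = window.upper()
    g = window.count("G")
    c = window.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


def cumulative_gc_skew(seq: str, window: int) -> List[float]:
    """Running sum of the GC skew over consecutive non-overlapping windows."""
    if window <= 0:
        raise ValueError("window must be positive")
    total = 0.0
    running = []
    for start in range(0, len(seq) - window + 1, window):
        total += gc_skew(seq[start:start + window])
        running.append(total)
    return running


# Toy sequence: a C-rich window, two G-rich windows, then a C-rich window.
toy = "CCCC" + "GGGG" + "GGGG" + "CCCC"
curve = cumulative_gc_skew(toy, window=4)
min_window = curve.index(min(curve))
max_window = curve.index(max(curve))
```

On a real circular bacterial chromosome, the minimum of such a cumulative curve is usually taken as a candidate for the replication origin and the maximum as a candidate for the terminus. That reading should be treated as a prediction to be checked against other evidence.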