ways that could be understood and used by biologists. For example,
the most immediate problem was how to represent huge objects like
genomes in a useful way. Although AceDB provided a model for many
of their ideas, the EBI team decided to replace its hierarchical file structure
with a relational database management system that could provide
more adaptability to the large data volumes associated with the human
genome. 28
The basic unit of representation in Ensembl, the “contig,” was dictated
by the way sequence was produced in high-throughput sequencing
machines. A contig represented a contiguous piece of sequence from a
single bacterial artificial chromosome (BAC), the laboratory construct
used to amplify and sequence the DNA. The relationship between the
contigs and the other information in the database is shown in figure 6.3.
The letters of the DNA sequence itself are stored in a table called “dna,”
while the instructions for assembling a complete chromosome from several
contigs are contained in a separate table called “assembly.” 29 As the
Ensembl developers explain, “a gene from Ensembl's perspective is a set
of transcripts that share at least one exon. This is a more limited defini-
FIGURE 6.3 Ensembl database schema. The “contig” is the central object; each contig corresponds to
one DNA sequence (listed in the table “dna sequence”) but may have multiple features (“dna_align_feature”)
and exons (“exon”). Contigs can be put together into multiple assemblies (“assembly”) which
stitch together various contigs. (Stabenau et al., “Ensembl core software.” Reproduced with permission of
Cold Spring Harbor Laboratory Press.)
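
The separation between the “dna” table (raw sequence letters) and the “assembly” table (instructions for stitching contigs into a chromosome) can be illustrated with a small sketch. This is not the actual Ensembl schema: the column names (chr_start, contig_start, contig_end), the helper functions, and the toy sequences are all assumptions made for illustration, and SQLite stands in for whatever relational database system was really used.

```python
import sqlite3


def build_demo_db():
    """Create a toy, in-memory database loosely modeled on the contig/dna/assembly split.

    All table columns and rows here are hypothetical, not Ensembl's real schema or data.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE dna (
            dna_id   INTEGER PRIMARY KEY,
            sequence TEXT NOT NULL          -- the letters of the DNA sequence
        );
        CREATE TABLE contig (
            contig_id INTEGER PRIMARY KEY,
            name      TEXT NOT NULL,
            dna_id    INTEGER NOT NULL REFERENCES dna(dna_id)
        );
        CREATE TABLE assembly (
            chromosome   TEXT NOT NULL,     -- which chromosome this row helps build
            chr_start    INTEGER NOT NULL,  -- 1-based position on the chromosome
            contig_id    INTEGER NOT NULL REFERENCES contig(contig_id),
            contig_start INTEGER NOT NULL,  -- 1-based, inclusive slice of the contig
            contig_end   INTEGER NOT NULL
        );
        """
    )
    conn.executemany(
        "INSERT INTO dna VALUES (?, ?)",
        [(1, "AAACCCGG"), (2, "GGTTTT")],
    )
    conn.executemany(
        "INSERT INTO contig VALUES (?, ?, ?)",
        [(1, "contig_A", 1), (2, "contig_B", 2)],
    )
    # The two toy contigs overlap by "GG"; the assembly rows say which part of each to use.
    conn.executemany(
        "INSERT INTO assembly VALUES (?, ?, ?, ?, ?)",
        [("chrDemo", 1, 1, 1, 6), ("chrDemo", 7, 2, 3, 6)],
    )
    return conn


def assemble_chromosome(conn, chromosome):
    """Join assembly -> contig -> dna and concatenate the selected contig slices in order."""
    rows = conn.execute(
        """
        SELECT d.sequence, a.contig_start, a.contig_end
        FROM assembly AS a
        JOIN contig AS c ON c.contig_id = a.contig_id
        JOIN dna    AS d ON d.dna_id = c.dna_id
        WHERE a.chromosome = ?
        ORDER BY a.chr_start
        """,
        (chromosome,),
    ).fetchall()
    # Convert 1-based inclusive coordinates to Python slices.
    return "".join(seq[start - 1:end] for seq, start, end in rows)


if __name__ == "__main__":
    demo = build_demo_db()
    print(assemble_chromosome(demo, "chrDemo"))
```

In this sketch the sequence data never changes when a new assembly is produced; only the rows of the assembly table differ, which is one plausible reading of why the same contigs can take part in multiple assemblies.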