expressive, and customized to express a variety of genomic models such as structural variants, genotypes, variant calling annotations, variant effects, and more.
UC Berkeley is a member of the Global Alliance for Genomics & Health, a non-governmental, public-private partnership consisting of more than 220 organizations across 30 nations, with the goal of maximizing the potential of genomics medicine through effective and responsible data sharing. The Global Alliance has embraced this literate programming approach and publishes its schemas in Avro IDL as well. Using Avro has allowed researchers around the world to talk about data at the logical level, without concern for computer languages or on-disk formats.
Column-oriented access with Parquet
together as a single line of text or a binary record. (See Other File Formats and Column-Oriented Formats for further discussion of row- versus column-oriented formats.) A single paired-end read in a SAM file might look like this:
read1 99 chrom1 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
read1 147 chrom1 37 30 9M = 7 -39 CAGCGGCAT * NM:i:1
A typical SAM/BAM file contains many millions of rows, one for each DNA read that came off the sequencer. The preceding text fragment translates loosely into the view shown in Table 23-3.
Table 23-3. Logical view of SAM fragment
Name   Reference    Position  MapQ  CIGAR       Sequence
read1  chromosome1  7         30    8M2I4M1D3M  TTAGATAAAGGATACTG
read1  chromosome1  37        30    9M          CAGCGGCAT
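As a rough illustration of how the text fragment maps onto the logical view in Table 23-3, the following Python sketch splits each SAM line into fields and keeps the columns shown in the table. The names SamRecord and parse_sam_line are hypothetical, chosen only for this example. The sketch assumes whitespace-separated fields in the standard SAM column order (real SAM files are tab-delimited) and ignores header lines and optional tags such as NM:i:1.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    """Subset of SAM columns used in the logical view (hypothetical type)."""
    name: str
    reference: str
    position: int
    mapq: int
    cigar: str
    sequence: str


def parse_sam_line(line: str) -> SamRecord:
    """Parse one SAM alignment line into the logical view.

    Assumes the standard column order: QNAME, FLAG, RNAME, POS, MAPQ,
    CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL, followed by optional tags.
    """
    fields = line.split()  # splits on tabs or spaces
    return SamRecord(
        name=fields[0],
        reference=fields[2],
        position=int(fields[3]),
        mapq=int(fields[4]),
        cigar=fields[5],
        sequence=fields[9],
    )


sam_lines = [
    "read1 99 chrom1 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *",
    "read1 147 chrom1 37 30 9M = 7 -39 CAGCGGCAT * NM:i:1",
]

records = [parse_sam_line(line) for line in sam_lines]
for record in records:
    print(record)
```

A row-oriented parser like this has to read every field of every line even when only one column, such as the position, is needed.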
In this example, the read, identified as read1, was mapped to the reference genome at chromosome1, positions 7 and 37. This is called a “paired-end” read as it represents a single strand of DNA that was read from each end by the sequencer. By analogy, it's like reading an array of length 150 from 0..50 and 150..100.
The MapQ score represents the probability that the sequence is mapped to the reference correctly. MapQ scores of 20, 30, and 40 have a probability of being correct of 99%, 99.9%, and 99.99%, respectively. To calculate the probability of error from a MapQ score, use the expression 10^(-MapQ/10) (e.g., 10^(-30/10) is a probability of 0.001).
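The conversion from MapQ to probability can be checked with a few lines of Python. This is a minimal sketch of the expression given in the text. The function name mapq_to_error_probability is hypothetical and not part of any SAM library.

```python
def mapq_to_error_probability(mapq: float) -> float:
    """Return the probability that a read is mapped incorrectly: 10^(-MapQ/10)."""
    return 10 ** (-mapq / 10)


for mapq in (20, 30, 40):
    p_error = mapq_to_error_probability(mapq)
    p_correct = 1 - p_error
    print(f"MapQ {mapq}: P(error) = {p_error:g}, P(correct) = {p_correct:.2%}")
```

Because the scale is logarithmic, each additional 10 points of MapQ cuts the error probability by a factor of 10.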