expressive, and customized to express a variety of genomic models such as structural
variants, genotypes, variant calling annotations, variant effects, and more.
UC Berkeley is a member of the Global Alliance for Genomics & Health, a nongovernmental,
public-private partnership consisting of more than 220 organizations across 30 nations,
with the goal of maximizing the potential of genomic medicine through effective
and responsible data sharing. The Global Alliance has embraced this literate programming
approach and publishes its schemas in Avro IDL as well. Using Avro has allowed
researchers around the world to talk about data at the logical level, without concern for
computer languages or on-disk formats.
Column-oriented access with Parquet
The SAM and BAM [164] file formats are row-oriented: the data for each record is stored
together as a single line of text or a binary record. (See Other File Formats and
Column-Oriented Formats for further discussion of row- versus column-oriented formats.) A single
paired-end read in a SAM file might look like this:
read1 99 chrom1 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
read1 147 chrom1 37 30 9M = 7 -39 CAGCGGCAT * NM:i:1
A typical SAM/BAM file contains many millions of rows, one for each DNA read that
came off the sequencer. The preceding text fragment translates loosely into the view
shown in Table 23-3.
Table 23-3. Logical view of SAM fragment
Name   Reference    Position  MapQ  CIGAR       Sequence
read1  chromosome1  7         30    8M2I4M1D3M  TTAGATAAAGGATACTG
read1  chromosome1  37        30    9M          CAGCGGCAT
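To make the row-versus-column distinction concrete, the following is a minimal Python sketch that parses the two SAM lines from the fragment into records (the row-oriented view) and then pivots them into one list per field (a column-oriented view, which is roughly how a columnar format groups values). The field names follow the 11 mandatory columns of the SAM specification; the helper names parse_sam_line and to_columns are hypothetical, and the sketch splits on whitespace for readability, whereas real SAM files are tab-delimited.

```python
# The two SAM lines from the fragment (whitespace-separated here for
# readability; real SAM files use tabs).
SAM_LINES = [
    "read1 99 chrom1 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *",
    "read1 147 chrom1 37 30 9M = 7 -39 CAGCGGCAT * NM:i:1",
]

# The 11 mandatory SAM columns, in order.
SAM_FIELDS = ["qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual"]


def parse_sam_line(line):
    """Parse one SAM alignment line into a dict (hypothetical helper)."""
    parts = line.split()
    record = dict(zip(SAM_FIELDS, parts[:11]))
    # Convert the numeric columns from text to integers.
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(record[key])
    # Anything after the 11th column is an optional TAG:TYPE:VALUE field.
    record["tags"] = parts[11:]
    return record


def to_columns(records, fields):
    """Pivot a list of row dicts into one list of values per field."""
    return {field: [rec[field] for rec in records] for field in fields}


# Row-oriented view: one dict per read.
rows = [parse_sam_line(line) for line in SAM_LINES]

# Column-oriented view: one list per field, matching the logical table.
columns = to_columns(rows, ["qname", "rname", "pos", "mapq", "cigar", "seq"])

print(columns["pos"])   # [7, 37]
print(columns["mapq"])  # [30, 30]
```

In the column-oriented view, a query that needs only positions and MapQ scores can read two short lists instead of scanning every full record.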
In this example, the read, identified as read1, was mapped to the reference genome at
chromosome1, positions 7 and 37. This is called a “paired-end” read as it represents a
single strand of DNA that was read from each end by the sequencer. By analogy, it's like
reading an array of length 150 from 0..50 and 150..100.
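The pairing is also recorded in the second column of each SAM line (99 and 147 in the fragment), which is a bitwise flag defined by the SAM specification. As a rough sketch, the flag can be decoded as follows; the bit values come from the specification, while the label strings and the decode_flag helper are illustrative names only, and only the lower eight bits are covered.

```python
# Lower eight SAM flag bits, as defined by the SAM specification.
# The label strings are informal names chosen for this sketch.
SAM_FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
}


def decode_flag(flag):
    """Return the labels of all bits set in a SAM flag (hypothetical helper)."""
    return [name for bit, name in SAM_FLAG_BITS.items() if flag & bit]


print(decode_flag(99))
# ['paired', 'proper_pair', 'mate_reverse_strand', 'first_in_pair']
print(decode_flag(147))
# ['paired', 'proper_pair', 'reverse_strand', 'second_in_pair']
```

The two flags are complementary: the first record is the forward-strand first mate, and the second is its reverse-strand partner, which matches the "read from each end" description.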
The MapQ score encodes, on a logarithmic scale, the probability that the sequence is mapped
to the reference correctly. MapQ scores of 20, 30, and 40 correspond to a probability of
being correct of 99%, 99.9%, and 99.99%, respectively. To calculate the probability of error
from a MapQ score, use the expression 10^(-MapQ/10) (e.g., 10^(-30/10) is a probability of 0.001).
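The conversion is simple enough to check directly. Here is a small Python sketch of the expression and its inverse; the function names are hypothetical, and the inverse is just the same formula solved for MapQ.

```python
import math


def mapq_to_error_probability(mapq):
    """Probability that the mapping is wrong: 10^(-MapQ/10)."""
    return 10 ** (-mapq / 10)


def error_probability_to_mapq(p_error):
    """Inverse of the expression above: MapQ = -10 * log10(p_error)."""
    return -10 * math.log10(p_error)


for mapq in (20, 30, 40):
    p_error = mapq_to_error_probability(mapq)
    # Print the error probability and the matching probability of a correct
    # mapping as a percentage (approximately 99%, 99.9%, and 99.99%).
    print(f"MapQ {mapq}: error {p_error:.4f}, correct {100 * (1 - p_error):.2f}%")
```

Because the scale is logarithmic, every additional 10 points of MapQ reduces the error probability by a factor of 10.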