The CIGAR explains how the individual nucleotides in the DNA sequence map to the
reference. [165] The Sequence is, of course, the DNA sequence that was mapped to the
reference.
There is a stark mismatch between the SAM/BAM row-oriented on-disk format and the
column-oriented access patterns common to genome analysis. Consider the following:
▪ A range query to find data for a particular gene linked to breast cancer, named
BRCA1: “Find all reads that cover chromosome 17 from position 41,196,312 to
41,277,500”
▪ A simple filter to find poorly mapped reads: “Find all reads with a MapQ less than
10”
▪ A search of all reads with insertions or deletions, called indels: “Find all reads
that contain I or D in the CIGAR string”
▪ Count the number of unique k-mers: “Read every Sequence and generate all
possible substrings of length k in the string”
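As a rough illustration of why these access patterns are column-oriented, the first three queries can be sketched in plain Python over a handful of in-memory records. The `Read` record, its field names, and the sample values are hypothetical simplifications, not the SAM/BAM or ADAM schema, and "cover" is interpreted here as "overlaps the region". Note that each query touches only one or two fields of every record.

```python
from collections import namedtuple

# Hypothetical, simplified read record; real SAM/BAM records carry more fields.
Read = namedtuple("Read", ["reference", "start", "end", "mapq", "cigar", "sequence"])

# Made-up sample data for illustration only.
reads = [
    Read("chromosome17", 41_196_300, 41_196_400, 60, "100M", "AGATCTGAAG"),
    Read("chromosome17", 41_300_000, 41_300_100, 5, "50M2I48M", "TTGACCA"),
    Read("chromosome1", 10_000, 10_100, 30, "40M3D60M", "GGCATTA"),
]

# Region from the text: chromosome 17, positions 41,196,312 to 41,277,500.
BRCA1 = ("chromosome17", 41_196_312, 41_277_500)


def overlaps(read, region):
    """True if the read overlaps the region (uses reference, start, end only)."""
    ref, lo, hi = region
    return read.reference == ref and read.start <= hi and read.end >= lo


def has_indel(read):
    """True if the CIGAR string contains an insertion (I) or deletion (D)."""
    return "I" in read.cigar or "D" in read.cigar


brca1_reads = [r for r in reads if overlaps(r, BRCA1)]   # range query
poorly_mapped = [r for r in reads if r.mapq < 10]        # MapQ filter
indel_reads = [r for r in reads if has_indel(r)]         # indel search
sequences = [r.sequence for r in reads]                  # input to k-mer counting
```

In a row-oriented file, every one of these queries would still have to read each record in full, including the fields it never inspects.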
Parquet's predicate pushdown feature allows us to rapidly filter reads for analysis (e.g.,
finding a gene, ignoring poorly mapped reads). Projection allows for precise
materialization of only the columns of interest (e.g., reading only the sequences for
k-mer counting).
Additionally, a number of the fields have low cardinality, making them ideal for data
compression techniques like run-length encoding (RLE). For example, given that humans
have only 23 pairs of chromosomes, the Reference field will have only a few dozen unique
values (e.g., chromosome1, chromosome17, etc.). We have found that storing BAM
records inside Parquet files results in ~20% compression. Using the PrintFooter
command in Parquet, we have found that quality scores can be run-length encoded and
bit-packed to compress ~48%, but they still take up ~70% of the total space. We're looking
forward to Parquet 2.0, so we can use delta encoding on the quality scores to compress the
file size even more.
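To see why a low-cardinality column such as Reference suits run-length encoding, consider the following minimal sketch. It is a conceptual illustration in plain Python, not Parquet's actual on-disk encoding, and the helper names and sample column are made up for the example.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse each run of equal adjacent values into a (value, count) pair."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def run_length_decode(pairs):
    """Expand (value, count) pairs back into the original list."""
    return [value for value, count in pairs for _ in range(count)]


# Hypothetical Reference column: reads sorted by position produce long runs.
reference_column = ["chromosome1"] * 4 + ["chromosome17"] * 3 + ["chromosome1"] * 2

encoded = run_length_encode(reference_column)
# Nine values collapse into three (value, count) pairs.
```

The longer the runs, the better this works, which is why a column with only a few dozen distinct values, stored contiguously, compresses so well.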
A simple example: k-mer counting using Spark and ADAM
Let's do “word count” for genomics: counting k-mers. The term k-mers refers to all the
possible subsequences of length k for a read. For example, if you have a read with the
sequence AGATCTGAAG, the 3-mers for that sequence would be ['AGA', 'GAT',
'ATC', 'TCT', 'CTG', 'TGA', 'GAA', 'AAG']. While this is a trivial
example, k-mers are useful when building structures like De Bruijn graphs for sequence
assembly. In this example, we are going to generate all the possible 21-mers from our reads,
count them, and then write the totals to a text file.
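Independent of Spark and ADAM, the definition of k-mers can be checked with a few lines of plain Python. This is only a single-machine sketch of the idea, with hypothetical helper names `kmers` and `count_kmers`; it is not the distributed program the example builds.

```python
from collections import Counter


def kmers(sequence, k):
    """Return every substring of length k, in order; empty if the read is shorter than k."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def count_kmers(sequences, k):
    """Count how often each k-mer occurs across all sequences (the "word count" step)."""
    counts = Counter()
    for seq in sequences:
        counts.update(kmers(seq, k))
    return counts


three_mers = kmers("AGATCTGAAG", 3)
# Matches the list in the text: AGA, GAT, ATC, TCT, CTG, TGA, GAA, AAG
```

A read of length n yields n - k + 1 k-mers, so the 10-base read above produces eight 3-mers.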