The CIGAR explains how the individual nucleotides in the DNA sequence map to the reference.
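As a rough illustration of what a CIGAR string encodes, the sketch below splits one into (length, operation) pairs and works out how many reference bases the read spans. The function names `parse_cigar` and `reference_span` are hypothetical helpers introduced here, not part of ADAM or any SAM library; the operation codes and the rule that M, D, N, = and X consume reference bases follow the SAM format.

```python
import re

# A CIGAR string is a sequence of <run length><operation code> pairs.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CIGAR_WHOLE = re.compile(r"(?:\d+[MIDNSHP=X])+")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, op) pairs (hypothetical helper)."""
    if not CIGAR_WHOLE.fullmatch(cigar):
        raise ValueError(f"not a valid CIGAR string: {cigar!r}")
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


def reference_span(cigar):
    """Count the reference bases covered by the alignment.

    M, D, N, = and X advance along the reference; I, S, H and P do not.
    """
    return sum(length for length, op in parse_cigar(cigar) if op in "MDN=X")


if __name__ == "__main__":
    print(parse_cigar("8M2I4M1D3M"))
    print(reference_span("8M2I4M1D3M"))
```

In this example the two inserted bases (2I) appear in the read but not in the reference, while the deleted base (1D) appears in the reference but not in the read.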
There is a stark mismatch between the SAM/BAM row-oriented on-disk format and the column-oriented access patterns common to genome analysis. Consider the following:
▪ A range query to find data for a particular gene linked to breast cancer, named BRCA1: “Find all reads that cover chromosome 17 from position 41,196,312 to 41,277,500”
▪ A simple filter to find poorly mapped reads: “Find all reads with a MapQ less than 10”
▪ A search of all reads with insertions or deletions, called indels: “Find all reads that contain I or D in the CIGAR string”
▪ Count the number of unique k-mers: “Read every Sequence and generate all possible substrings of length k in the string”
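To make the column-oriented nature of these queries concrete, here is a minimal sketch that runs the first three against a few toy in-memory records and projects the single column the fourth one needs. The `Read` class, its field names, the `end` coordinate, and the sample records are all made up for illustration; they are not ADAM's schema. "Cover" is interpreted here as "overlap the range", which is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Read:
    # Hypothetical, simplified record; real SAM/BAM records have more fields.
    reference: str
    start: int      # first reference position covered (inclusive)
    end: int        # last reference position covered (inclusive)
    mapq: int
    cigar: str
    sequence: str


# Toy records, invented purely for illustration.
reads = [
    Read("chromosome17", 41_196_310, 41_196_319, 60, "10M", "AGATCTGAAG"),
    Read("chromosome17", 41_300_000, 41_300_007, 5, "4M2I4M", "GATTACAGAT"),
    Read("chromosome1", 1_000, 1_010, 30, "5M1D5M", "CTGAAGATCT"),
]


def overlaps(read, reference, lo, hi):
    """True if the read overlaps [lo, hi] on the given reference."""
    return read.reference == reference and read.start <= hi and read.end >= lo


# Range query: touches only reference, start, and end.
brca1 = [r for r in reads if overlaps(r, "chromosome17", 41_196_312, 41_277_500)]

# Filter on mapping quality: touches only mapq.
poorly_mapped = [r for r in reads if r.mapq < 10]

# Indel search: touches only cigar.
indels = [r for r in reads if "I" in r.cigar or "D" in r.cigar]

# k-mer counting needs only the sequence column.
sequences = [r.sequence for r in reads]
```

Each query reads one to three fields and ignores the rest, which is why a row-oriented layout forces far more I/O than the analysis needs.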
Parquet's predicate pushdown feature allows us to rapidly filter reads for analysis (e.g., finding a gene, ignoring poorly mapped reads). Projection allows for precise materialization of only the columns of interest (e.g., reading only the sequences for k-mer counting).
Additionally, a number of the fields have low cardinality, making them ideal for data compression techniques like run-length encoding (RLE). For example, given that humans have only 23 pairs of chromosomes, the Reference field will have only a few dozen unique values (e.g., chromosome1, chromosome17, etc.). We have found that storing BAM records inside Parquet files results in ~20% compression. Using the PrintFooter command in Parquet, we have found that quality scores can be run-length encoded and bit-packed to compress ~48%, but they still take up ~70% of the total space. We're looking forward to Parquet 2.0, so we can use delta encoding on the quality scores to compress the file size even more.
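The reason low-cardinality columns compress well under run-length encoding can be seen in a small sketch. This shows only the general idea of RLE on a column of repeated values; it is not Parquet's actual on-disk encoding, and the function names and sample column are hypothetical.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def run_length_decode(pairs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in pairs for _ in range(count)]


# A toy Reference column, as it might look for reads sorted by position.
reference_column = ["chromosome1"] * 4 + ["chromosome17"] * 3

encoded = run_length_encode(reference_column)
# Seven stored values collapse to two (value, count) pairs.
assert run_length_decode(encoded) == reference_column
```

The benefit depends on long runs of identical values, so it is largest when the reads are sorted such that neighboring records share the same reference.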
A simple example: k-mer counting using Spark and ADAM
Let's do “word count” for genomics: counting k-mers. The term k-mers refers to all the possible subsequences of length k for a read. For example, if you have a read with the sequence AGATCTGAAG, the 3-mers for that sequence would be ['AGA', 'GAT', 'ATC', 'TCT', 'CTG', 'TGA', 'GAA', 'AAG']. While this is a trivial example, k-mers are useful when building structures like De Bruijn graphs for sequence assembly. In this example, we are going to generate all the possible 21-mers from our reads, count them, and then write the totals to a text file.
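Before bringing in Spark, the logic itself can be sketched on a single machine. The following is not the Spark/ADAM program, only a plain-Python illustration of the same generate-then-count idea; `kmers` and `count_kmers` are hypothetical helper names.

```python
from collections import Counter


def kmers(sequence, k):
    """Return every substring of length k, in order of position."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def count_kmers(sequences, k):
    """Count k-mers across many reads, like word count over documents."""
    counts = Counter()
    for sequence in sequences:
        counts.update(kmers(sequence, k))
    return counts


if __name__ == "__main__":
    print(kmers("AGATCTGAAG", 3))
    # For real reads, k would be 21 as in the text; 3 keeps the toy output small.
    for kmer, total in sorted(count_kmers(["AGATCTGAAG", "AGATC"], 3).items()):
        print(f"{kmer}\t{total}")
```

A read of length n yields n - k + 1 k-mers, and a read shorter than k yields none. In a distributed setting the generation step maps naturally onto a flat-map over reads and the counting step onto a reduce by key.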