Table 23-2. Mandatory fields in the SAM format
Col  Field  Type    Regexp/Range              Brief description
1    QNAME  String  [!-?A-~]{1,255}           Query template NAME
2    FLAG   Int     [0,2^16-1]                bitwise FLAG
3    RNAME  String  \*|[!-()+-<>-~][!-~]*     Reference sequence NAME
4    POS    Int     [0,2^31-1]                1-based leftmost mapping POSition
5    MAPQ   Int     [0,2^8-1]                 MAPping Quality
6    CIGAR  String  \*|([0-9]+[MIDNSHPX=])+   CIGAR string
7    RNEXT  String  \*|=|[!-()+-<>-~][!-~]*   Ref. name of the mate/NEXT read
8    PNEXT  Int     [0,2^31-1]                Position of the mate/NEXT read
9    TLEN   Int     [-2^31+1,2^31-1]          observed Template LENgth
10   SEQ    String  \*|[A-Za-z=.]+            segment SEQuence
11   QUAL   String  [!-~]+                    ASCII of Phred-scaled base QUALity+33
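To illustrate what "implementing this specification" by hand involves, the following Python sketch checks the 11 mandatory fields of one tab-separated SAM line against the patterns and ranges in Table 23-2. It is not part of SAM tooling or ADAM: the names SAM_MANDATORY_FIELDS and validate_sam_line are hypothetical, the ranges are read as powers of two, and two patterns rest on assumptions marked in the comments.

```python
import re

INT32_MAX = 2**31 - 1

# (field name, kind, constraint): a regex for String fields,
# an inclusive (low, high) range for Int fields.
SAM_MANDATORY_FIELDS = [
    ("QNAME", "str", r"[!-?A-~]{1,255}"),
    ("FLAG", "int", (0, 2**16 - 1)),
    ("RNAME", "str", r"\*|[!-()+-<>-~][!-~]*"),
    ("POS", "int", (0, INT32_MAX)),
    ("MAPQ", "int", (0, 2**8 - 1)),
    ("CIGAR", "str", r"\*|([0-9]+[MIDNSHPX=])+"),
    # Assumption: RNEXT uses the same character class as RNAME.
    ("RNEXT", "str", r"\*|=|[!-()+-<>-~][!-~]*"),
    ("PNEXT", "int", (0, INT32_MAX)),
    ("TLEN", "int", (-INT32_MAX, INT32_MAX)),
    ("SEQ", "str", r"\*|[A-Za-z=.]+"),
    # Assumption: QUAL is one or more printable characters.
    ("QUAL", "str", r"[!-~]+"),
]


def validate_sam_line(line):
    """Return a list of problems in the mandatory fields (empty if none)."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(SAM_MANDATORY_FIELDS):
        return [f"expected at least 11 tab-separated fields, got {len(columns)}"]

    problems = []
    # zip() stops after the 11 mandatory columns; optional fields are ignored.
    for (name, kind, constraint), value in zip(SAM_MANDATORY_FIELDS, columns):
        if kind == "str":
            if re.fullmatch(constraint, value) is None:
                problems.append(f"{name}: {value!r} does not match {constraint}")
        else:
            low, high = constraint
            is_integer = re.fullmatch(r"-?[0-9]+", value) is not None
            if not is_integer or not low <= int(value) <= high:
                problems.append(f"{name}: {value!r} is not an integer in [{low}, {high}]")
    return problems


example = "r001\t99\tref\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\t*"
print(validate_sam_line(example))
```

Every language that reads SAM needs its own copy of a table like this one, which is the duplication a machine-readable schema is meant to avoid.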
Any developer who wants to implement this specification must translate the English
spec into their programming language of choice. In ADAM, we have chosen instead to use
literate programming with a spec defined in Avro IDL. For example, the mandatory fields
for SAM can be easily expressed in a simple Avro record:
record AlignmentRecord {
  string qname;
  int flag;
  string rname;
  int pos;
  int mapq;
  string cigar;
  string rnext;
  int pnext;
  int tlen;
  string seq;
  string qual;
}
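As a rough picture of what a class with these eleven fields looks like in application code, here is a hand-written Python sketch that mirrors the record and fills it from one tab-separated SAM line. It is not the output of Avro's code generator, and the parse_alignment helper and INT_FIELDS set are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class AlignmentRecord:
    """Hand-written mirror of the Avro record above (not Avro-generated)."""
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str


# Fields declared as `int` in the record; all others stay strings.
INT_FIELDS = {"flag", "pos", "mapq", "pnext", "tlen"}


def parse_alignment(line: str) -> AlignmentRecord:
    """Build an AlignmentRecord from the first 11 columns of a SAM line."""
    columns = line.rstrip("\n").split("\t")
    specs = fields(AlignmentRecord)
    if len(columns) < len(specs):
        raise ValueError(f"expected at least {len(specs)} fields, got {len(columns)}")

    values = {}
    # zip() stops after the mandatory columns; optional fields are ignored.
    for spec, column in zip(specs, columns):
        values[spec.name] = int(column) if spec.name in INT_FIELDS else column
    return AlignmentRecord(**values)


record = parse_alignment(
    "r001\t99\tref\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\t*"
)
print(record.qname, record.flag, record.pos, record.cigar)
```

Writing and maintaining this class by hand in every language is the work that generating code from a single schema removes.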
Avro is able to autogenerate native Java (or C++, Python, etc.) classes for reading and
writing data and provides standard interfaces (e.g., Hadoop's InputFormat) to make
integration with numerous systems easy. Avro is also designed to make schema evolution
easier. In fact, the ADAM schemas we use today have evolved to be more sophisticated,