Biomedical Engineering Reference
This chapter starts by describing the haplotype inference problem, with special
focus on the pure parsimony approach. An overview of the mathematical
models suggested for solving the problem follows. Next, the SAT-based haplotype
inference model and its extension to handle polyploid species are detailed. In
addition, the pseudo-Boolean optimization (PBO) model and its extension to deal
with data with missing sites are presented. Moreover, this chapter summarizes an
experimental evaluation involving a considerable number of maximum parsimony
haplotyping algorithms. Finally, standard preprocessing techniques commonly
used by haplotype inference by pure parsimony (HIPP) algorithms, which include
structural simplifications of genotype instances and the calculation of bounds,
are described.
7.2 Haplotype Inference
The genome constitutes the hereditary data of an organism and is encoded in
the DNA (deoxyribonucleic acid), which is specified by the sequence of bases
of nucleotides that represent the DNA structural units: A (adenine), C (cytosine),
T (thymine) and G (guanine).
The coding part of the genome is organized in DNA segments called genes. Each
gene encodes a specific protein. The variants of a single gene are named alleles.
Despite the considerable similarity between our genes, no two individuals have the
same genome. The human genome has roughly three billion nucleotides, but about
99.9% of them are the same for all human beings. On average, the base sequences
of two individuals differ in one of every 1,200 bases, but the variations are not
uniformly distributed along the DNA. Variations in the DNA define the differences
between human beings and, in particular, influence their susceptibility to diseases.
Consequently, a critical step in genetics is understanding the differences
between human beings. Single nucleotide polymorphisms (SNPs) correspond to
differences in a single position of the DNA where mutations have occurred and
that present a minor allele frequency equal to or greater than a given value
(e.g., 1%).
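The minor allele frequency criterion can be illustrated with a minimal sketch. It assumes that each site is given as a list of the bases observed across the sampled chromosomes and that the minor allele is the second most common one. The function names and the sample data are hypothetical and serve only as illustration.

```python
from collections import Counter


def minor_allele_frequency(alleles):
    """Return the frequency of the second most common allele at one site.

    `alleles` holds one observed base per sampled chromosome (assumed input
    format). A site with fewer than two distinct alleles has frequency 0.
    """
    counts = Counter(alleles)
    if len(counts) < 2:
        return 0.0
    # most_common(2) sorts by descending count; entry 1 is the minor allele.
    minor_count = counts.most_common(2)[1][1]
    return minor_count / len(alleles)


def is_snp(alleles, threshold=0.01):
    """Apply the frequency criterion; the 1% default mirrors the example value."""
    return minor_allele_frequency(alleles) >= threshold


# Hypothetical sample: 98 chromosomes carry A and 2 carry G at this site.
common_variant = ["A"] * 98 + ["G"] * 2
# Hypothetical sample: the variant appears on 1 of 200 chromosomes.
rare_variant = ["A"] * 199 + ["G"]

print(is_snp(common_variant))  # minor allele frequency 2/100, above 1%
print(is_snp(rare_variant))    # minor allele frequency 1/200, below 1%
```

With the 1% threshold, the first site qualifies as a SNP and the second does not, although both show a variation at a single position.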
SNPs that are close on the genome tend to be inherited together in blocks.
Hence, SNPs within a block are statistically associated, which is known as linkage
disequilibrium. These blocks of SNPs are known as haplotypes. Haplotypes are
therefore sequences of correlated SNPs (Fig. 7.1). Haplotype blocks exist because
the crossing-over phenomenon (exchange of genetic material between homologous
chromosomes during meiosis) does not occur randomly along the DNA, but is
rather concentrated in small regions called recombination hotspots. Recombination
does not occur in every hotspot at every generation. Consequently, individuals
within the same population tend to have large haplotype blocks in common.
Furthermore, due to the association of SNPs, it is often possible to identify a small
subset of SNPs that identifies the remaining SNPs within the haplotype (tagSNPs) [24].
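The statistical association between SNPs can be quantified in several ways. The sketch below uses two standard measures that the text itself does not introduce: the coefficient D and the squared correlation r². It assumes haplotypes are given as tuples of 0/1 values, with 1 marking the minor allele at a site. The function name and the sample haplotypes are hypothetical.

```python
def linkage_disequilibrium(haplotypes, i, j):
    """Return (D, r_squared) for sites i and j of 0/1-encoded haplotypes.

    D = p_ij - p_i * p_j, where p_i and p_j are the frequencies of allele 1
    at each site and p_ij is the frequency of haplotypes carrying both.
    """
    n = len(haplotypes)
    p_i = sum(h[i] for h in haplotypes) / n
    p_j = sum(h[j] for h in haplotypes) / n
    p_ij = sum(h[i] * h[j] for h in haplotypes) / n
    d = p_ij - p_i * p_j
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    # r^2 is undefined when a site shows no variation; report 0 in that case.
    r_squared = d * d / denom if denom > 0 else 0.0
    return d, r_squared


# Hypothetical sample: sites 0 and 1 always agree, site 2 varies independently.
sample = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]

print(linkage_disequilibrium(sample, 0, 1))  # (0.25, 1.0)
print(linkage_disequilibrium(sample, 0, 2))  # (0.0, 0.0)
```

In this sample, sites 0 and 1 have r² = 1, so knowing the allele at one determines the allele at the other. This is the intuition behind selecting one of them as a tag for the pair.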
The human genome is organized into 22 pairs of homologous non-sex chromosomes,
with one chromosome of each pair inherited from each parent. Due to technological
limitations, homologous chromosomes are not easy to sequence separately, and