Haplotype Inference Using Propositional Satisfiability - Mathematical Approaches to Polymer Sequence Analysis and Related Problems


This chapter starts by describing the haplotype inference problem, with special
focus on the pure parsimony approach (HIPP). An overview of the mathematical
models suggested for solving the problem follows. The SAT-based haplotype
inference model and its extension to handle polyploid species are then detailed,
as are the pseudo-Boolean optimization (PBO) model and its extension to deal
with data with missing sites. The chapter also summarizes an experimental
evaluation involving a considerable number of maximum parsimony haplotyping
algorithms. Finally, standard preprocessing techniques commonly used by HIPP
algorithms, which include structural simplifications on genotype instances and
calculation of bounds, are described.

7.2 Haplotype Inference

The genome constitutes the hereditary data of an organism and is encoded in
the DNA (deoxyribonucleic acid), which is specified by the sequence of bases
of the nucleotides that represent the DNA structural units: A (adenine),
C (cytosine), T (thymine) and G (guanine).

The coding part of the genome is organized in DNA segments called genes. Each
gene encodes a specific protein. The variants of a single gene are named alleles.
Despite the considerable similarity between our genes, no two individuals have the
same genome. The human genome has roughly three billion nucleotides, but about
99.9% of them are the same for all human beings. On average, the sequences of bases
of two individuals differ in one of every 1,200 bases, but the variations are not
uniformly distributed along the DNA. Variations in the DNA define the differences
between human beings and, in particular, influence their susceptibility to diseases.
Consequently, a critical step in genetics is the understanding of the differences
between human beings. Single nucleotide polymorphisms (SNPs) correspond to
differences in a single position of the DNA where mutations have occurred and
that present a minor allele frequency equal to or greater than a given value
(e.g., 1%).
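
The frequency criterion in the SNP definition can be illustrated with a short Python sketch. It assumes, purely for illustration, that the alleles observed at one position across a sample of chromosomes are given as a string of bases. The function names and the input format are hypothetical and do not come from the chapter.

```python
from collections import Counter


def minor_allele_frequency(alleles):
    """Frequency of the second most common allele at a single position.

    `alleles` holds one base per sampled chromosome (assumed input format).
    """
    counts = Counter(alleles)
    if len(counts) < 2:
        return 0.0  # only one allele observed: there is no minor allele
    # most_common(2) is sorted by descending count; entry 1 is the minor allele
    minor_count = counts.most_common(2)[1][1]
    return minor_count / len(alleles)


def is_snp(alleles, threshold=0.01):
    """True if the position meets the minor allele frequency threshold."""
    return minor_allele_frequency(alleles) >= threshold


# Illustrative data only: 5 of 100 chromosomes carry G, 1 of 1000 carries T.
site_a = "A" * 95 + "G" * 5
site_b = "C" * 999 + "T"
print(is_snp(site_a), is_snp(site_b))
```

With the 1% threshold from the text, the first position qualifies as a SNP, whereas the second, whose variant appears in only 0.1% of the sample, does not.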

SNPs which are close on the genome tend to be inherited together in blocks.
Hence, SNPs within a block are statistically associated, a phenomenon known as
linkage disequilibrium. These blocks of SNPs are known as haplotypes. Haplotypes are
therefore sequences of correlated SNPs (Fig. 7.1). Haplotype blocks exist because
the crossing-over phenomenon (exchange of genetic material between homologous
chromosomes during meiosis) does not occur randomly along the DNA, but is
rather concentrated into small regions called recombination hotspots. Recombination
does not occur in every hotspot at every generation. Consequently, individuals
within the same population tend to have large haplotype blocks in common.
Furthermore, due to the association of SNPs, it is often possible to identify a
small subset of SNPs (tagSNPs) which determine the remaining SNPs within the
haplotype [24].
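
The idea of tagSNPs can be made concrete with a small brute-force sketch. It is not the method of [24]. It simply assumes that the haplotypes of a block are given as equal-length strings over {0, 1} (an encoding chosen here for illustration) and searches for a smallest set of positions whose values distinguish every distinct haplotype, so that the remaining positions can be looked up from them. The function name `find_tag_snps` is hypothetical.

```python
from itertools import combinations


def find_tag_snps(haplotypes):
    """Smallest tuple of site indices that separates all distinct haplotypes.

    Assumes a non-empty list of equal-length 0/1 strings. Exhaustive search,
    so only suitable for small blocks.
    """
    distinct = sorted(set(haplotypes))
    n_sites = len(distinct[0])
    for size in range(n_sites + 1):
        for sites in combinations(range(n_sites), size):
            # Restrict every haplotype to the candidate sites
            projected = {tuple(h[i] for i in sites) for h in distinct}
            if len(projected) == len(distinct):
                return sites  # each haplotype has a unique signature


# Illustrative block: sites 0-2 always agree, and sites 3-4 always agree.
block = ["00000", "11100", "00011", "11111"]
tags = find_tag_snps(block)
print(tags)
```

In this toy block, one site from each correlated group is enough to tell the four haplotypes apart, which is the sense in which a few tagSNPs stand in for the whole block.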

The human genome is organized into 22 pairs of homologous non-sex chromosomes,
with one chromosome of each pair inherited from each parent. Due to technological
limitations, homologous chromosomes are not easy to sequence separately, and
