Haplotype Inference Using Propositional Satisfiability - Mathematical Approaches to Polymer Sequence Analysis and Related Problems

Biomedical Engineering Reference

In-Depth Information

complemented sites can also be discarded. Two sites are complemented if whenever

one site has value 1, the other site has value 0 and vice versa.

By keeping information of the genotypes and sites discarded, it is straightforward

to construct a solution for the original set of genotypes once discovered a solution

to the simplified set of genotypes.

Example 7.4. (Structural Simplifications) Consider the following set of four geno-

types each with five sites:

G Df 02122; 12021; 02122; 20201 g . Note that the first

genotype is equal to the third genotype. Hence, the third genotype can be discarded.

In addition, note that the second and the fourth sites are equal. Also the first and the

third sites are complemented. Hence, the third and the fourth sites can be removed.

Consequently, the simplified set of genotypes is a set with three genotypes and three

sites: simplified . G / Df 022; 121; 201 g .

7.5.2

Lower Bounds

The computation of tight bounds is a key issue in several solvers, e.g., SHIPs

and HAPAR. This section describes two algorithms used to compute lower bounds

which have been proposed with the SHIPs algorithm [ 29 , 31 ].

The first algorithm relies on the incompatibility between genotypes [ 29 ]. Recall

that two genotypes are incompatible if there exists a site for which the value of one

genotype is 0 and the other genotype is 1. Otherwise, the genotypes are compatible .

Clearly, incompatible genotypes cannot be explained by common haplotypes. The

first lower bound procedure consists in finding a clique of incompatible genotypes

in a graph where each vertex is a genotype, and there is an edge between two geno-

types if they are incompatible. If the clique has k genotypes, then the number of

haplotypes required is guaranteed to be equal or greater than 2k ,where is the

number of homozygous genotypes in the clique.

The clique lower bound previously described can be improved taking into ac-

count the structure of the genotypes [ 31 ]. Let G S denote the set of selected geno-

types. At the beginning, G S corresponds to the set of genotypes in the clique of

incompatible genotypes. The goal is to find heterozygous sites g ij in genotypes

which do not belong to G S , such that all genotypes g 2 G S compatible with g i

have the same homozygous value in that position. Define the propositional function

.i; k/ to be true if g i and g k are compatible. (Clearly, is a symmetric relation.)

Thus, the goal is to find a genotype g i …

G S such that there exists a position j

where g ij

2 and a value val

2f 0; 1 g

such that, for all g k 2

G S ,if.i; k/

then g kj D val, i.e.,

9 1 j m .g ij D 2

^9 val 2f 0;1 g 8 g k 2 G S ..i; k/ ) g kj D val//:

(7.19)

Then G S D G S [f g i g and the lower bound can be increased by one. The process is

iterated until there are no more genotypes g i … G S which satisfy ( 7.19 ).

The interest in integrating a lower bound in SHIPs is twofold. First, the number

of iterations of the algorithm is reduced. Second, and more important, the size of the

Mathematical Approaches to Polymer Sequence Analysis and Related Problems

Search WWH ::

Custom Search

Home