Haplotype Inference Using Propositional Satisfiability - Mathematical Approaches to Polymer Sequence Analysis and Related Problems

Biomedical Engineering Reference

In-Depth Information

If g kj ¤ 2 ^ g ij

D 2,thenR D .g kj , .p , a// and S

D t ij .

If g ij D 2 ^ g kj D 2,thenR D .p , q/ and S D .t ij , t kj /.

Moreover, in order to count the number of distinct haplotypes used, Boolean

variables u i

are defined such that u i

is assigned value 1 if haplotype h i

explaining

genotype g i is different from every haplotype which explain genotype g k with k<i.

The conditions on variables u i

are based on the conditions for the x i

variables,

x pa

^

ik ^ x pb

) u i :

(7.16)

ik

1 k<i

Finally, the cost function is given by

u i C u i

:

n X

min

(7.17)

i D 1

Theorem 7.2. (Space Complexity) Let n be the number of genotypes in

, m the

number of positions of each genotype and the number of heterozygous positions

of the instance. Then, the number of variables of the PBO model is

G

O .n 2 C / ,

O .n 2 C nm/ . In addition, the number of constraints of the

which corresponds to

O .n 2 m/ .

PBO model is

The RPoly model has been further improved, following the ideas used by SHIPs.

In particular, the utilization of lower bounds as a method for pruning the search

space shows to be very effective. The integration of a tight lower bound (Sect. 7.5.2 )

allows fixing the values of some variables u , and the clauses used for constraining

the value of u need not be generated. This allows the PBO solver to focus on the

remaining u variables, and the size of the generated PBO problem instances becomes

significantly smaller. For the more complex problem instances, the integration of

lower bound information reduces the size of the generated PBO instances by a factor

between 2 and 3, on average.

Finally, an additional improvement is the inclusion of cardinality constraints on

the x variables. Adding new constraints to a model is expected to prune the search

and therefore contributes to the efficiency of the solver.

7.4.4.1

Missing Data

Most often real genotype data contain a significant percentage of unknown data.

Even with modern automated DNA analysis techniques, generating data with miss-

ing alleles is not an uncommon situation [ 25 ].

One useful feature of the RPoly tool is to be able to deal with unspecified

genotype sites. Genotyping procedures often leave a percentage of missing geno-

type positions, and so haplotype inference tools need to be able to deal with missing

sites. RPoly can handle SNPs with unspecified values, inferring the values for the

missing sites and still guaranteeing a parsimonious solution.

Mathematical Approaches to Polymer Sequence Analysis and Related Problems

Search WWH ::

Custom Search

Home