An Evaluation of the MiDCoP Method for Imputing Allele Frequency in Genome Wide Association Studies - Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing


the estimation accuracy for missing SNPs, and concluded that the correlation between the actual and imputed allele frequencies is higher than 0.9925 in the study. In this article, we further evaluate the method by examining the performance of association tests on the case-control data of the Genotype Association Information Network (GAIN) schizophrenia study in the European American population. Section 2 gives a brief description of the MiDCoP method. Section 3 briefly describes the GAIN data and the general algorithm implemented for our evaluation. Section 4 summarizes the performance of the association statistics based on imputed and actual allele frequencies of SNPs using the GAIN data. Section 5 compares the association test results obtained with different reference data sets. Section 6 gives a brief summary and conclusion.

2 The MiDCoP Method

The idea behind the Minimum Deviation of Conditional Probability (MiDCoP) method is to impute the allele frequencies of untyped SNPs in the study sample by utilizing the allele frequencies of neighboring SNPs and haplotype frequencies from an external reference set such as the HapMap reference set (The International HapMap Project [11]). The best pair of neighboring SNPs is determined by maximizing a multilocus information score (MIS). Gautam [10] proposed five different MISs. In this article, we adopt the best MIS recommended in [10], namely the Mutual Information Ratio (MIR, [12]). The MiDCoP algorithm derived by Gautam [10] consists of the following three steps:

1) SNP Selection: Identify a set of flanking SNPs in the neighborhood of the untyped SNP X that maximizes the MIR based on the reference set. Let $L = \{L_1, L_2, \ldots, L_u\}$ be the sequence of SNPs that are common to both the reference set and the sample set, lie in the neighborhood of X, and are in linkage disequilibrium with X. Our goal is to obtain a pair $\{A, B\} \subseteq L$ such that the MIR between {A, B} and {A, X, B} in the reference set is maximized for the fixed SNP X. Here, the SNPs A and B need not be in sequential order with respect to their base-pair positions.

2) Haplotype Frequency Estimation: Once the optimal pair {A, B} is determined in step 1, this step estimates the haplotype frequencies for the pair {A, B} in the sample.

3) Allele Frequency Estimation: The allele frequency of the untyped SNP X in the sample is estimated as a weighted sum of the haplotype frequencies estimated in step 2.
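The pair search in step 1 and the weighted sum in step 3 can be illustrated with a minimal Python sketch. It is not the implementation of [10]. Several pieces are assumptions made only for illustration: the `score` callable is a hypothetical stand-in for the MIR, the weights are taken to be the conditional probabilities P(X = 1 | A, B) estimated from phased reference haplotypes, and the SNP labels and all numbers are toy values.

```python
from collections import Counter
from itertools import combinations


def select_best_pair(candidates, score):
    """Step 1 (sketch): exhaustive search over unordered pairs of flanking SNPs.

    `score(a, b)` is a hypothetical callable standing in for the MIR.
    """
    return max(combinations(candidates, 2), key=lambda pair: score(*pair))


def conditional_weights(reference_haplotypes):
    """Assumed weights: P(X = 1 | A = a, B = b) in the reference set.

    `reference_haplotypes` is an iterable of (a, x, b) allele triples coded 0/1.
    """
    totals = Counter()
    carriers = Counter()
    for a, x, b in reference_haplotypes:
        totals[(a, b)] += 1
        carriers[(a, b)] += x
    return {ab: carriers[ab] / totals[ab] for ab in totals}


def impute_allele_frequency(sample_pair_freqs, weights):
    """Step 3 (sketch): weighted sum of the sample haplotype frequencies of {A, B}."""
    return sum(freq * weights.get(ab, 0.0) for ab, freq in sample_pair_freqs.items())


# Toy pair scores (hypothetical values, not real MIR scores).
toy_scores = {("snp1", "snp2"): 0.2, ("snp1", "snp3"): 0.7, ("snp2", "snp3"): 0.4}
best = select_best_pair(["snp1", "snp2", "snp3"], lambda a, b: toy_scores[(a, b)])

# Toy phased reference haplotypes over (A, X, B).
reference = (
    [(0, 0, 0)] * 3 + [(0, 1, 0)]      # P(X=1 | A=0, B=0) = 0.25
    + [(1, 1, 1)] * 4                  # P(X=1 | A=1, B=1) = 1.0
    + [(0, 1, 1), (0, 0, 1)]           # P(X=1 | A=0, B=1) = 0.5
)
weights = conditional_weights(reference)

# Toy sample haplotype frequencies for the pair {A, B}, as produced by step 2.
sample_freqs = {(0, 0): 0.5, (1, 1): 0.25, (0, 1): 0.25}
freq_x = impute_allele_frequency(sample_freqs, weights)
print(best, freq_x)
```

In this toy example the imputed frequency is 0.5 × 0.25 + 0.25 × 1.0 + 0.25 × 0.5 = 0.5. Haplotypes of {A, B} that never appear in the reference receive weight 0 here, which is a simplification of the sketch.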

The Mutual Information Ratio (MIR) is defined as follows. Let $S = \{S_1, S_2, \ldots, S_n\}$ and $T = \{T_1, T_2, \ldots, T_m\}$ be two disjoint sets of n and m (bi-allelic) SNPs with the population haplotype frequencies given by the vectors $\theta = (\theta_1, \theta_2, \ldots, \theta_s)$ and $\phi = (\phi_1, \ldots, \phi_t)$, respectively. The unknown parameters
