matched reference set with the case and control data, respectively, we create a
reference set in which 2/3 of the individuals overlap with the study sample, as a
more realistic situation. We could have chosen an overlap of 3/4 or 4/5; however,
2/3 seems more realistic and practical given the real data we have observed. The
data set and the procedure for creating the reference sets are described below.
The phased individual-level genotype data of 1,150 cases and 1,378 controls
from the GAIN Schizophrenia study are used. The region from 50 Mb to 60 Mb of
Chromosome 2 is selected for the study. We choose Chromosome 2 because the
overall performance of association tests on it is the third worst among all 22
chromosomes (better only than Chromosomes 9 and 22), and because the number of
SNPs available for our study is the second largest among all chromosomes
(second only to Chromosome 1).
Two random samples of equal size are generated from the case data. Two
different scenarios for generating the random samples are studied.
Scenario One: There is no individual overlap between the two random
samples.
Scenario Two: Exactly 2/3 of the individuals overlap between the two random
samples.
In the first scenario, each cohort is randomly split into two halves so that there is
no overlap between the two subsets. In the second scenario, the two subsets are
generated in such a way that a randomly selected 50% of the individuals in the
original data belong to both subsets, and the remaining 50% are split randomly and
evenly between the two subsets. Each subset therefore contains 75% of the original
individuals, 2/3 of whom are shared with the other subset. We use one of the
subsets as the reference set and the other as the study sample.
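The two splitting schemes can be sketched as follows. This is only an illustrative sketch, not the authors' code: the function name split_cohort, its parameters, and the use of integer identifiers for individuals are assumptions made for the example.

```python
import random

def split_cohort(ids, overlap=False, seed=None):
    """Split a cohort into (reference, study) subsets of equal size.

    overlap=False: Scenario One, two disjoint halves.
    overlap=True:  Scenario Two, half of the cohort is placed in both
                   subsets and the other half is divided evenly, so that
                   2/3 of each subset is shared with the other.
    """
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    n = len(shuffled)

    if not overlap:
        half = n // 2
        return shuffled[:half], shuffled[half:2 * half]

    n_shared = n // 2
    shared = shuffled[:n_shared]      # individuals present in both subsets
    rest = shuffled[n_shared:]        # individuals assigned to one subset only
    quarter = len(rest) // 2
    reference = shared + rest[:quarter]
    study = shared + rest[quarter:2 * quarter]
    return reference, study

# Hypothetical cohort of 1,200 individuals labelled 0..1199.
reference, study = split_cohort(range(1200), overlap=True, seed=42)
shared_fraction = len(set(reference) & set(study)) / len(study)
```

With an odd cohort size, integer division drops at most one individual, so the shared fraction is only approximately 2/3.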
To imitate the imputation process of MiDCoP, we first compute the allele
frequencies for the SNPs in the study sample. Then the allele frequency of each
SNP in the study sample is imputed using the MiDCoP approach, treating it as
missing. This process is carried out for both the case and the control
data. The p-values of the Chi-square tests based on the actual and the imputed
allele frequencies are compared using linear regression for SNPs in different
pairwise LD groups. The process of selecting the subsets and computing the
association tests is repeated fifty times (the number of simulations) under both
scenarios. The coefficient of determination (R²), intercept, and slope of the linear
fit are obtained from each of the 50 simulations. Summary statistics (mean,
standard deviation, maximum, and minimum) of these regression statistics are
computed under both scenarios. Table 2 and Table 3 summarize the results for the
first and second scenarios, respectively.
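The evaluation step can be sketched as below. The MiDCoP imputation itself is not shown. Several details are assumptions, because the text does not specify them: the Chi-square test is taken to be a 1-df Pearson test on a 2x2 allele-count table without continuity correction, the regression is run on raw p-values, and all function names are hypothetical.

```python
import math
import statistics

def allelic_chi2_pvalue(freq_case, freq_control, n_case, n_control):
    """P-value of a 1-df Pearson Chi-square test on a 2x2 allele-count table.

    Allele counts are rebuilt from allele frequencies, assuming two
    chromosomes per individual and no continuity correction.
    """
    a = 2 * n_case * freq_case              # test allele, cases
    b = 2 * n_case * (1 - freq_case)        # other allele, cases
    c = 2 * n_control * freq_control        # test allele, controls
    d = 2 * n_control * (1 - freq_control)  # other allele, controls
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0  # degenerate table: no evidence of association
    chi2 = (a + b + c + d) * (a * d - b * c) ** 2 / denom
    # Upper-tail probability of a Chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))

def linear_fit(x, y):
    """Least-squares fit y = intercept + slope * x.

    Returns (slope, intercept, r2); assumes x is not constant.
    """
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    r2 = sxy ** 2 / (sxx * syy) if syy > 0 else float("nan")
    return slope, intercept, r2

def summarize(values):
    """Mean, sample standard deviation, maximum, and minimum of a statistic."""
    return {
        "mean": statistics.fmean(values),
        "sd": statistics.stdev(values),
        "max": max(values),
        "min": min(values),
    }
```

In each simulation, one would compute a p-value per SNP from the actual and from the imputed frequencies, call linear_fit on the two p-value lists for each LD group, and pass the 50 resulting values of each regression statistic to summarize. Which set of p-values serves as the predictor is not stated in the text and is left to the caller.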
A comparison of the results in Table 2 and Table 3 suggests that matching the
sample with the reference set leads to higher imputation accuracy. Note that
the conditional probability, P(X|A-B), used in the MiDCoP approach to impute the
allele frequency is, in both scenarios, computed from different reference sets for
case and control.