Biomedical Engineering Reference
status in the subjects, and we may wish to control for this effect before testing for
genetic effects.
An alternative analysis method that allows covariates to be included would be logistic
regression. Genotype status at each marker is coded as an ordinal variable representing
the number of copies of a fixed allele. The non-genetic base model then predicts
case status as a function of the k covariates (e.g. site in the example stated above):
$$\ln[P/(1-P)] = \alpha + \beta_1 x_1 + \dots + \beta_k x_k,$$
where $P$ is the probability of being a case, and $x_i$, $i = 1, \dots, k$, are the
covariate variables included. This base model is compared with a full model
$$\ln[P/(1-P)] = \alpha + \beta_1 x_1 + \dots + \beta_k x_k + \beta g,$$
where $g$ is genotype at the
analyzed marker, coded ordinally (that is, coded as the number of copies of a fixed allele,
often the minor allele in the overall sample). The likelihood ratio chi-square statistic may
be computed to obtain the p-value for the effect of the genetic marker. Note that both the
base model and the full model must be fitted to the same set of data; therefore, if
there are individuals having missing genotype at this marker or missing covariate variable
values, then they must be removed from the data prior to the logistic regression analysis.
Other codings of the genotype may be tested similarly. Protocol 4.2 describes an analysis
based on logistic regression and the likelihood ratio chi-square. The user-friendly software
package PLINK [33] also implements logistic-regression-based tests of association.
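The comparison of base and full models can be sketched in a few lines of Python. This is a minimal illustration only, not the procedure of Protocol 4.2 or PLINK: the helper names (solve, fit_logistic, lrt_marker) and the toy counts are hypothetical, the models are fitted by a plain Newton-Raphson iteration, and the p-value uses the fact that, for the ordinal coding, the full model adds one parameter, so the likelihood ratio statistic is referred to a chi-square distribution with 1 degree of freedom.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][j] * x[j] for j in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def fit_logistic(rows, y, max_iter=50, tol=1e-10):
    """Fit ln[P/(1-P)] = alpha + sum(beta_i * x_i) by Newton-Raphson.
    Returns (coefficients with the intercept first, maximized log-likelihood)."""
    X = [[1.0] + [float(v) for v in row] for row in rows]
    m = len(X[0])
    beta = [0.0] * m

    def prob(row):
        return 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, row))))

    for _ in range(max_iter):
        p = [prob(row) for row in X]
        grad = [sum((yi - pi) * row[j] for row, yi, pi in zip(X, y, p))
                for j in range(m)]
        hess = [[sum(pi * (1.0 - pi) * row[j] * row[k] for row, pi in zip(X, p))
                 for k in range(m)] for j in range(m)]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    loglik = sum(math.log(prob(row)) if yi else math.log(1.0 - prob(row))
                 for row, yi in zip(X, y))
    return beta, loglik

def lrt_marker(records):
    """records: (case, genotype, covariates) tuples; None marks a missing value.
    Incomplete individuals are dropped so both models use the same data."""
    complete = [(c, g, x) for c, g, x in records
                if g is not None and None not in x]
    y = [c for c, _, _ in complete]
    _, ll_base = fit_logistic([list(x) for _, _, x in complete], y)
    coef, ll_full = fit_logistic([list(x) + [g] for _, g, x in complete], y)
    stat = 2.0 * (ll_full - ll_base)                       # likelihood ratio chi-square
    p_value = math.erfc(math.sqrt(max(stat, 0.0) / 2.0))   # chi-square tail, 1 df
    return stat, p_value, coef[-1], len(complete)

# Invented toy data: (site, genotype) -> (controls, cases)
counts = {
    (0, 0): (20, 10), (0, 1): (12, 12), (0, 2): (4, 8),
    (1, 0): (15, 15), (1, 1): (8, 14), (1, 2): (2, 6),
}
records = []
for (site, g), (n_ctrl, n_case) in counts.items():
    records += [(0, g, (site,))] * n_ctrl + [(1, g, (site,))] * n_case
records += [(1, None, (0,)), (0, 1, (None,))]  # incomplete rows, removed before fitting

stat, p_value, beta_g, n_used = lrt_marker(records)
print(f"n = {n_used}, LR chi-square = {stat:.3f}, p = {p_value:.4g}, beta_g = {beta_g:.3f}")
```

Filtering incomplete individuals once, before either model is fitted, is what guarantees that the two log-likelihoods are comparable. For a genotype coding that adds more than one parameter, the 1-df tail formula used here would no longer apply.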
Multilocus association analysis
The first-pass analysis of a large-scale case-control association study is usually single-SNP
analysis of the kind described in the previous section. Additional information can be gained
by multilocus analysis, looking at groups of markers in various ways. However, there is
often a penalty to be paid in the form of correction for the number of tests (see below).
For example, given 100 000 genotyped, bi-allelic SNPs, pairwise analysis of all possible
two-way interactions of (not necessarily contiguous) loci leads to
(100 000 × 99 999)/2 = 4 999 950 000 tests.
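The count above is simply the number of unordered pairs of SNPs, and it can be checked directly. The short sketch below also shows how severe a per-test threshold becomes under a Bonferroni correction; the choice of Bonferroni and of a 0.05 family-wise error rate are assumptions made here for illustration, not a method prescribed by the text.

```python
import math

n_snps = 100_000
n_pair_tests = math.comb(n_snps, 2)   # n(n-1)/2 unordered pairs of loci

alpha = 0.05                          # assumed family-wise error rate
bonferroni_threshold = alpha / n_pair_tests

print(n_pair_tests)                   # 4999950000
print(f"{bonferroni_threshold:.3e}")  # per-test significance threshold
```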
One way to group multiple SNP loci for analysis is to consider haplotypes; that is,
the combination of alleles occurring at linked loci along a chromosomal strand. Haplotype
analyses have some biological justification. A commonly cited motivation for haplotype
analysis is that a disease-causing but un-genotyped variant may lie on a particular haplotype
background, and analysis of that haplotype will reveal the association. Alternatively, it may
be that the combined allelic state across the haplotype is biologically functional and causes
the disease. However, when testing haplotypes, the number of tests, and the degrees of
freedom for a given test, can rapidly increase. A typical approach is to use a 'sliding
window' of N SNPs for a range of values of N. For each fixed window, the investigator
must decide whether to include all observed haplotypes in the analysis, or to ignore or
pool together rare haplotypes. An analysis of H haplotype categories may then be carried
out in various ways, from a traditional chi-square or exact test of the resulting 2 × H table
of case-control status and estimated haplotype frequencies, to haplotype trend regression
analysis [44] and analysis of weighted haplotype frequency differences [45].
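The simplest of these options, a chi-square test on the 2 × H table after pooling rare haplotypes, might look like the following sketch. The function names, the haplotype labels, the counts, and the 5% pooling threshold are all hypothetical. The counts are treated as given, whereas in practice haplotype frequencies are usually estimated from unphased genotypes. The sketch returns the Pearson statistic and its degrees of freedom (H - 1) and leaves the p-value to a chi-square distribution function from a statistics library.

```python
def sliding_windows(snps, n):
    """Return every run of n consecutive SNPs."""
    return [snps[i:i + n] for i in range(len(snps) - n + 1)]

def pool_rare(table, min_freq):
    """table maps haplotype -> (case_count, control_count).
    Haplotypes whose overall frequency is below min_freq are merged
    into a single 'rare' category."""
    total = sum(a + b for a, b in table.values())
    pooled = {}
    rare_cases = rare_controls = 0
    for hap, (a, b) in table.items():
        if (a + b) / total < min_freq:
            rare_cases += a
            rare_controls += b
        else:
            pooled[hap] = (a, b)
    if rare_cases + rare_controls > 0:
        pooled["rare"] = (rare_cases, rare_controls)
    return pooled

def chi_square_2xH(table):
    """Pearson chi-square statistic and degrees of freedom for a 2 x H table."""
    cols = list(table.values())
    n_cases = sum(a for a, _ in cols)
    n_controls = sum(b for _, b in cols)
    n = n_cases + n_controls
    stat = 0.0
    for a, b in cols:
        col_total = a + b
        for observed, row_total in ((a, n_cases), (b, n_controls)):
            expected = row_total * col_total / n
            stat += (observed - expected) ** 2 / expected
    return stat, len(cols) - 1

# Invented haplotype counts for one window: haplotype -> (cases, controls)
window_table = {"AC": (120, 90), "AT": (60, 80), "GC": (15, 25),
                "GT": (3, 2), "GA": (2, 3)}
pooled = pool_rare(window_table, min_freq=0.05)
stat, df = chi_square_2xH(pooled)
print(sorted(pooled), f"chi-square = {stat:.3f}, df = {df}")

# Windows of N = 3 over a hypothetical ordered list of SNP identifiers
print(sliding_windows(["snp1", "snp2", "snp3", "snp4"], 3))
```

Pooling reduces H, and with it the degrees of freedom, at the cost of hiding any effect carried by an individual rare haplotype. Repeating the test over every window and every value of N multiplies the number of tests accordingly.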
Haplotype analysis is often used to stand in for direct analysis of an untyped locus
for which the susceptibility allele lies on a specific haplotype background. An alternative
analysis approach with a similar intent is genetic imputation, which infers 'missing' genotypes
at untyped loci. Imputation typically relies on LD data between typed and untyped loci,
available from a 'reference population' (e.g. HapMap). These 'in silico' genotypes
can then be tested for association with phenotype. The Wellcome Trust GWAS of multiple