Linkage disequilibrium and whole-genome association studies (Genomics)

1. Introduction

Complex diseases are those that involve multiple genetic loci as well as environmental or lifestyle effects (see Article 57, Genetics of complex diseases: lessons from type 2 diabetes, Volume 2). Such diseases often affect a substantial proportion of the population. Uncovering the genetic components of such diseases is a current challenge for human genetics.

2. Complex diseases and association studies

As an example, let us consider the situation for one complex disease, breast cancer. It has been estimated that less than 2% of all breast cancer cases are caused by rare inherited mutations in two genes called BRCA1 and BRCA2 and that these genes account for only 20% of the excess familial risk of the disease (Anglian Breast Cancer Study Group, 2000). Clearly, other breast cancer susceptibility genes remain to be identified, especially for the sporadic form of the disease that is rarely the result of mutations in the BRCA genes (Gayther et al., 1998). Segregation analysis suggests that there could be many different breast cancer loci, each contributing a small effect (Antoniou et al., 2001).

The association study is a recent method that can be used to identify complex disease genes (for review see Cardon and Bell, 2001). This method compares the frequency of genetic variants in unrelated cases (who have a given disease) and controls (who are free of disease) to identify variants or regions that are putatively involved in disease etiology. If such variants are associated with disease, further characterization is necessary to demonstrate a causal role in the disease process. Association studies can use a candidate gene approach to investigate polygenic disorders; the choice of candidates is based on previous biological and/or genetic insights into that disease. BRCA1 and BRCA2, for example, have roles in mammalian DNA double-strand break (DSB) repair, and this has provided a rationale for breast cancer association studies that have focused on numerous members of this molecular pathway. Variants and/or haplotypes of BRCA2, XRCC2, XRCC3, and Ligase 4 have all been associated with modest risks of breast cancer and population-attributable risks (the proportion of a population’s breast cancer that is due to a particular genetic variant) of up to 2% (Healey et al., 2000; Kushel et al., 2002). Although candidate gene approaches have successfully identified low-risk variants for breast cancer susceptibility, this approach is only just beginning to address the genetics of this complex disease.
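The case/control comparison described above can be sketched as a test on a 2 x 2 table of allele counts. The example below is a minimal illustration, not the analysis used in any of the cited studies: the function name and the allele counts are hypothetical, and a simple Pearson chi-square test with one degree of freedom (no continuity correction) is assumed.

```python
import math


def allelic_association(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Odds ratio and 1-df chi-square test on a 2x2 table of allele counts."""
    a, b, c, d = case_alt, case_ref, ctrl_alt, ctrl_ref
    n = a + b + c + d
    # Odds of carrying the variant allele in cases relative to controls.
    odds_ratio = (a * d) / (b * c)
    # Pearson chi-square statistic for a 2x2 table (no continuity correction).
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return odds_ratio, chi2, p_value


# Hypothetical counts: 1000 cases and 1000 controls, two alleles per person.
or_, chi2, p = allelic_association(460, 1540, 400, 1600)
print(f"OR = {or_:.2f}, chi2 = {chi2:.2f}, p = {p:.3g}")
```

A nominally significant result of this kind only flags a variant as a candidate; as noted above, further characterization is needed before a causal role can be claimed.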


3. Whole-genome association and linkage disequilibrium

Whole-genome association studies, in contrast to candidate gene-based studies, do not require existing knowledge of the relevance of specific genes, pathways, or biological hypotheses in order to identify the genetic determinants of disease. Whole-genome scans use genetic markers of at least moderate allele frequency distributed across the genome. There are currently 5 million validated SNPs distributed across the 3 billion base pairs of human sequence (single-nucleotide polymorphism database, dbSNP build 123, www.ncbi.nlm.nih.gov/SNP), and directed resequencing efforts show that more SNPs exist than are currently in the public domain (Carlson et al., 2003). Is it necessary to include all of this genetic variation in a whole-genome scan for genetic association? Fortunately, the genetic phenomenon of linkage disequilibrium (LD) reduces the number of variants necessary for a whole-genome association study. LD, otherwise known as nonrandom association of alleles, can be used to correlate genetic variation with phenotypic traits (see Article 73, Creating LD maps of the genome, Volume 4). LD between alleles of physically linked markers is an indication of their recombination history in the population, and can be affected by numerous contributing factors such as recombination rate, mutation age, genetic drift, ethnic diversity and natural selection. LD can vary significantly within and between different populations; in particular, Europeans show greater LD than African populations (for review, see Ardlie et al., 2002). Furthermore, LD varies between and across whole chromosomes (Reich et al., 2001; Patil et al., 2001; Dawson et al., 2002). Many studies suggest that the human genome is organized into haplotype blocks that show high LD, interspersed with shorter regions of high recombination and consequently low LD (Gabriel et al., 2002; Ardlie et al., 2002 and references therein).
Certainly, chromosomes 21 and 22 both show this blocklike LD structure (Patil et al., 2001; Dawson et al., 2002). Common haplotypes can represent most of the genetic variation across relatively large regions of the genome. These haplotypes (including the known and unknown variation) can be genotyped by using a small number of “haplotype tagging” SNPs (htSNPs) that suffice to specify all reasonably common haplotypes in the population of interest (see Article 12, Haplotype mapping, Volume 3). Thus, LD, in the form of haplotypes, can be used to reduce the number of SNPs needed to genotype a particular genomic region or the entire genome. An international collaborative effort, the HapMap project, is underway to determine the size and boundaries of the human haplotype blocks. This project is now midway through the typing of 600 000 SNPs (on average 1 every 5000 bp) in each of the three populations (International HapMap Consortium, 2003; Couzin, 2004). It has already become clear that this number of SNPs will be insufficient to produce a refined haplotype map representative of all populations (Gabriel et al., 2002).
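To make the idea of nonrandom association of alleles concrete, the sketch below computes three widely used pairwise LD measures (D, D' and r^2) for two biallelic loci. The function name and the example frequencies are hypothetical and are not drawn from any of the studies cited here; both loci are assumed to be polymorphic.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return D, D' and r^2 for two biallelic loci.

    p_ab -- frequency of the haplotype carrying allele A and allele B
    p_a  -- frequency of allele A at the first locus
    p_b  -- frequency of allele B at the second locus
    Assumes 0 < p_a < 1 and 0 < p_b < 1 (both loci polymorphic).
    """
    # D is the excess of the AB haplotype over its expectation
    # under independence of the two loci.
    d = p_ab - p_a * p_b
    # D' rescales D by the largest magnitude it could take
    # given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0
    # r^2 is the squared correlation between alleles at the two loci.
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2


# Hypothetical example: allele B is found only on haplotypes carrying allele A.
d, d_prime, r2 = ld_measures(p_ab=0.4, p_a=0.5, p_b=0.4)
print(f"D = {d:.3f}, D' = {d_prime:.3f}, r^2 = {r2:.3f}")
```

A pair of SNPs with r^2 close to 1 carries nearly the same information, which is the property that allows one SNP to stand in for, or tag, the other.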

The number of markers necessary to conduct a whole-genome scan for association will be a function of the average size of a haplotype block in the human genome and the number of markers necessary per block to specify all reasonably common haplotypes in populations of interest. Estimates of the number of markers required range from 100 000 to 1 million SNPs (Gabriel et al., 2002; International HapMap Consortium, 2003; Carlson et al., 2003). Until the completion of the HapMap, the best current prediction of the average size of an LD block for European populations is in the range of 10-30 kb; blocks in African populations are generally smaller (Gabriel et al., 2002; Ardlie et al., 2002 and references therein). The feasibility of whole-genome scanning for association will also depend upon the lower limit of odds ratio (OR) that is desirable to identify for a given disorder. Major genetic effects can be detected using smaller case/control groups; subtle effects such as a doubling of risk (OR = 2) require larger sample sizes. The sample size used for a study thus determines whether it will simply skim off the larger genetic effects, neglecting smaller ones, or whether it will be a more thorough assessment of the genome in terms of both major and subtle genetic risk factors. The optimal sample size required for a meaningful whole-genome scan is also impacted by statistical corrections required to adjust for multiple testing.
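The dependence of marker number on block size, and the effect of multiple testing, can be illustrated with a back-of-envelope calculation. The sketch below is only a rough guide: the number of tagging SNPs per block (1 or 3) is an assumption made for illustration, the genome is treated as if it were tiled by blocks of uniform size, and the Bonferroni correction is used as one common, conservative choice rather than as the method prescribed by the cited studies.

```python
GENOME_BP = 3_000_000_000  # approximate size of the human genome in base pairs


def markers_needed(block_size_bp, tags_per_block):
    """Markers needed if the genome is tiled by equal-sized haplotype blocks."""
    blocks = GENOME_BP / block_size_bp
    return round(blocks * tags_per_block)


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance threshold for a family-wise error rate of alpha."""
    return alpha / n_tests


# Block sizes from the 10-30 kb range quoted for European populations;
# the tags-per-block values are hypothetical.
for block_kb in (10, 30):
    for tags in (1, 3):
        n = markers_needed(block_kb * 1000, tags)
        print(f"{block_kb} kb blocks, {tags} tag(s) per block: {n:,} markers")

print(f"Per-test threshold for 500 000 markers: {bonferroni_threshold(500_000):.1e}")
```

Under these assumptions the estimates fall within the 100 000 to 1 million range quoted above, and the stringent per-test threshold shows why sample size requirements grow with the number of markers tested.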

4. Gene-environment interactions

A more comprehensive understanding of the causes of complex diseases will also depend on studies that incorporate gene-environment interaction. Such studies require accurate environmental and/or lifestyle data for the same individuals who are characterized genetically, and they therefore need even larger sample sizes than purely genetic studies. The sample sizes of association studies are limited by issues such as the cost of phenotypic characterization of cases and controls for a given disorder, which can vary greatly between diseases.

5. Technology and cost

For whole-genome scans for association, cost will be a key consideration. Let us assume that a comprehensive genome scan is likely to involve approximately 500 000 markers and that such an experiment will include at least 1000 samples. The number of genotypes required is hence on the order of 5 x 10^8 genotypes. If costs were only 1 cent per genotype (this has yet to be achieved routinely), the hypothetical genome scan above would cost $5 million to complete. This figure is unrealistic for all but the largest research groups and, therefore, the per-genotype costs would have to be reduced severalfold to fit into the budgets of most laboratories.
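The arithmetic behind these figures can be written out as a short sketch. The function names are hypothetical; the marker count, sample count and 1 cent price are the assumptions stated above, and the $1 million budget is the target figure used later in this section.

```python
def scan_cost(markers, samples, cost_per_genotype):
    """Total genotypes and total cost (in dollars) for individual genotyping."""
    genotypes = markers * samples
    return genotypes, genotypes * cost_per_genotype


def price_for_budget(markers, samples, budget):
    """Per-genotype price (in dollars) at which the scan fits the budget."""
    return budget / (markers * samples)


genotypes, cost = scan_cost(markers=500_000, samples=1_000, cost_per_genotype=0.01)
print(f"{genotypes:.1e} genotypes, total cost ${cost:,.0f}")

price = price_for_budget(markers=500_000, samples=1_000, budget=1_000_000)
print(f"Price needed for a $1 million scan: {price * 100:.1f} cents per genotype")
```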

The high cost of genome scanning could be decreased by two means: (1) the use of DNA pools rather than individual samples and (2) the use of very high orders of multiplexing or parallel genotyping of SNPs in individual DNA samples (see Article 77, Genotyping technology: the present and the future, Volume 4). DNA pooling involves the mixing of precisely equal quantities of individual DNAs to form, for example, “case” and “control” pools, followed by a genotyping procedure that can determine the allele frequency of each pool at each SNP tested. For analysis of DNA pools, the genotyping procedure used must be quantitative and as sensitive as possible. For complex diseases, it is likely that large numbers of genetic factors, many with subtle effects, will combine to produce disease susceptibility. Genotyping methods that are not sufficiently quantitative or sensitive to detect small differences in allele frequencies between pools would likely be inadequate to dissect many of the genetic factors underlying common complex diseases. To date, few methodologies have been shown, in peer-reviewed publications, to be truly quantitative. Two such methods are the MassARRAY system (Sequenom, Inc.) and pyrosequencing (Biotage AB), which have been shown to quantitatively measure differences in allele frequencies below 2% (Bansal et al., 2002; Herbon et al., 2003; Gruber et al., 2002). While pooling offers a reduction in the number of DNAs to be genotyped, some information is also lost, as differences within a pool can no longer be analyzed. In particular, it is less powerful than individual genotyping when known risk factors (such as smoking, age, sex) are being considered for each sample (Carlson et al., 2004). One way to counter this loss is to sort the samples into subpools on the basis of their respective risk factors. This will, however, increase the number of assays per marker, which is contrary to the rationale for pooling in the first instance (Carlson et al., 2004).
Working from our earlier assumption of 500 000 markers and 1000 samples, a genome scan involving two DNA pools, analyzed in triplicate, would need to have a pool-genotype cost of $0.33 or less to give a total cost of $1 million (corresponding to a fivefold overall reduction in cost compared to individual sample genotyping). In addition, such technologies would need to be accompanied by a ready set of assays corresponding to a suitably dense series of markers.
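The pooled-scan estimate follows from counting assays rather than individual genotypes. The sketch below reproduces that calculation under the assumptions stated in the text (500 000 markers, two pools, triplicate measurements, a $1 million budget); the function name is hypothetical, and the comparison assumes the 1 cent per genotype price discussed earlier for individual genotyping.

```python
def pooled_scan(markers, pools, replicates, budget):
    """Number of pool assays and the per-assay price that meets the budget."""
    assays = markers * pools * replicates
    return assays, budget / assays


assays, price_per_assay = pooled_scan(
    markers=500_000, pools=2, replicates=3, budget=1_000_000
)
print(f"{assays:,} pool assays, break-even price ${price_per_assay:.2f} per assay")

# Comparison with individual genotyping of 1000 samples at 1 cent per genotype.
individual_cost = 500_000 * 1_000 * 0.01
print(f"Reduction relative to individual genotyping: {individual_cost / 1_000_000:.0f}-fold")
```

Splitting the samples into subpools by risk factor would multiply the `pools` argument and lower the affordable price per assay accordingly.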

Other technology developers have focused on highly parallel or highly multiplexed genotyping of individual samples using techniques that need not be precisely quantitative, but are capable of reliably distinguishing heterozygotes. Such technologies include capillary electrophoresis-based methods such as the Applied Biosystems SNPlex™ system (48-plex), the Illumina BeadArray (1536-plex) and chip array-based methods such as those of ParAllele (10 000 nonsynonymous SNPs), Affymetrix (currently 100 000 SNPs per chip), or Perlegen (up to 1.5 million SNPs). Under our assumption of 500 000 SNPs and 1000 samples, the cost of genotyping individual samples using these methods would need to be approximately 0.2 cents per genotype to bring the cost of such a study to $1 million and be attractive to a wide variety of laboratories. The cost of several high-throughput methods is now on the order of cents per genotype (see Article 77, Genotyping technology: the present and the future, Volume 4). Until this is reduced further, however, whole-genome scans for association will remain in the domain of an exclusive few laboratories or companies with the resources to cover the current costs.

6. What is the current status of human genome scans?

Various corporate groups refer to unpublished genome scans for association. Academic groups, in contrast, have published several papers reporting “genome scans” for association using only a few thousand markers, a number that is clearly too small to be considered representative of the entire human genome. While both types of report represent some progress, it seems that genome scans for association at the present time remain unproven.

In the meantime, as the HapMap moves toward completion and commercial groups vie aggressively to produce faster and cheaper genotyping methods, academic researchers continue to carry out hypothesis-driven candidate gene studies, basing their intelligent guesses on the current understanding of human disease biology. In the future, comparing these results to those of whole-genome scans for association may tell us how much, or how little, we understand about our own genome.
