Mapping complex disease phenotypes (Genomics)

1. Introduction

Complex medical conditions such as obesity and diabetes, and psychiatric disorders such as depression and schizophrenia, are common disabling diseases. Their aetiology involves genetic factors, the environment, and their interaction, with genes typically explaining half or more of the variance. A complete model of causation would include all genetic and environmental factors, and their joint effect on risk of illness. Genetic mapping of complex disease phenotypes focuses on identifying just one component of a complex causal system, the role of genes (see Article 58, Concept of complex trait genetics, Volume 2).

The genetic factors involved in complex disease phenotypes are likely to be risk alleles that are relatively common in the population, but have a modest effect on risk, that is, they exert a weak genetic effect. This is in contrast to high-risk alleles, which have been found only rarely in complex disorders.

For example, more than 55 high-risk alleles for obesity have been identified in a total of six genes, but these are present in less than 2% of the obese population, and most are found in single families (Obesity gene map database; http://obesitygene.pbrc.edu/). Likewise, more than 150 rare high-risk alleles have been identified for Alzheimer's disease, but their population attributable fraction is only 5%. This is in contrast to the APOE ε4 allele, a common, modest risk factor with a population attributable fraction of 20%. Rare high-risk alleles have been very valuable in unraveling the underlying aetiological pathways in diseases such as obesity and Alzheimer's disease, but for other diseases such as depression and schizophrenia, such rare high-risk alleles have been elusive.


Common, modest risk alleles are likely to be substantially more important than rare high-risk alleles in efforts to improve our understanding of aetiology and the identification of novel treatments. However, different methods are required for their identification, as the tools used to identify high-risk alleles have not been successfully applied (Cardon and Bell, 2001; Risch, 2000).

2. Linkage analysis

2.1. Linkage in complex phenotypes

The most successful method for identifying genes for human disorders has been linkage analysis, which uses families with the disorder to search for the cosegregation, or sharing by affected members, of genetic markers (see Article 15, Linkage mapping, Volume 3). It relies solely on genetic analysis of the phenotype, avoiding the need for prior information on pathophysiology and the function of potential risk genes. Linkage analysis has been successful in identifying genetic loci for many human genetic diseases, and in animal models of disease. It has substantial power for rare, highly penetrant risk alleles such as those in single-gene disorders; however, power is reduced for complex diseases, where risk alleles have only a moderate effect on risk (meaning that allele sharing between affected subjects will be much less evident) and which are influenced by environmental factors. In addition, the classical parameters used for mapping Mendelian diseases, relating to the mode of inheritance (dominant, recessive), cannot be readily applied, as they are unknown for complex disorders (see Article 48, Parametric versus nonparametric and two-point versus multipoint: controversies in gene mapping, Volume 1). To overcome this, two approaches are typically taken to linkage in complex disorders: the nonparametric approach, in which parameters are abandoned and the analysis focuses on sharing of alleles between affected family members, usually sibling pairs; or retention of the likelihood method, with approximation of the parameters in a flexible framework (Sham, 1997).

Steps in a linkage study involve identifying the phenotype, which can be a dichotomous trait such as clinical diagnosis, or a quantitative trait such as body mass index (BMI) or neuroticism; identifying and collecting DNA from the family samples; selecting a genetic marker map and genotyping; and data cleaning and statistical analysis of the data to search for evidence of linkage.

Linkage analysis in complex disorders using dichotomous traits such as disease diagnoses is relatively straightforward. In its simplest form, it measures allele sharing by affected relatives using identity-by-state or identity-by-descent methods, and uses a test statistic to evaluate the significance of deviation from the null hypothesis. A typical linkage marker set consists of 500 or fewer genetic markers spaced throughout the genome, formed into a genetic map; the most commonly used map is the Marshfield map (http://research.marshfieldclinic.org/genetics/Default.htm), but many others are available.
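
As a concrete illustration of the allele-sharing idea, the sketch below tests whether the mean proportion of alleles shared identity-by-descent by affected sib pairs at a marker exceeds the null expectation of 0.5. This is a minimal example, not the method of any particular package, and the input format is hypothetical.

```python
import math

def mean_ibd_test(pi_hat):
    """One-sided test of mean IBD sharing against the null value of 0.5.

    pi_hat: estimated proportion of alleles shared IBD at the marker,
    one value per affected sib pair (0, 0.5, or 1 when sharing is known
    exactly; fractional when inferred from partially informative data).
    """
    n = len(pi_hat)
    mean = sum(pi_hat) / n
    # Under the null of no linkage, sharing is 0/0.5/1 with probabilities
    # 0.25/0.5/0.25, so pi_hat has mean 0.5 and variance 0.125.
    z = (mean - 0.5) / math.sqrt(0.125 / n)
    return mean, z

# Toy data: 200 affected sib pairs with excess sharing at the marker
sharing = [1.0] * 70 + [0.5] * 110 + [0.0] * 20
mean, z = mean_ibd_test(sharing)
print(f"mean IBD sharing = {mean:.3f}, z = {z:.2f}")  # 0.625, z = 5.00
```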

Various study designs can be used to increase power and reduce effort in linkage analyses. These apply particularly to quantitative traits. For example, sibships with the greatest expected contributions to a true LOD peak can be selected for linkage analysis by selecting extreme discordant or concordant sib pairs for the trait under study (Purcell et al., 2001; Carey and Williamson, 1991; Risch and Zhang, 1995). This approach may be very powerful and efficient when mapping quantitative traits in a large population, as genotyping will only be needed in a subset (Nash et al., 2004).

2.2. Data cleaning

After genotyping, data cleaning is used to check the integrity of the data; this is particularly important for testing family structures. For example, sibling pairs may in fact be half-siblings or even unrelated, and the presence of unspecified monozygotic twins could inflate any linkage statistics. Programs such as GRR (Graphical Representation of Relationships; Abecasis et al., 2001) can be used to locate incorrect family relationships via a scatter plot of the mean against the variance of the number of alleles shared identity-by-state across the typed markers, for all pairs of individuals in the sample. Programs such as PREST and ALTERTEST (Sun et al., 2002), which perform multiple tests of a number of possible relationships, can also be used to assess family relationships in linkage samples. Genotyping errors can be checked for by searching for Mendelian incompatibilities (e.g., with the program PEDSTATS) and for double recombinants, which are indicative of unlikely genotypes. Genotyping error rates are typically 1% or less.
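
The logic of such relationship checks can be sketched simply: for every pair of individuals, compute the mean and variance of IBS sharing across markers, and look for pairs that fall outside the cluster expected for their declared relationship. The genotype coding below (0/1/2 minor-allele counts at biallelic markers) is an assumption for illustration; real tools such as GRR add clustering and graphical display.

```python
import numpy as np

def ibs_profile(g1, g2):
    """Mean and variance of IBS sharing across markers for one pair.

    g1, g2: genotypes coded as minor-allele counts (0/1/2); for
    biallelic markers, IBS at a marker is 2 - |g1 - g2|.
    """
    ibs = 2 - np.abs(np.asarray(g1) - np.asarray(g2))
    return ibs.mean(), ibs.var()

# Unrelated pairs cluster at lower mean sharing; duplicate samples or
# unrecognized MZ twins stand out with mean 2 and variance 0.
rng = np.random.default_rng(0)
freqs = rng.uniform(0.1, 0.5, size=400)   # 400 markers
person1 = rng.binomial(2, freqs)
person2 = rng.binomial(2, freqs)
print("unrelated pair:", ibs_profile(person1, person2))
print("duplicate/MZ pair:", ibs_profile(person1, person1))
```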

2.3. Statistical analysis

Many programs are available for complex disease linkage analysis (http://linkage.rockefeller.edu/soft/) and can be used for categorical or quantitative traits. Programs such as GeneHunter (Kruglyak et al., 1996) or Merlin (Abecasis et al., 2002) take a collection of categorical (GeneHunter) or quantitative (GeneHunter or Merlin) trait values, genetic marker genotypes, a pedigree, and a marker map. These data are used to perform single-point and multipoint linkage analyses of pedigree data, including parametric, identity-by-descent (IBD), nonparametric, and variance components linkage analyses. The general output is a plot of QTL (quantitative trait locus) location versus LOD score (for parametric analysis) or Z-score (for nonparametric analysis).
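
For orientation, the parametric statistic has a simple closed form in the easiest case: for phase-known, fully informative meioses, LOD(theta) = log10[L(theta)/L(0.5)], maximized over the recombination fraction theta. The sketch below is this textbook two-point calculation, not the multipoint machinery of the packages above; the counts are invented.

```python
import math

def lod(recombinants, nonrecombinants, theta):
    """Two-point LOD score for phase-known, fully informative meioses:
    LOD(theta) = log10[ L(theta) / L(0.5) ]."""
    r, n = recombinants, recombinants + nonrecombinants
    log_l = r * math.log10(theta) + (n - r) * math.log10(1 - theta)
    return log_l - n * math.log10(0.5)

# Maximize over a grid of recombination fractions
counts = (2, 18)  # 2 recombinants in 20 meioses
best = max((lod(*counts, t), t) for t in [i / 100 for i in range(1, 50)])
print(f"max LOD = {best[0]:.2f} at theta = {best[1]:.2f}")  # 3.20 at 0.10
```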

Levels of significance can be denoted as a genome-wide p-value, that is, the probability that the observed value will be exceeded anywhere in the genome under the null hypothesis of no linkage. Criteria for linkage in complex disease are a little different from those for single-gene traits (Lander and Kruglyak, 1995; Sawcer et al., 1997), with an LOD score of about 3.3 being regarded as significant evidence of linkage. Lower LOD scores may still represent true positives, and an LOD score of 2 can be regarded as suggestive linkage. Linkage is regarded as confirmed when significant linkage observed in one study is supported by an LOD score or p-value that would be expected to occur by chance with probability 0.01 in a specific search of the candidate region. However, meta-analysis of linkage data is a more powerful approach for complex disease linkage analysis.
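
Under standard asymptotics, 2 ln(10) x LOD behaves as a 1-df chi-squared statistic (halved for the one-sided test), so LOD thresholds can be translated into approximate pointwise p-values; the sketch below reproduces the familiar values for the thresholds just mentioned.

```python
import math
from scipy.stats import chi2

def lod_to_pointwise_p(lod):
    """Approximate pointwise p-value for an LOD score: 2*ln(10)*LOD is
    asymptotically chi-squared (1 df); halved for the one-sided test."""
    return 0.5 * chi2.sf(2 * math.log(10) * lod, df=1)

for lod in (2.0, 3.3):  # 'suggestive' and 'significant' thresholds
    print(f"LOD {lod}: pointwise p ~ {lod_to_pointwise_p(lod):.1e}")
```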

2.4. Meta-analysis

Linkage approaches have been partially successful for complex diseases, and a large body of linkage data has been built up for most complex phenotypes.

In schizophrenia, for example, there have been at least 20 genome-wide scans (Lewis et al., 2003), and for BMI more than 30 (Obesity gene map database; http://obesitygene.pbrc.edu/), and these have led to the putative identification of genetic loci implicated in more than one study. However, although statistically significant findings have sometimes been supported by subsequent studies, there is a general lack of consistency for most phenotypes in complex disease linkage analysis. Indeed, some genome scans fail to find linkage at all (Altmuller et al., 2001). This could be because susceptibility is conferred by alleles at combinations of loci, each with a small effect on risk, and because the loci of greatest effect vary considerably in their impact between samples owing to geographic variation. False-positive findings are also likely to have arisen, and this could occur more than once for particular loci given the large number of genome scans (Lander and Kruglyak, 1995).

One major issue with linkage is statistical power. Because susceptibility loci for complex diseases are expected to have a small population-wide effect on susceptibility, it is difficult to detect their presence consistently without very large samples. For small genetic effect sizes, genome scans with low statistical power tend to overestimate the effect of the loci with the highest scores in the scan, that is, the genetic parameter estimates are maximized at the peak (Goring et al., 2001). Genome scans for complex diseases have typically been underpowered: up to 1000 sibling pairs would be required to reliably demonstrate locus-specific genetic effects causing an approximately 30% increase in risk to siblings (Hauser et al., 1996). Multiplicative genetic effects are even more difficult to detect (Rybicki and Elston, 2000).

One way to overcome the issue of power is to perform meta-analyses of genome scans, for which several strategies are available (Xu and Meyers, 1998; Gu et al., 2001; Zhang et al., 2001; Etzel and Guerra, 2002; Dempfle and Loesgen, 2004). The most robust approach would be to pool the raw data using the original genotypes from each study, construct a combined map of the markers, and perform new linkage analyses, which should find loci consistently (Guerra et al., 1999). In practice, this is not easily done, as raw genotype data may not be readily available or may be restricted by commercial confidentiality. In the absence of raw data, combining significance or effect estimates can provide an overall, but more limited, assessment of different linkage studies. Genuine meta-analyses combine statistics from different studies, and can be divided into those that combine significance tests (e.g., p-values from across studies) and those that combine effect estimates and test the significance of the common effects (Dempfle and Loesgen, 2004).

Combining significance tests can be performed using Fisher's method for p-values (Guerra et al., 1999); modifications are available to combine only p-values below a certain threshold (Zaykin et al., 2002; Olkin and Saner, 2001), to avoid bias when truncated LOD scores are used (Province, 2001; Wu et al., 2002), and to correct for multiple testing on the basis of the size of the implicated region (the multiple scan probability, MSP; Badner and Goldin, 1999).
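
Fisher's method itself is compact: X = -2 * sum(ln p_i) follows a chi-squared distribution with 2k degrees of freedom under the joint null. A minimal sketch, with invented p-values for one region across four scans:

```python
import math
from scipy.stats import chi2

def fishers_method(p_values):
    """Fisher's combined probability test: -2 * sum(ln p_i) is
    chi-squared with 2k df when all k null hypotheses are true."""
    x = -2.0 * sum(math.log(p) for p in p_values)
    return chi2.sf(x, df=2 * len(p_values))

print(f"combined p = {fishers_method([0.04, 0.20, 0.01, 0.09]):.4f}")
```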

Unlike the above methods, the Genome Scan Meta Analysis method (GSMA; Wise et al., 1999; Levinson et al., 2003) was specifically designed for meta-analysis of linkage data, and is a nonparametric rank method that relies on combining effect estimates and testing the significance of the common effects. It requires only placing markers within 30-cM bins and the rank ordering of each bin within and then across studies, allowing the consideration of any linkage test statistic and avoiding the need for the same set of markers to be used. However, GSMA provides no formal test of genetic heterogeneity, and the interpretation of genome-wide statistical significance currently relies on empirical grounds.
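
In outline, GSMA reduces to ranking and summing; the sketch below is a simplified illustration of that logic (not the published implementation), with invented per-bin statistics and a permutation null for genome-wide assessment.

```python
import numpy as np

def gsma(bin_stats, n_perm=10000, seed=1):
    """Simplified GSMA: rank each study's 30-cM bins by their linkage
    statistic, sum the ranks per bin across studies, and compare each
    summed rank with the permutation distribution of the maximum summed
    rank (ranks shuffled independently within each study).

    bin_stats: (n_studies, n_bins) array; any linkage statistic can be
    used, since only its within-study ranks enter the analysis."""
    stats = np.asarray(bin_stats, dtype=float)
    ranks = stats.argsort(axis=1).argsort(axis=1) + 1  # 1 = weakest bin
    summed = ranks.sum(axis=0)
    rng = np.random.default_rng(seed)
    null_max = np.array([
        sum(rng.permutation(row) for row in ranks).max()
        for _ in range(n_perm)
    ])
    p_genome = np.array([(null_max >= s).mean() for s in summed])
    return summed, p_genome

# Three invented scans over ten bins; bin 4 scores high in all three
scans = np.random.default_rng(0).normal(size=(3, 10))
scans[:, 4] += 2.5
summed, p = gsma(scans)
print("summed ranks:", summed)
print(f"bin 4 genome-wide p ~ {p[4]:.3f}")
```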

Other nonparametric meta-analyses combine IBD statistics, using the number of alleles shared across relative pairs (Gu et al., 1998; Gu et al., 1999). Sample size can also be accounted for (Goldstein et al., 1999). Methods for pooling of quantitative trait data are also available (Zhang et al., 2001), such as combining Haseman-Elston regression coefficients in a random effects model (Etzel and Guerra, 2002).

2.5. QTL mapping complex phenotypes in the mouse

Genetic mapping of complex traits in animals is attractive because of its statistical power and the simplicity of the genetics (Flint and Mott, 2001). Linkage analysis of a complex trait in a cross between two inbred strains of mice relies on the fact that there are only three genotypes at a given locus, since the parents are homozygous, meaning that, in effect, a test of association is being performed. This is more powerful than tests of linkage across human families, as the variance has a common basis across all the animals tested, and loci explaining as little as 5% of the variance can be detected.

Mouse genetic models for complex diseases are very powerful when used for the mapping of QTLs for traits such as anxiety, hypertension, and adiposity (Abiola et al., 2003). Typically, two inbred strains are crossed to form the F1 progeny, which are then intercrossed to generate an F2 generation or backcrossed to one of the parental strains. Since each progeny chromosome has undergone one meiosis, it will contain about one recombination per morgan on average, meaning that only 3-4 markers per chromosome need be typed for mapping. The genotype at any locus in the F2 must be homozygous for either parental allele, or heterozygous. For each marker locus, the trait mean is examined for each genotype group in the F2 progeny and tested for a statistically significant difference, as sketched below. Genotypes at markers close to a QTL will largely coincide with those at the (unobserved) QTL; consequently, the test at such a marker is almost equivalent to testing for differences at the QTL itself.
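
The sketch below is a minimal version of this single-marker test, applying one-way ANOVA to simulated F2 data; the genotype coding, effect size, and sample size are all invented for illustration.

```python
import numpy as np
from scipy.stats import f_oneway

# Simulated F2 intercross: the marker genotype is coded as 0/1/2 copies
# of one parental allele; a nearby QTL shifts the trait mean additively.
rng = np.random.default_rng(2)
genotype = rng.choice([0, 1, 2], p=[0.25, 0.5, 0.25], size=300)
trait = 10 + 0.5 * genotype + rng.normal(scale=1.0, size=300)

# Group the trait values by marker genotype and test the group means
groups = [trait[genotype == g] for g in (0, 1, 2)]
f_stat, p = f_oneway(*groups)
print(f"F = {f_stat:.1f}, p = {p:.2e}")
```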

Using various breeding designs (Silver, 1999), such as F2 crosses, recombinant inbred strains (Plomin et al., 1991; Williams et al., 2001), congenic strains, and chromosome substitution strains (Nadeau et al., 2000; Singer et al., 2004), more than 100 QTLs have been detected in mice, reflecting the apparently simple large-scale genetic structure of QTL effects (Flint and Mott, 2001). In theory, under such a simple architecture, fine-mapping to identify the underlying genetic variants should be possible, using large numbers of mice to generate recombinants around the QTL, despite the fact that the effect of each (~5% of the variance) is weak. However, from these 100+ mouse QTLs, only one actual gene underlying a QTL effect has been isolated (Yalcin et al., 2004), and in plants, only genes for moderate to major QTLs have been identified despite the use of thousands of crosses for mapping. The reason for this is the hidden complexity of QTLs: each locus detectable by mapping may correspond not to a single gene, but to a group of QTL "increaser" and "decreaser" alleles that lie within a cluster of genes covering a large (up to 30 cM) region (Darvasi and Soller, 1997; Legare et al., 2000; Flint and Mott, 2001). Furthermore, loci can interact synergistically (epistasis), an effect that cannot easily be detected by QTL methods. As a result of these factors, methods such as recombination mapping and the use of congenic strains may fail to identify the underlying QTL.

Methods to overcome this have focused on intermating strategies to break down linkage and increase mapping resolution, particularly those that use outbred stocks of mice to create advanced intercrosses (Talbot et al., 1999; Mott et al., 2000). Thus, a mapping resolution of less than 1 cM has been achieved for a QTL with genetically heterogeneous stock (HS) mice, in which each chromosome is a fine-grained mosaic of the eight founder chromosomes that make up the stock.

With the optimum approach, it is possible to perform fine-mapping to identify at least a group of candidate genes; the final problem, however, is identifying which gene harbors the QTL. Mapping studies are aimed at identifying DNA polymorphisms that alter the trait of interest, and a functional polymorphism can lie anywhere within or near a gene; for example, enhancers tens of kilobases away from the coding part of a gene are known, so the location of the QTL allele may not necessarily implicate a particular gene. Furthermore, there may be hundreds of neutral polymorphisms within the region of interest, and it is currently difficult and laborious to use bioinformatic and functional genomic analysis to tell which is a QTL allele and which is not. However, methods such as transgene complementation can help identify which gene is involved, if not which polymorphism (Flint and Mott, 2001; De Luca et al., 2003; Yalcin et al., 2004).

3. Fine-mapping in humans

While it has been possible to map genes that have a large phenotypic effect and can thus be localized by the use of recombinants, the reduced penetrance in complex diseases means that recombination events cannot be used to reliably map the position of susceptibility alleles. Statistical approaches based on analysis of recombinants are also not reliable because of the small numbers that occur within each family. Thus, in almost all cases, it has not proved possible to identify complex disease genes by linkage mapping alone.

After the initial genome scan, a linked region will typically be refined with additional microsatellite markers to extract any residual linkage information from the region. Although this may increase the LOD score of the region, it may only sharpen the linkage peak a little. Further efforts at refining linkage peaks, such as ordered subset analysis, may be used (Hauser et al., 2004). However, linkage will leave a candidate locus of as much as 10 cM, which will contain on average 80 genes.

3.1. Positional candidate genes: a mapping shortcut

If the region is reasonably well defined and the pathophysiology of the disease fairly clear, candidate genes may be selected on the basis of position and function, and evaluated directly for their contribution to the trait under test by association analysis. While this is a strong approach for diseases with specific tissue or cellular localizations and characteristic pathology, such as eye disease or diabetes, it has been less successful for most diseases, including psychiatric diseases such as schizophrenia or depression, where information on pathophysiology is poor.

Many techniques can be used to identify strong candidate genes from within linked regions. These include data mining techniques that take advantage of the increasing level of knowledge on gene function. For example, Perez-Iratxeta et al. (2002), Perez et al. (2004), and others use systematic annotation of genes with controlled vocabularies to develop a scoring system for the possible functional relationships of human genes to genetically inherited diseases, including complex diseases. The Gene Ontology Annotation (GOA) database (http://www.ebi.ac.uk/GOA) (Camon et al., 2004) provides high-quality electronic and manual annotations to the UniProt Knowledgebase (Swiss-Prot, TrEMBL, and PIR-PSD) using the standardized vocabulary of the Gene Ontology (GO; see Article 82, The Gene Ontology project, Volume 8), allowing functional assessment of many genes. The goal of the GO project is to produce a controlled vocabulary that can be applied to all organisms even as knowledge of gene and protein roles in cells is accumulating and changing (Ashburner et al., 2000; Lewis, 2005).

Other methods can also be used in the gene identification process, such as transcriptomics (reviewed in Farrall, 2004) and proteomics (Jaffe et al., 2004; see also Article 94, Expression and localization of proteins in mammalian cells, Volume 4). In disease mapping, these approaches can annotate gene databases with useful functional information, and can be used to attempt to identify changes in protein or mRNA levels, distribution, or function that implicate genes in the disease process. These approaches are largely unproven at present, as many such changes can be secondary to the disease process or confounded by factors such as medication.

4. Positional cloning by linkage disequilibrium: fine-mapping

Fine-mapping strategies focus on systematically searching for genetic markers within a linkage locus that are associated with the disease or trait in question, and can also be applied to genome-wide analysis (see below). In the human genome, there are more than 6 million common (minor allele frequency >0.1) SNPs in about 3.2 billion bp (Kruglyak and Nickerson, 2001), plus more than 500 000 VNTRs (variable number of tandem repeats). This translates to about 1 SNP every 500 bp and 1 VNTR every 6000 or so base pairs, equivalent to tens of thousands of potential disease-susceptibility polymorphisms in any complex disease linkage locus.

This high density of markers presents a problem for fine-mapping studies, as there are a large number of potential susceptibility alleles within any given locus. However, the existence of linkage disequilibrium (LD; also known as allelic association) means that these markers are not independent of each other, and it is possible to infer the location of a disease-susceptibility allele without actually genotyping it (Weiss and Clark, 2002; see also Article 17, Linkage disequilibrium and whole-genome association studies, Volume 3). Thus, if a particular genetic marker is in LD with a disease or trait susceptibility allele, the marker will also be in LD with the disease or trait. LD in the human genome, at least in non-African populations, is higher than had been expected, making the LD approach highly promising for mapping studies. However, intervals displaying association may be relatively wide, and hence contain many genes, especially in admixed or isolated populations, a finding borne out by the analysis of QTLs in the mouse (Flint and Mott, 2001).

LD is present when recombination between alleles is rare, because they are physically close together on the same chromosome. Thus, instead of the alleles of two adjacent markers being randomly distributed with respect to each other, as they would be if they occurred on separate chromosomes (or indeed far apart on the same chromosome), their distribution becomes nonrandom and the alleles exhibit LD. This also means that there are a limited number of haplotypes in any given region, reducing genetic complexity (Boehnke, 2000).

4.1. Measurement of LD

LD is unpredictable: unlike linkage, which is hierarchical, physical or genetic distance cannot be used to predict the LD between markers. Markers only a few hundred base pairs apart may be in weak LD, whereas markers separated by hundreds of kilobases may be in very strong LD. Consequently, LD must be measured experimentally.

Measurements of LD typically capture the strength of association between pairs of biallelic sites (pairwise LD), usually using the statistics D' (Lewontin, 1964) or r2 (sometimes denoted Δ2; Devlin and Risch, 1995) (see Wall and Pritchard, 2003 for a review). Both are normalized statistics, that is, they range from 0 (no LD) to 1 (complete LD), but their interpretation is different. D' is equal to 1 if just two or three of the four possible haplotypes of a pair of biallelic markers are present. Intermediate values of D', where all four haplotypes are present, are variable and difficult to interpret (Hudson, 1985; Hudson, 2001; Pritchard and Przeworski, 2001). In contrast, the r2 metric reaches 1 only if just two of the four haplotypes are present, that is, each allele is completely associated with just one allele of the other marker. It has a simple inverse relationship with the sample size required to detect association with susceptibility loci: to detect a susceptibility allele using a nearby genetic marker in LD with it, the sample size must be increased by a factor of 1/r2 compared with examining the susceptibility polymorphism directly.
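
Both statistics follow directly from haplotype and allele frequencies. The sketch below uses invented frequencies chosen so that only three haplotypes are present; it illustrates how D' = 1 can coexist with a very small r2, and the resulting 1/r2 sample-size inflation.

```python
def pairwise_ld(pAB, pA, pB):
    """D' and r^2 for two biallelic markers, from the frequency of the
    AB haplotype (pAB) and the allele frequencies pA and pB."""
    D = pAB - pA * pB
    if D >= 0:
        d_max = min(pA * (1 - pB), (1 - pA) * pB)
    else:
        d_max = min(pA * pB, (1 - pA) * (1 - pB))
    d_prime = abs(D) / d_max
    r2 = D * D / (pA * (1 - pA) * pB * (1 - pB))
    return d_prime, r2

# Only three haplotypes present (no aB): D' = 1 but r^2 is small
d_prime, r2 = pairwise_ld(pAB=0.10, pA=0.60, pB=0.10)
print(f"D' = {d_prime:.2f}, r^2 = {r2:.3f}")    # 1.00, 0.074
print(f"sample size inflation: x{1 / r2:.1f}")  # ~13.5
```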

Other measures examine LD across regions rather than just pairwise; for example, the population recombination parameter ρ measures how much recombination would be required under a particular population model to generate the observed LD (Wall and Pritchard, 2003). Metric maps of LD in the human genome are also being created (Maniatis et al., 2002; Tapper et al., 2003) on the basis of LD units rather than positions in kilobases.

4.2. Fine-mapping strategies: subjects

Families used for linkage analysis are not likely to have sufficient power for fine-mapping based on LD, and most investigators will collect a population sample of cases and controls, or nuclear families, for the mapping process. There has been considerable debate on which are the most effective study samples for complex gene-mapping efforts (see Article 51, Choices in gene mapping: populations and family structures, Volume 1 and Article 60, Population selection in complex disease gene mapping, Volume 2).

Unrelated individuals have most often been used for LD mapping and association studies, mainly because of the simplicity of the design and the ease of collecting samples; the advantages of family-based analysis are in general not thought to be substantial (Morton and Collins, 1998). The main consideration for case-control association studies is population stratification (see Article 75, Avoiding stratification in association studies, Volume 4), as allele frequencies vary substantially between different human populations (a population is stratified if it consists of two or more ethnic groups with differing allele frequencies). In effect, stratification results in poor case-control matching and false-positive (or sometimes false-negative) association results.

Using individual-specific inferred haplotypes as covariates in standard epidemiologic analyses (e.g., conditional logistic regression) is an attractive analysis strategy, as it allows adjustment for nongenetic covariates, provides haplotype-specific tests of association, and can estimate haplotype and haplotype × environment interaction effects (Kraft et al., 2005). Several methods, including most-likely-haplotype assignment and the expectation substitution approach, are available (Schaid, 2004; Zaykin et al., 2002; Stram et al., 2003).

A variety of methods for the use of genomic controls to avoid stratification bias has been proposed; these can detect and control for population stratification in genetic case-control studies (Devlin and Roeder, 1999; Devlin et al., 2001; Reich and Goldstein, 2001). Careful ethnic assessment of study populations (justified by the strong correspondence between genetic structure and self-reported race/ethnicity categories; Tang et al., 2005), combined with genomic control approaches, may be the most efficient strategy (Lee, 2004). Genotyping error can be minimized experimentally, for example, by using two different methods to genotype the same sample, and by using samples such as duplicates and identical twins to measure error rates. While this will increase costs, there may be significant enhancement in the ability to detect association, especially when the number and complexity of haplotypes is high (Kirk and Cardon, 2002).
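
The genomic control idea is easily sketched: estimate an inflation factor lambda from the median of the observed 1-df chi-squared statistics and deflate every statistic by it. This is a simplified version in the spirit of Devlin and Roeder (1999); the inflation in the toy data is invented.

```python
import numpy as np
from scipy.stats import chi2

def genomic_control(chi2_stats):
    """Estimate the inflation factor lambda as the ratio of the median
    observed statistic to the null median of a 1-df chi-squared
    distribution (~0.456), then deflate all statistics by lambda."""
    stats = np.asarray(chi2_stats, dtype=float)
    lam = max(1.0, np.median(stats) / chi2.ppf(0.5, df=1))
    return lam, stats / lam

# 5000 null markers inflated by hidden stratification (true lambda 1.3)
rng = np.random.default_rng(3)
observed = 1.3 * rng.chisquare(df=1, size=5000)
lam, corrected = genomic_control(observed)
print(f"estimated lambda = {lam:.2f}")
```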

Family-based association studies, such as the case-parent and case-sibling designs (Risch and Merikangas, 1996), gained popularity for disease mapping because they avoid the problems of case-control matching by making marker comparisons between members of the same family (Ewens and Spielman, 1995). However, theoretical and empirical study of the degree of population stratification bias in non-Hispanic European populations found the bias to be minimal (Wacholder et al., 2000). The use of nuclear families in association does not offer great advantages over case-control analysis for the detection of genotyping errors, particularly as there is no inheritance test for the nontransmitted alleles used as controls in family-based analysis. However, since phase information is available, family-based haplotype tests may be particularly useful in mapping studies (Lange and Boehnke, 2004; Lin et al., 2004).

Certain populations may have advantages for genetic mapping; for example, LD intervals around common alleles can reach up to 1 Mb in young subisolates, which may provide advantages for the initial locus positioning of complex traits (Varilo and Peltonen, 2004). Observations on LD parameters indicate that Eurasian populations (especially isolates with numerous cases) are efficient for genome scans, and populations of recent African origin (such as African-Americans) are efficient for identification of causal polymorphisms within a candidate sequence, since LD is lower (Lonjou et al., 2003). The main disadvantage of small isolates is statistical power: it may not be possible to obtain a large enough population for mapping studies, and for the same reason opportunities for replication in the same population may be limited.

4.3. Fine-mapping strategies

Attempts to localize complex disease-susceptibility genes have focused on methods aimed at detecting LD between individual genetic markers, their haplotypes, and putative disease-susceptibility loci, an approach already in use in complex disorders. The first applications were to major loci that could be assigned to haplotypes by family study (Kerem et al., 1989; Devlin and Risch, 1995; Terwilliger, 1995). These and other studies have provided the foundation for the application of LD mapping to positional cloning of common diseases with complex inheritance. A 10-cM region displaying linkage with a disease will contain about 20 000 SNPs, assuming its physical size is 10 megabases. To fine-map a 10-cM linked region with individual SNPs, about 3000 individual markers would be required, based on calculations used to estimate the number of SNPs required for a whole-genome scan (Carlson et al., 2003).

4.4. Mapping using haplotypes

Historically, association tests were limited to single variants, so that the allele was considered the basic unit for association testing. However, the use of haplotypes, or haploid genotypes, has become increasingly popular (see Article 12, Haplotype mapping, Volume 3). Many haplotype analysis methods require phase (i.e., family transmission) information inferred from genotype data; however, as the number of loci increases, the information loss due to haplotype ambiguity increases rapidly (Hoh and Hodge, 2000). Several strategies involving the expectation-maximization (EM) algorithm (Ott, 1977; Slatkin and Excoffier, 1996) have been proposed to overcome the problem of missing phase information when estimating haplotype frequencies (Excoffier and Slatkin, 1998; Hawley and Kidd, 1995; Chiano and Clayton, 1998). In general, EM estimation of haplotype frequencies from unphased genotypes is a better strategy than recruiting family members or performing intensive laboratory haplotyping for haplotype-based genetic studies, and the availability of population-based haplotype databases will simplify this process further. However, for most methods it is necessary either to discard families with ambiguous haplotypes or to analyze the markers separately, resulting in a potential loss of power (Cheng et al., 2003). For haplotype analysis, a frequency threshold for the inclusion of haplotypes (usually >3%) can be set to protect against misleading results due to rare alleles or haplotypes.
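
For two biallelic SNPs, only the double heterozygote is phase-ambiguous, and the EM iteration reduces to reapportioning those individuals between the two possible phase resolutions. The sketch below is a minimal illustration in the style of Excoffier and Slatkin; the genotype counts are invented.

```python
def em_haplotypes(counts, n_iter=100):
    """EM estimation of two-SNP haplotype frequencies from unphased
    genotype counts. counts maps (g1, g2), the copies (0/1/2) of
    alleles 'A' and 'B' at the two SNPs, to numbers of individuals."""
    freq = {"AB": 0.25, "Ab": 0.25, "aB": 0.25, "ab": 0.25}
    n_hap = 2 * sum(counts.values())
    for _ in range(n_iter):
        exp = dict.fromkeys(freq, 0.0)
        for (g1, g2), c in counts.items():
            if (g1, g2) == (1, 1):
                # E-step: split double heterozygotes between the two
                # phase resolutions, AB/ab (cis) vs Ab/aB (trans)
                p_cis = freq["AB"] * freq["ab"]
                p_trans = freq["Ab"] * freq["aB"]
                w = p_cis / (p_cis + p_trans)
                for hap, share in (("AB", w), ("ab", w),
                                   ("Ab", 1 - w), ("aB", 1 - w)):
                    exp[hap] += c * share
            else:
                # Phase is unambiguous if at most one SNP is heterozygous
                s1 = {2: "AA", 1: "Aa", 0: "aa"}[g1]
                s2 = {2: "BB", 1: "Bb", 0: "bb"}[g2]
                for hap in (s1[0] + s2[0], s1[1] + s2[1]):
                    exp[hap] += c
        freq = {h: x / n_hap for h, x in exp.items()}  # M-step
    return freq

# Invented sample of 100 individuals with strong LD between the SNPs
genotype_counts = {(2, 2): 20, (1, 1): 40, (0, 0): 30, (1, 0): 6, (0, 1): 4}
print({h: round(f, 3) for h, f in em_haplotypes(genotype_counts).items()})
```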

Methods for haplotype analysis of regions have focused on moving-window analysis, in which a scan of sets of tightly linked SNPs is made across the region of interest in order to identify the site of LD with the trait under test. This can require assigning a window width first and then analyzing multisite parental transmission data under this fixed width (Clayton, 1999; Zhao et al., 2000); other procedures that choose the window width within a preset range so as to maximize the LD in haplotype transmission data have also been proposed (Cheng et al., 2003).

The analysis of LD in the human genome has led to the proposed use of haplotype-map-based LD approaches to mapping genes (Cardon and Abecasis, 2003; Wall and Pritchard, 2003). This arose from the observation that LD in the human genome appears to consist of "haplotype blocks", stretches of DNA where strong LD exists between markers, punctuated by areas of weak LD where recombination rates are much higher (Jeffreys et al., 2001; Daly et al., 2001; Patil et al., 2001). These blocks extend for <10 to more than 100 kb, and have low haplotype diversity, meaning that, in theory, relatively few SNPs could be used to describe the haplotypes of a given region. Haplotype tagging is potentially an efficient way of mapping linkage loci, since it should be possible to use a small set of "htSNPs" to analyze each block in a linked region, reducing the effort by 75% or more (Johnson et al., 2001; Patil et al., 2001).

Under this model, mapping a linked locus of 10-20 cM would involve dividing the region into 100-200 blocks for analysis, each of which is tagged by a finite number of htSNPs (perhaps 5-10). Thus, a single large region could be fine-mapped with 500-1000 htSNPs. Various methods have been proposed for the selection of htSNPs (reviewed in Lin and Altman, 2004), including manual selection for small genomic regions (Daly et al., 2001; Johnson et al., 2001), systematic evaluation of subsets with a metric to score each candidate set (Patil et al., 2001), analysis of all pairwise comparisons to select the htSNPs explaining the most haplotype diversity (Daly et al., 2001), entropy measures (Judson et al., 2002; Avi-Itzhak et al., 2003; Hampe et al., 2003), or maximization of the squared correlation between the estimated and the true number of copies of each haplotype (Chapman et al., 2003; Stram et al., 2003). Alternative approaches measure how well individual SNPs and sets of SNPs predict one another (Bafna et al., 2003), use set theory and recursive searches for the minimal set of SNPs from which the maximum number of the other SNPs in the data set can be derived (Sebastiani et al., 2003), or cluster SNPs by pairwise LD measures and then select one htSNP per cluster (Carlson et al., 2004); a sketch of this last strategy is given below. More recently, principal component analysis (PCA) has been proposed as an efficient method, and evidence suggests that it tends to select the smallest set of htSNPs to achieve 90% reconstruction precision (Meng et al., 2003; Lin and Altman, 2004).
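
As an illustration of greedy r2 binning in the spirit of Carlson et al. (2004), the sketch below repeatedly picks the SNP that captures the most remaining SNPs at r2 above a threshold. It uses the squared genotypic correlation as a stand-in for haplotype r2, and the data are invented.

```python
import numpy as np

def greedy_tags(genotypes, r2_threshold=0.8):
    """Greedy r^2 binning: repeatedly select the untagged SNP covering
    the most untagged SNPs at r^2 >= threshold, until all are covered.

    genotypes: (n_individuals, n_snps) array of 0/1/2 allele counts.
    Returns the indices of the selected htSNPs."""
    g = np.asarray(genotypes, dtype=float)
    r2 = np.corrcoef(g, rowvar=False) ** 2   # genotypic r^2 proxy
    untagged = set(range(g.shape[1]))
    tags = []
    while untagged:
        remaining = sorted(untagged)
        best = max(remaining, key=lambda i: sum(
            r2[i, j] >= r2_threshold for j in remaining))
        tags.append(best)
        untagged -= {j for j in remaining if r2[best, j] >= r2_threshold}
    return tags

# Toy data: SNPs 0-2 are copies of one another; SNP 3 is independent
rng = np.random.default_rng(4)
base = rng.binomial(2, 0.3, size=200)
geno = np.column_stack([base, base, base, rng.binomial(2, 0.3, size=200)])
print("selected htSNPs:", greedy_tags(geno))  # [0, 3]
```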

While good computational methods are available for efficient analysis of haplotype blocks, it is not yet clear how well defined they are in real populations and whether they are stable across diverse ethnic groups (van den Oord and Neale, 2004). Some studies indicate that haplotype blocks are stable across diverse populations (Gabriel et al., 2002; Dawson et al., 2002), which would allow the generation of general human LD maps, analogous to the recombination maps created for linkage analysis (Maniatis et al., 2002). However, this model of the LD structure of the human genome may be excessively simplistic. For example, there is evidence that much of the genome may not be formed into haplotype blocks (Phillips et al., 2003), and that where blocks exist they are not necessarily discrete entities, because of long-range LD (Daly et al., 2001; Jeffreys et al., 2001; van den Oord and Neale, 2004).

The remaining challenge is then to refine the techniques for fine-mapping the causal polymorphism(s) within regions of high LD. Obviously, if strong blocks exist, then a combination of genetic and molecular biological methods will be required to identify which of the SNPs within the block are causal. Methods based on single-marker tests within a composite likelihood framework (Maniatis et al., 2005) can apply a model grounded in evolutionary theory that incorporates a parameter for the location of the causal polymorphism.

4.5. Positional cloning by linkage disequilibrium – genome-wide approaches

The success of linkage analysis in identifying genes for single-gene disorders reflects the fact that it has substantial power to identify rare high-risk disease alleles, since IBD allele sharing for this type of genetic risk factor will be very high between affected individuals in pedigrees. However, for modest risk alleles, such as those operating in complex disorders, allele sharing between affected subjects will be much less evident (Carlson et al., 2004); for example, it has been estimated that samples of 600-1000 affected sibling pairs would be required to reliably demonstrate locus-specific genetic effects causing a 27-30% population-wide increase in risk to siblings (Hauser et al., 1996); typical sample sizes for complex disease linkage analysis are in the low hundreds.

Consequently, the most accepted route to mapping complex disease genes, linkage followed by LD mapping or positional candidate gene analysis, may miss loci that have modest effects on risk unless very large sample sizes or meta-analyses are employed. Association analysis, at least theoretically, has more power to detect common disease alleles that confer modest risk (Risch and Merikangas, 1996). It is also easier to attain high statistical power, as case-control or family trio samples can be collected in larger numbers than multiply affected families, which can be elusive. However, because there are many more generations from the most recent common ancestor in an association sample, much higher marker densities are needed to detect association compared to linkage (Kruglyak, 1999; Gabriel et al., 2002).

Thus, whole-genome association has been proposed as an alternative to linkage followed by fine-mapping; in this approach, a dense genome-wide set of SNPs is tested for disease association under the assumption that if a risk polymorphism exists, it will either be genotyped directly or be in strong LD with one of the genotyped SNPs. This is essentially a single-point approach, where markers are analyzed one by one, but other approaches such as "moving window" haplotype analysis or multipoint haplotype analysis are also possible (Morris et al., 2003). Family samples can also be used for genome-wide analysis: Lin et al. (2004) describe an algorithm and a statistical method that efficiently and exhaustively applies haplotype information from sliding windows of all sizes to transmission disequilibrium tests, and can detect both common and rare disease variants of small effect.

Estimates of the number of SNPs required for whole-genome association vary, but a requirement for between 300 000 and 1.3 million SNPs has been suggested, on the basis that strong LD operates over short distances and on empirical analyses of specific regions (Kruglyak, 1999; Gabriel et al., 2002; Carlson et al., 2003). Both direct (gene-based) and indirect (neutral marker) mapping approaches have been suggested. In the indirect approach, a dense set of neutral SNPs is used in the hope of detecting LD with causative SNPs, whereas in the direct approach, each of the 25 000 or so human genes would be analyzed using a set of representative SNPs in each gene, with an attempt to focus on potentially functional SNPs (Kruglyak, 1999; Botstein and Risch, 2003; Neale and Sham, 2004). A focus on analyzing SNPs that alter coding regions (cSNPs) has been proposed, and there is some evidence that some complex disease-causing polymorphisms will be cSNPs; however, it is at least as likely that complex disease risk alleles will lie in noncoding regulatory regions of genes, as seen for nonhuman QTLs (Yalcin et al., 2004; Flint and Mott, 2001), such as promoters, the functionality of which is difficult and laborious to assess (Buckland et al., 2004).

Neale and Sham (2004) have proposed a shift toward a gene-based approach in which all common variants within a gene are considered jointly, thus capturing all potential susceptibility alleles. This has advantages for the consideration of genetic differences between populations, which are more readily resolved by a gene-based approach than by either a neutral SNP-based or a haplotype-based approach; negative findings are then subject only to the issue of power. Once all gene variants are characterized, the gene-based approach may become the natural end point for association analysis and will inform our search for functional variants relevant to disease aetiology.

Whole-genome approaches depend on assembling an adequate SNP marker map, for which the difficulty in selecting SNPs comes principally from ethnic variation in LD and in SNP frequency. Carlson et al. (2003) estimated that even if all 2.6 million SNPs known in 2003 were analyzed, only 80% of all common SNPs in European-origin populations, and 50% in African-origin populations, would be detected; as many as a quarter of all SNPs seen in African populations are "private", that is, they do not exist elsewhere, meaning that SNP marker maps may need to be population specific and very dense. Improvements in efficiency may be achieved by forming SNPs into haplotypes and haplotype blocks that can be tagged with htSNPs, as proposed for locus-specific fine-mapping. However, this will require the completion of the human haplotype map (The International HapMap Consortium, 2003).

Microsatellite markers have also been proposed for whole-genome association. Ohashi and Tokunaga (2003) calculated a markedly higher power for microsatellite markers than for SNPs, even when more SNPs are analyzed, suggesting that under certain assumptions the use of microsatellite markers is preferable to the use of SNPs for genome-wide screening; this will be helpful for researchers designing genome-wide LD studies with microsatellite markers.

DNA pooling (Barcellos et al., 1997) has been proposed as a method to economize on the number of genotypings required for whole-genome association studies. Pooled genotyping is a powerful and efficient tool for high-throughput association analysis, both case-control and family-based; pooling designs can significantly reduce the costs of a study and, being extremely efficient with DNA resources, also conserve scarce DNA (Sham et al., 2002; Norton et al., 2004). Sophisticated pooling designs are being developed that can take account of hidden population stratification, confounders, and interactions, and that allow the analysis of haplotypes (Hoh et al., 2003). Both microsatellites (Daniels et al., 1998; Breen et al., 1999) and SNPs (Breen et al., 2000) are amenable to the pooling approach. Pooling approaches that allow chip-based genotyping may be particularly cost-efficient and rapid.
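
As a simple illustration of how allele frequencies estimated from case and control pools might be compared, the sketch below applies a two-proportion z-test; it deliberately ignores pool-specific measurement error, which a real pooling analysis must add to the variance, and the numbers are invented.

```python
import math
from scipy.stats import norm

def pooled_freq_test(f_case, f_control, n_case, n_control):
    """Two-proportion z-test on allele frequencies estimated from case
    and control DNA pools (2N chromosomes per pool of N individuals)."""
    p = (f_case * n_case + f_control * n_control) / (n_case + n_control)
    se = math.sqrt(p * (1 - p) * (1 / (2 * n_case) + 1 / (2 * n_control)))
    z = (f_case - f_control) / se
    return z, 2 * norm.sf(abs(z))

# Frequencies read from two pools of 500 individuals each
z, p = pooled_freq_test(0.32, 0.26, 500, 500)
print(f"z = {z:.2f}, p = {p:.3f}")  # z = 2.96, p = 0.003
```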

Whole-genome analysis also raises the statistical problem of multiple testing and levels of significance when hundreds of thousands of genetic markers are used (Carlson et al., 2004), requiring substantial Bonferroni correction. There are no obvious methods that can exclude false positives while capturing true positives for weak genetic effects, aside from repeated replication and meta-analysis, which may be subject to bias and are not a substitute for an adequately powered primary study (Munafo and Flint, 2004). It may be possible to overcome this problem using very large sample sizes (e.g., 2000 cases and 2000 controls), and practical methods have been proposed to help identify important genetic factors more efficiently, such as ranking markers by proximity to candidate genes or by expected functional consequence. Single-marker tests within a composite likelihood framework have also been proposed, and would avoid heavy Bonferroni correction (Maniatis et al., 2005).
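
The scale of the required correction is easy to see (illustrative arithmetic only):

```python
# Bonferroni-corrected threshold for a hypothetical genome-wide scan:
# holding the family-wise error rate at 0.05 over 500,000 tests means
# each individual marker must reach p < 1e-7.
n_tests = 500_000
alpha = 0.05
print(f"per-test threshold: {alpha / n_tests:.1e}")  # 1.0e-07
```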
