Science in Medicine Free access | 10.1172/JCI34772
National Human Genome Research Institute, Bethesda, Maryland, USA.
Address correspondence to: Teri Manolio, National Human Genome Research Institute, 31 Center Drive, Room 4B-09, Bethesda, Maryland 20892-2154, USA. Phone: (301) 402-2915; Fax: (301) 402-0837; E-mail: manolio@nih.gov.
Find articles by Manolio, T. in: JCI | PubMed | Google Scholar
Find articles by Brooks, L. in: JCI | PubMed | Google Scholar
Find articles by Collins, F. in: JCI | PubMed | Google Scholar
Published May 1, 2008
The International HapMap Project was designed to create a genome-wide database of patterns of human genetic variation, with the expectation that these patterns would be useful for genetic association studies of common diseases. This expectation has been amply fulfilled with just the initial output of genome-wide association studies, identifying nearly 100 loci for nearly 40 common diseases and traits. These associations provided new insights into pathophysiology, suggesting previously unsuspected etiologic pathways for common diseases that will be of use in identifying new therapeutic targets and developing targeted interventions based on genetically defined risk. In addition, HapMap-based discoveries have shed new light on the impact of evolutionary pressures on the human genome, suggesting multiple loci important for adapting to disease-causing pathogens and new environments. In this review we examine the origin, development, and current status of the HapMap; its prospects for continued evolution; and its current and potential future impact on biomedical science.
The International HapMap Project was designed to create a public, genome-wide database of patterns of common human sequence variation to guide genetic studies of human health and disease (1–3). With the publication of the draft human genome sequence in 2001 (4) and the essentially finished version in 2003 (5), the HapMap emerged as a logical next step in characterizing human genomic variation, particularly of the millions of common single–base pair differences among individuals, or SNPs (see Glossary). The HapMap was designed to determine the frequencies and patterns of association among roughly 3 million common SNPs in four populations, for use in genetic association studies.
The HapMap has introduced a new paradigm into genomic research, primarily in the form of genome-wide association (GWA) studies, by making possible the cost-efficient assessment of much of the common genomic variation within an individual (1, 6). It has also provided new insights into evolutionary pressures on the human genome and has facilitated functional investigation and cross-population comparisons of candidate disease genes. In addition, it has led to important methodologic advances in imputation of untyped SNPs (that is, reliable estimation of genotypes at SNPs not typed on existing genotyping platforms based on information from typed SNPs) and in assessment of population substructure in genetic association studies. Finally, the open availability of HapMap samples (both DNA and cell lines) and the consent and consultation process through which they were collected have provided a valuable resource for continued development of genomic research methods, such as association studies of gene expression and other cellular phenotypes, and for quality assessment of genotyping data.
Most common diseases are caused by the interplay of genes and environment, with adverse environmental exposures acting on a genetically susceptible individual to produce disease (7, 8). Unlike Mendelian disorders such as sickle cell disease and cystic fibrosis, in which alterations in a single gene explain all or nearly all occurrences of disease, genes underlying common diseases are likely to be multiple, each with a relatively small effect, but act in concert or with environmental influences to lead to clinical disease (Figure 1) (9).
Figure 1. Genetic and environmental contributions to monogenic and complex disorders. (A) Monogenic disease. A variant in a single gene is the primary determinant of a monogenic disease or trait, responsible for most of the disease risk or trait variation (dark blue sector), with possible minor contributions of modifier genes (yellow sectors) or environment (light blue sector). (B) Complex disease. Many variants of small effect (yellow sectors) contribute to disease risk or trait variation, along with many environmental factors (blue sector).
Identifying these genetic influences would be quite difficult if the risk-associated allelic variants at a particular disease-causing locus were very rare, so that for a disease to be common there would be many different causative alleles. In contrast, the HapMap was designed to identify more common disease-causing variants based upon the “common disease, common variant” hypothesis, which suggests that genetic influences on many common diseases are attributable to a limited number of allelic variants (one or a few at each major disease locus) that are present in more than 1%–5% of the population (10–12). Evidence supporting this hypothesis was modest at the outset of the HapMap Project, and reliance on the hypothesis sparked considerable controversy (13–16). Understanding how that controversy played out and was ultimately resolved by the remarkable success of the genetic association studies enabled by the HapMap requires an understanding of genetic variation, population genetics, and the evolution of the HapMap itself.
SNPs are sites in the genome sequence of 3 billion nucleotide bases where individuals differ by a single base. Roughly 10 million such sites, on average about one site per 300 bases, are estimated to exist in the human population such that both alleles have a frequency of at least 1% (3). Most SNPs are biallelic, or have only two forms, which contributes to their being relatively easy to type with automated, high-throughput genotyping methods (17). In addition, their generally low rate of recurrent mutation makes them stable markers of human evolutionary history (17).
In theory, identifying common SNPs associated with disease would involve the relatively straightforward (although time-consuming and expensive) task of typing all 10 million common SNPs in individuals with and without disease and looking for sites that differ in frequency between the groups. Such an approach would be very expensive and would not capture rarer variants or structural variants (such as insertions, deletions, and inversions) that are not identified by genotyping of SNPs. However, the pattern of association among SNPs in the genome suggests a potential shortcut, based on haplotypes and linkage disequilibrium (LD). A haplotype is the combined set of alleles at a number of closely spaced sites on a single chromosome. Nearby SNP alleles tend to be associated with each other, or inherited together more often than expected by chance, because most arise through mutational events that each occur once on an ancestral haplotype background and are inherited with that background, rather than arising multiple times de novo on different backgrounds (18). This is because for most SNPs the rate of mutation, or novel SNP generation, is relatively low (roughly 10⁻⁸ per site per generation, or 30 new variants per haploid gamete), as are the rate of recombination occurring with each meiosis and the number of generations (roughly 10⁴) between currently living individuals and their most recent common ancestor (3). Each new allele is initially associated with the other allelic variants present on the particular stretch of ancestral DNA on which it arose, and these associations are only slowly broken down over time by recombination between SNPs and generation of new variants (Figure 2) (3).
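The two round figures quoted so far (about one common SNP per 300 bases, and about 30 new variants per haploid gamete) follow directly from the genome size, the estimated number of common SNPs, and the per-site mutation rate. The short sketch below simply reproduces that arithmetic; the constant and function names are illustrative choices, not taken from any HapMap software.

```python
# Back-of-the-envelope check of the figures quoted in the text.
GENOME_BASES = 3_000_000_000   # about 3 billion bases in the genome sequence
COMMON_SNPS = 10_000_000       # sites where both alleles have frequency >= 1%
MUTATION_RATE = 1e-8           # new mutations per site per generation


def bases_per_snp(genome_bases, n_snps):
    """Average spacing between common SNPs, in bases."""
    return genome_bases / n_snps


def new_variants_per_gamete(genome_bases, rate):
    """Expected number of new variants in one haploid gamete."""
    return genome_bases * rate


if __name__ == "__main__":
    print(f"bases per common SNP: {bases_per_snp(GENOME_BASES, COMMON_SNPS):.0f}")
    print(f"new variants per gamete: "
          f"{new_variants_per_gamete(GENOME_BASES, MUTATION_RATE):.0f}")
```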
Figure 2. Breakdown of LD around a new SNP. A mutation generating a novel SNP (red circle) occurs on an existing chromosome (dark blue) with multiple preexisting SNP alleles (dark blue circles) occurring in an ancestral haplotype that spans the entire chromosomal segment shown. After multiple meioses over many generations (arrows), the chromosomal segments flanking this variant will tend to be reshuffled by recombination, as shown by different colors. Over time, therefore, the segment containing the new variant and its surrounding ancestral SNP alleles becomes shorter and occurs on a variety of haplotypes associated with different flanking SNP alleles.
Two polymorphic sites are said to be in LD when their specific alleles are correlated in a population. High LD means that the SNP alleles are almost always inherited together; information about the allele of one SNP in an individual is strongly predictive of the allele of the other SNP on that chromosome (Figure 3). The LD between many neighboring SNPs generally persists because meiotic recombination does not occur at random, but is concentrated in recombination hot spots (19). Adjacent SNPs that lack a hot spot between them are likely to be in strong LD. A commonly used measure of LD, r², can be interpreted as the proportion of variation in one SNP explained by another, or the proportion of observations in which two specific pairs of their alleles occur together. Two SNPs that are perfectly correlated have an r² of 1.0, so that allele A of SNP1 in Figure 3, for example, is always observed with allele C of SNP2, and vice versa, while an r² of 0 could be interpreted as an observation of allele A of SNP1 providing no information at all about which allele of SNP4 is present.
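As a concrete illustration of r², the sketch below computes it from a list of two-SNP haplotypes using the standard population-genetic definition, r² = D² / [pA(1 − pA) pB(1 − pB)], where D = pAB − pA·pB. The function name and the toy haplotypes are invented for this example and are not HapMap data.

```python
def ld_r2(haplotypes):
    """Return r^2 between two biallelic SNPs.

    haplotypes: list of (allele_at_snp1, allele_at_snp2) tuples,
    one tuple per observed chromosome.
    """
    n = len(haplotypes)
    ref1, ref2 = haplotypes[0]  # arbitrary reference allele at each SNP
    p_a = sum(1 for h in haplotypes if h[0] == ref1) / n
    p_b = sum(1 for h in haplotypes if h[1] == ref2) / n
    p_ab = sum(1 for h in haplotypes if h == (ref1, ref2)) / n
    d = p_ab - p_a * p_b  # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic in the sample")
    return d * d / denom


if __name__ == "__main__":
    # Allele A at SNP1 always travels with allele C at SNP2: perfect LD.
    perfect = [("A", "C")] * 5 + [("G", "T")] * 5
    # Every allele combination is equally frequent: no LD.
    independent = [("A", "C"), ("A", "T"), ("G", "C"), ("G", "T")]
    print(ld_r2(perfect), ld_r2(independent))
```

With these toy inputs the perfectly correlated pair gives an r² of 1 and the independent pair gives 0, matching the two extremes described above.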
Figure 3. Tag SNPs can define common haplotypes. Variable sites (SNPs) are shown by colored bars in this simplified example (adjacent SNPs are generally separated by longer distances). Complete independence of these 6 SNPs would predict the possibility of 2⁶, or 64, different haplotypes (because n biallelic SNPs could generate 2ⁿ haplotypes), but in reality just 4 haplotypes comprise 90% of observed chromosomes, indicating that LD is present. To be specific, SNP1, SNP2, and SNP3 are strongly correlated, and SNP4, SNP5, and SNP6 are strongly correlated, so that any of SNP1–SNP3 (or SNP4–SNP6) could serve as tags for the other 2 SNPs in each group. Specific tags may be chosen for genotyping platforms because of stronger associations with additional SNPs in the region or technical ease of genotyping.
Because humans are a relatively young species, and because recombination does not occur at random, there have generally not been enough recombination events to separate a variant from the ancestral background on which it arose (20). A small number of SNPs could theoretically produce an enormous number of haplotypes if every SNP allele could occur in combination with every other SNP allele (n biallelic SNPs could generate 2ⁿ haplotypes), but in practice, far fewer combinations make up the bulk of the haplotypes observed in a population (Figure 3) (18, 21). Because of the strong associations among the SNPs in most chromosomal regions, only a few carefully chosen SNPs (known as tag SNPs; ref. 3) need to be typed to predict the likely variants at the rest of the SNPs in each region.
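The idea of tag SNPs can be sketched with a simple greedy procedure: repeatedly pick the SNP that is in strong LD with the largest number of SNPs not yet covered. This is only one plausible approach, shown under stated assumptions; it is not claimed to be the method used by the HapMap Consortium or by any genotyping platform. The r² cutoff of 0.8, the function names, and the toy haplotypes (six SNPs forming two perfectly correlated groups, loosely modeled on Figure 3) are all assumptions made for illustration.

```python
def r2(col_a, col_b):
    """Squared allele correlation between two SNPs coded 0/1 per chromosome."""
    n = len(col_a)
    p_a = sum(col_a) / n
    p_b = sum(col_b) / n
    p_ab = sum(1 for x, y in zip(col_a, col_b) if x == 1 and y == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    return 0.0 if denom == 0 else (p_ab - p_a * p_b) ** 2 / denom


def greedy_tags(haplotypes, threshold=0.8):
    """Pick tag SNPs (0-based indices) so every SNP has r^2 >= threshold with a tag."""
    n_snps = len(haplotypes[0])
    cols = [[h[i] for h in haplotypes] for i in range(n_snps)]
    # covers[i] is the set of SNPs that SNP i can stand in for (itself included).
    covers = {
        i: {j for j in range(n_snps)
            if i == j or r2(cols[i], cols[j]) >= threshold}
        for i in range(n_snps)
    }
    untagged = set(range(n_snps))
    tags = []
    while untagged:
        # Choose the SNP covering the most untagged SNPs; ties go to the lowest index.
        best = max(range(n_snps), key=lambda i: (len(covers[i] & untagged), -i))
        tags.append(best)
        untagged -= covers[best]
    return tags


if __name__ == "__main__":
    # Invented data: SNPs 0-2 always agree, SNPs 3-5 always agree.
    haplotypes = [
        (0, 0, 0, 0, 0, 0),
        (1, 1, 1, 0, 0, 0),
        (0, 0, 0, 1, 1, 1),
        (1, 1, 1, 1, 1, 1),
    ]
    print("possible haplotypes:", 2 ** len(haplotypes[0]))
    print("observed haplotypes:", len(set(haplotypes)))
    print("tag SNP indices:", greedy_tags(haplotypes))
```

In this toy example 64 haplotypes are possible but only 4 are observed, and two tags (one per correlated group) are enough to predict the remaining four SNPs.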
The size of regions of strong LD varies dramatically across the genome, and to a lesser extent across populations, so that SNPs selected at random or even those spaced at regular intervals across the genome will not efficiently capture the bulk of genomic variation (3). The mean size of regions of strongly associated SNPs, sometimes called haplotype blocks, is estimated to be 22 kb in populations of European or Asian ancestry and 11 kb in populations of recent African ancestry (18). This difference among populations is expected based on population size and migration history; compared with the parent populations, populations with founder effects (a few relatively isolated individuals whose descendants intermarry) have larger regions with stronger associations among SNPs. It has been estimated that most of the variation in the human genome could be captured by genotyping several hundred thousand to 1 million tag SNPs, but selection of the best tag SNPs requires precise mapping of the patterns of LD (3). This was the justification for developing the human haplotype map (1, 3, 22, 23).
The International HapMap Project was a consortium among researchers in Canada, China, Japan, Nigeria, the United Kingdom, and the United States, organized to consider the ethical issues, develop the scientific plan, choose the populations and SNPs to be typed, carry out the genotyping and data analysis, and release the data into the public domain (1, 3). The consortium produced a human haplotype map by genotyping 270 samples, from four populations with diverse geographic ancestry, provided by people who gave consent specifically for this project and related research. These samples included 30 trios (mother, father, and adult child) from the Yoruba in Ibadan, Nigeria; 30 trios from the Centre d’Étude du Polymorphisme Humain collection of Utah residents of Northern and Western European ancestry; 45 unrelated Han Chinese in Beijing; and 45 unrelated Japanese in Tokyo (24). The Utah samples were previously collected but were reconsented for this purpose. New samples were collected from the Yoruba, Han Chinese, and Japanese after processes of community engagement (25). The newly collected samples were permanently disconnected from individual identifiers and had no associated phenotype data. Cell lines and DNA from the samples are available for research from the nonprofit Coriell Institute for Medical Research (26).
Approximately 1 million SNPs were genotyped in phase I of the project, and a description was published in 2005 (1). This was followed by the phase II HapMap of over 3 million SNPs, published in 2007 (2). Genotyping in phase II was attempted for about 4.4 million distinct SNPs, of which roughly 1.3 million either could not be typed, were not polymorphic in any of the populations, or did not pass genotyping quality control filters. Certain regions of the genome were recognized as being challenging to study, such as centromeres, telomeres, gaps in genome sequence, and segmental duplications, and only one attempt was made to develop a genotyping assay before such a region was declared to be not HapMap-able (1). All the genotype data are freely available from the HapMap Data Coordination Center (27) and dbSNP (28).
These data revealed the pattern of association among SNPs in the genome and how these patterns vary across populations. Although the four populations studied show generally similar patterns of variation, the Yoruba population has less LD overall and shorter haplotype blocks, as noted above, but the regions with higher LD are similar across the populations. The diversity of haplotypes within blocks also varies across populations, with the Yoruba having an average of 5.6 haplotypes in each block compared with 4.0 in the Japanese and Han Chinese populations (1).
Studies in additional populations have shown that the tag SNPs chosen using the HapMap are generally transferable across other populations, but there are some limitations, particularly for rarer SNPs and for populations with substantial proportions having recent African ancestry (29). Fluctuations in estimates of allele frequency and LD because of small sample sizes also limit the transferability of HapMap-derived tag SNPs, so additional samples from the populations used to develop the HapMap as well as from seven more populations have recently been genotyped across the genome (30).
Advances in genotyping technology have vastly increased the number of variants that can be typed and decreased the per-sample costs (31–34). These advances have made possible the dense genotyping needed to capture the majority of SNP variation within an individual at a sufficiently low cost to allow the large sample sizes needed for comparison of individuals with and without disease. When the HapMap Project began, the cost per sample per SNP was about $0.40; by 2005 the cost had dropped to about $0.01, and the current cost is about one-tenth of that for platforms typing nearly 1 million SNPs at once.
Information generated by the HapMap on LD patterns among SNPs has permitted the design of efficient and comprehensive genotyping platforms by elucidating tag SNPs that serve as proxies for the largest number of SNPs and eliminating redundant SNPs or SNPs that cannot be assayed reliably (1). Currently, genome-wide scans cost less than $1,000 per sample and include about 1 million SNPs, with more SNPs in regions with low LD than in regions with high LD. As genotyping platforms are developed to allow for an increase in the number of tag SNPs typed, they capture more variation in every population, so that even samples of recent African ancestry have most of the genome covered at high r² (Table 1) (2).
Accuracy of these platforms is paramount, because genotyping errors such as incorrectly typing some heterozygotes as homozygotes can cause spurious results and obscure the true associations, particularly if errors are differential between cases and controls (35, 36). Genotyping errors can also affect parent-offspring trio studies that are robust to other types of bias, such as differences in ancestral origin (population structure) between individuals with and without disease (37). Efforts to improve the accuracy of genotyping platforms and genotype calling algorithms are continuing and rely heavily on use of HapMap samples and data for quality assessment (32, 35, 38). An important step in evaluating the reliability of findings at present is ensuring that they are repeated on a second, independent genotyping platform (39).
Cost efficiency of genotyping platforms is a major consideration for GWA studies because of the very large sample sizes needed to detect genetic variants of modest effect. To provide the same statistical power for detecting a true association between a genetic variant and a disease, assuming such an association exists, sample sizes must increase with the following: (a) greater number of genotypes and association tests performed, and thus greater probability of spurious associations (type I error); (b) greater genotyping error or phenotypic misclassification; (c) lower size of the genetic effect (risk of disease conferred by the disease-associated allele); (d) lower frequency of the risk allele; (e) lower r² between the disease-associated SNP and the tag SNP typed on the platform; and (f) heterogeneity of the genetic association, caused by multiple genes that contribute to the disease, ancestry differences across population subsets, or gene-gene or gene-environment interactions.
The number of tests is a major factor in determining the statistical power of GWA studies, in which 10⁶ or more association tests (at least one for each SNP) are performed. Although these tests are not strictly independent because of LD, the current convention is to apply a Bonferroni correction (which assumes independence and is thus overly conservative) by dividing the conventional P value of 0.05 by the number of tests performed (40). This requires P values in the 5 × 10⁻⁷ to 5 × 10⁻⁸ range to define an association, a stringent level of significance. Were one to be satisfied with a P value of 0.05, detecting a variant of 10% allele frequency conferring a 1.5-fold increased risk with 80% statistical power would require only 360 cases and 360 controls, but 50,000 potentially spurious associations would be expected by chance out of 1 million SNPs tested. Lowering the P value to 5 × 10⁻⁷ vastly reduces the number of spurious associations but requires a more than 4-fold increase in the sample size, to roughly 1,590 cases and 1,590 controls, for the same statistical power (41). Risk associated with a variant is often assessed by the odds ratio (OR), the odds of disease in individuals with the variant divided by the odds of disease in those without the variant. ORs for many of the genetic variants believed to contribute to the risk of complex diseases are likely to fall in the 1.2–1.3 range or lower, considerably lower than the OR of 3.2, for example, for Alzheimer disease associated with the apolipoprotein E ε4 allele (42) or the OR of 4.1 for deep venous thrombosis associated with oral contraceptive use (43). Statistical power is known to decline steeply below an OR of 1.2 (44) and as minor allele frequency (MAF) falls (23); a study of 6,000 cases and 6,000 controls has been estimated to provide statistical power of 94%, 43%, and 3% for MAFs of 0.1, 0.05, and 0.02, respectively, conferring an OR of 1.3 at P < 10⁻⁶ (44).
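The multiple-testing arithmetic and the definition of the OR in this paragraph can be made concrete with a few lines of code. The sketch below computes the Bonferroni-corrected threshold, the number of false-positive associations expected by chance at an uncorrected threshold, and an OR from a 2 × 2 table. The function names and the carrier counts in the OR example are hypothetical and chosen only for illustration; the sample-size figures from the text are reused as given rather than re-derived.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


def expected_false_positives(alpha, n_tests):
    """Associations expected by chance alone if no SNP is truly associated."""
    return alpha * n_tests


def odds_ratio(case_carriers, case_noncarriers, control_carriers, control_noncarriers):
    """Ratio of the odds of carrying the variant in cases versus controls.

    For a 2 x 2 table this equals the odds of disease in carriers divided
    by the odds of disease in noncarriers, the definition given in the text.
    """
    return (case_carriers / case_noncarriers) / (control_carriers / control_noncarriers)


if __name__ == "__main__":
    for n_tests in (100_000, 1_000_000):
        print(f"{n_tests:>9,} tests -> threshold {bonferroni_threshold(0.05, n_tests):.0e}")
    print(f"chance hits at P = 0.05 among 1 million SNPs: "
          f"{expected_false_positives(0.05, 1_000_000):,.0f}")
    # Sample sizes quoted in the text for P = 0.05 versus P = 5 x 10^-7.
    print(f"sample-size inflation: {1590 / 360:.1f}-fold")
    # Hypothetical counts: 300 of 1,000 cases and 200 of 1,000 controls carry the variant.
    print(f"example OR: {odds_ratio(300, 700, 200, 800):.2f}")
```

Dividing 0.05 by 100,000 and by 1 million tests gives the 5 × 10⁻⁷ to 5 × 10⁻⁸ range cited above.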
Current generation high-throughput genotyping platforms are extraordinarily efficient at genotyping SNPs, but, as stated above, they are less effective at genotyping structural variants, such as insertions, deletions, inversions, and copy number variants. Although not as common as SNPs, these variants also occur commonly in the human genome (45). The HapMap was not designed to capture these variants, although it can be used indirectly to do so, particularly for small deletions that are in LD with SNPs (46–48). Copy number variants, in which stretches of genomic sequence of roughly 1 kb to 3 Mb in size are deleted or are duplicated in varying numbers, have gained increasing attention because of their apparent ubiquity and potential dosage effect on gene expression (47, 49, 50). A variety of diseases such as DiGeorge syndrome and α-thalassemia have been shown to be caused by large deletions, insertions, and other structural variants, and the potential for structural variants to influence phenotypes in healthy individuals is now recognized (51, 52). Expansion and refinement of current genotyping platforms increasingly focus on capturing copy number variants adequately, and some success has already been achieved (38, 53). Array and sequencing methods are also being used to type structural variants using the HapMap samples for development and cross-validation of the methods (54, 55).
The technological advances directly stimulated or indirectly facilitated by the HapMap have had a profound impact on the study of the genetics of common diseases, exceeding even the expectations of the project’s originators (4). Not only has the HapMap enabled a new generation of genetic association studies through the application of high-density, genome-wide genotyping to carefully characterized individuals with and without disease, but it has also stimulated the development and testing of analytic methods for reducing spurious associations (56, 57), assessing claims of replication of genotype-phenotype associations (39, 58), identifying and adjusting for ancestry differences among individuals and groups (35, 59, 60), and imputing untyped SNPs across different genotyping platforms (61).
The short history of high-density GWA scanning (i.e., about 100,000 SNPs or more) to date has demonstrated the striking success of this approach in finding genetic variants associated with disease. Variants or regions associated with nearly 40 complex diseases or traits have been identified and replicated in diverse population samples (Figure 4). Complex conditions as dissimilar as macular degeneration and exfoliative glaucoma (Table 2), diabetes (Table 3), cancer (Table 4), inflammatory bowel disease (Table 5), cardiovascular disease (Table 6), neuropsychiatric conditions (Table 7), autoimmune and infectious diseases (Table 8), and a variety of anthropometric and laboratory traits (Table 9) have recently yielded strong, convincing, replicated associations in GWA scanning. Several of these discoveries have suggested etiologic pathways not previously implicated in these diseases, such as the autophagy pathway in inflammatory bowel disease (62), the complement pathway in macular degeneration (63), and the HLA-C locus in control of viral load in HIV infection (64). Note that the estimated ORs for most of these associations are relatively modest, at 2.0 or less, although smaller studies of rare diseases can give quite large ORs (and very wide confidence intervals), as in the case of the 20-fold increased risk (95% confidence interval, 10.8 to 37.4) associated with rs3825942-G in exfoliation glaucoma (65).
Figure 4. SNP-trait associations detected in GWA studies. Associations significant at P < 9.9 × 10⁻⁷ are shown according to chromosomal location and involved or nearby gene, if any. Colored boxes indicate similar diseases or traits.
A major strength of the genome-wide approach facilitated by HapMap-based genotyping platforms has been its freedom from reliance on prior knowledge, imperfect as it is, of genes likely to be related to the trait of interest. Instead, GWA studies survey the entire genome in a comprehensive, systematic, even agnostic manner, relaxing the dependence on strong prior hypotheses (39). Most of the associations found in these studies have not been with genes previously thought to be related to the disease under study, and some of the most reliably replicated associations, such as those of the chromosome 8q24 region and prostate cancer (66, 67), or the 5p13.1 region and Crohn disease (35, 68), have been in genomic regions carrying no known genes at all (69). Of considerable interest in determining pathophysiology have been variants or regions implicated in multiple diseases, such as the 8q24 region in prostate, breast, and colorectal cancer (66, 70, 71) and the PTPN2 gene in type 1 diabetes and Crohn disease (35). Notable among these are the CDKN2A/B cell-cycle variants on chromosome 9p21, which have been implicated in coronary disease (72, 73), type 2 diabetes (61, 74), and frailty (75). Prior interest in CDKN2A/B focused on the fact that germline deletions of these genes confer a risk of familial malignant melanoma (76), and it is surprising to see potentially regulatory variants of these same cell-cycle genes implicated in these additional common conditions.
In addition to its pivotal role in the design of genotyping platforms, the HapMap played important roles in these discoveries, including providing better estimates of allele frequencies (64), comparing allele frequencies across the four HapMap populations (72), identifying additional variants for testing, and defining LD blocks and the genes contained within them. These LD patterns are critical in following up initial association findings because they help in selecting variants for cost-effective follow-up genotyping (64, 77), suggesting independence of closely located SNPs in regions of low LD, interpreting failure to detect associations with previously identified variants in populations with varying LD patterns (78), and, crucially, defining haplotypes containing the disease-associated variants (64) (see Use of HapMap data in association studies and Use of HapMap cell lines in genomic research).
As noted above, GWA studies have provided startling new insights into pathophysiology, such as the role of the complement system in macular degeneration (63) or the potential for genetic variants that reduce the efficiency of intracellular mechanisms for disposing of unwanted cytoplasmic constituents (autophagy) to cause disease (62).
In addition to the pathophysiologic implications of genetic discoveries based on the HapMap, these findings have raised the possibility of using general population-based screening, or more targeted screening of individuals with positive family histories for these conditions, for identifying high-risk, presymptomatic subjects, determining the earliest manifestations of these conditions, and facilitating early trials of preventive therapies (79). Although the increases in risk detected in these studies are typically modest, in the 1.2- to 1.5-fold range as noted above, these associations can point the way to important therapeutic avenues and, when considered in combination, may identify individuals at substantially increased risk (61). This information can be particularly important, even in the absence of specific pharmaceutical agents targeted to such individuals, for more aggressive efforts to reduce known risk factors that can be modified, such as obesity in prediabetes and smoking in age-related macular degeneration (AMD) (80, 81). Even modest risk factors may be valuable in individualizing surveillance programs such as mammography, prostate-specific antigen (PSA) screening, or colonoscopy, although further research will be needed to explore the effectiveness of such approaches. To the degree that they determine treatment response, genetic variants may also be useful in tailoring pharmacologic therapy to individuals most likely to respond — and not react adversely — to specific treatments (82).
In the long run, the greatest contribution of genetic discoveries facilitated by the HapMap may be in the identification of new therapeutic targets. Such treatments may well be effective in individuals without the specific genetic variant that led to the discovery of these targets. Perhaps the best example is the development of HMG-CoA reductase inhibitors that effectively lower cholesterol levels in nearly everyone who takes them, with the ironic exception of individuals with homozygous absence of LDL receptors, who were instrumental in identifying this key metabolic pathway (83). Even variants with very modest ORs may provide clues to key drug targets, as demonstrated by 2 diabetes-related genes. First, the PPARG Pro12Ala variant has an OR of 1.25 for diabetes, but the protein product of this gene is recognized as the receptor for the thiazolidinedione class of insulin sensitizers, also referred to as PPARγ agonists (80, 84). Second, variants of the KCNJ11 gene have been associated with diabetes, although with an OR of 1.2, in a variety of GWA and other studies, but KCNJ11 codes for a subunit of the ATP-sensitive potassium channel that, together with the sulfonylurea receptor, forms a major target for diabetes drug therapy (82, 85).
An important use of HapMap data is to test for the presence of population structure, or allele frequency differences related to geographic (and presumably ancestral) differences within and across study populations, even in relatively homogeneous groups, such as the Britons studied in the Wellcome Trust Case Control Consortium (WTCCC; ref. 35). Thirteen genomic regions were found to differ significantly among geographic areas of Great Britain once samples of non-European ancestry were omitted (based on estimates of the genetic distance of individual WTCCC samples from the three original HapMap analysis panels, another key use of HapMap data), but this divergence had little impact on the genetic associations identified with the seven common diseases studied by that consortium. Although some geography-based differences may be just the result of population drift or founder effects, they do provide tantalizing clues to possible selection pressures on populations ancestral to those now residing in the United Kingdom (35).
In fact, HapMap data have provided critical evidence in support of recent positive selection, or selection in favor of new alleles, for genes related to response to infectious agents such as malaria, dietary factors such as disaccharides and fatty acids, and pigmentation differences that confer advantages at different latitudes (86, 87). Such analyses rely on the fact that under strong positive selection, an allele may rise to high frequency so rapidly that associations extend for substantial distances along chromosomes (the long-range haplotype; ref. 88) because there has not been time for them to be broken down by recombination (89). Regions of unusually low diversity suggestive of such selective sweeps have sometimes been detected in three or all four HapMap population samples, but are more commonly found in one or two populations, presumably because of local environmental selective pressures. They are also easier to detect in a single population by comparison with the other populations (88). Investigation of such loci may yield valuable insight into pathways governing responses to environmental pathogens and other functional effects as yet unsuspected.
HapMap data have also been crucial in facilitating the pooling and comparison of association data across populations, so that differences in ancestral background can now be adjusted for in a continuous fashion without loss of data through exclusion or loss of statistical power through stratification (90). Allele frequency differences in HapMap populations have been used to suggest reasons for differences in associations in individuals of varying geographic origin, often manifest as an association in persons of European ancestry that fails to replicate in other groups, particularly those of more recent African origin (72, 91, 92). Lower LD among SNPs in the HapMap Yoruba population, and in other individuals of recent African ancestry (91), has also been cited as a reason for cross-population differences in associations. Similar comparisons have been possible with Asian populations (72, 92) and have been effective in focusing on more likely causative SNPs affecting all populations.
Important conclusions from these preliminary successes include the relatively modest effect sizes observed for genetic variants associated with common diseases and the consequent need for very large sample sizes to detect them. Sample sizes for detecting and confirming variants related to diabetes, obesity, and breast cancer have been in the many tens of thousands (61, 93, 94). These contrast sharply with the success of early AMD studies, in which only 100–200 cases were needed, but the identified variants conferred a much greater risk of disease (63, 95). Estimates of residual heritability after accounting for the variants found in this first round of analyses suggest that numerous other variants of modest effect, undiscovered structural variants, or less common variants of large effect remain to be found for most of these diseases (61, 90, 96). It is important to note that the most robust findings, those that survive multiple rounds of replication in an initial study and are subsequently replicated in other studies, are often not the most statistically significant associations in the initial GWA scan, and may not even be in the top few hundred associations (70, 97).
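The link between effect size and required sample size can be illustrated with a standard two-proportion approximation. The Python sketch below estimates the number of cases (with an equal number of controls) needed for an allele-based test, given a control allele frequency and a per-allele odds ratio. Everything specific here is an assumption made for illustration: the function names, the default significance threshold of 5e-8, 80% power, a multiplicative model, and a directly genotyped causal variant. It is a rough guide, not the power calculation used in any cited study.

```python
from math import ceil, sqrt
from statistics import NormalDist


def case_allele_freq(control_freq, odds_ratio):
    """Allele frequency in cases implied by a per-allele odds ratio."""
    return odds_ratio * control_freq / (1.0 - control_freq + odds_ratio * control_freq)


def cases_needed(control_freq, odds_ratio, alpha=5e-8, power=0.8):
    """Approximate cases required (equal controls assumed), allele-based test."""
    p0 = control_freq
    p1 = case_allele_freq(p0, odds_ratio)
    if p1 == p0:
        raise ValueError("odds ratio of 1 implies no detectable difference")
    p_bar = (p0 + p1) / 2.0
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-sided threshold
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha * sqrt(2.0 * p_bar * (1.0 - p_bar))
                 + z_beta * sqrt(p0 * (1.0 - p0) + p1 * (1.0 - p1))) ** 2
    alleles_per_group = numerator / (p1 - p0) ** 2
    # Each individual contributes two alleles.
    return ceil(alleles_per_group / 2.0)


if __name__ == "__main__":
    # Hypothetical inputs: a common allele (30% in controls) at several odds ratios.
    for odds_ratio in (1.1, 1.2, 1.5, 2.5):
        print(odds_ratio, cases_needed(0.30, odds_ratio))
```

Because the required sample scales with the inverse square of the frequency difference between cases and controls, small odds ratios drive the requirement up steeply, which is the contrast the text draws between the AMD studies and later studies of diabetes, obesity, and breast cancer.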
Another important lesson from these studies has been that variants in noncoding regions, rather than nonsynonymous coding SNPs (which change the amino acid encoded in the resulting protein), are likely to be causative in most instances (35). That regulation of protein products, rather than differences in protein structure or function, may matter most for disease risk was suspected before the advent of the HapMap and GWA studies, but the relative importance of the two was unknown; this finding demonstrates the value of an agnostic approach to genome-wide interrogation (39, 98).
Several studies have shown that tag SNPs chosen on the basis of the data from the four populations included in HapMap phases I and II apply well to other populations (2, 29). Still, to allow better choice of tag SNPs and more detailed analyses for various populations, additional samples were collected from the same four initial HapMap populations and from seven additional populations: Luhya in Webuye, Kenya; Maasai in Kinyawa, Kenya; Tuscans in Italy; Gujarati Indians in Houston, Texas; Chinese in metropolitan Denver, Colorado; Mexican ancestry in Los Angeles, California; and African ancestry in the southwestern United States. These 1,301 extended HapMap samples, now available from the Coriell Institute, have been genotyped on the Affymetrix 6.0 platform and the Illumina 1 million SNP chip, and genome-wide sequencing of these samples will begin soon. As with the initial HapMap samples, these will become a standard research resource for many additional studies and will be particularly useful in providing additional information on rare variants.
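Tag SNP selection can be sketched as a set-cover problem: choose a small set of SNPs such that every SNP in a region is in strong LD with at least one chosen SNP. The Python example below uses pairwise r² computed from phased 0/1 haplotypes and a greedy selection rule. The r² threshold of 0.8, the greedy strategy, the function names, and the toy data are assumptions made for illustration; this is not presented as the procedure used by the HapMap project or by any genotyping platform.

```python
def r_squared(a, b):
    """Pairwise LD (r^2) between two biallelic SNPs coded 0/1 on phased haplotypes."""
    n = len(a)
    pa = sum(a) / n
    pb = sum(b) / n
    pab = sum(x & y for x, y in zip(a, b)) / n
    denom = pa * (1 - pa) * pb * (1 - pb)
    if denom == 0:
        return 0.0  # at least one SNP is monomorphic in this sample
    d = pab - pa * pb
    return d * d / denom


def greedy_tag_snps(snps, threshold=0.8):
    """Pick tags until every SNP has r^2 >= threshold with some tag (or is a tag)."""
    names = list(snps)
    covers = {
        s: {t for t in names if s == t or r_squared(snps[s], snps[t]) >= threshold}
        for s in names
    }
    untagged = set(names)
    tags = []
    while untagged:
        # Choose the SNP that captures the most still-untagged SNPs.
        best = max(names, key=lambda s: len(covers[s] & untagged))
        tags.append(best)
        untagged -= covers[best]
    return tags


if __name__ == "__main__":
    # Toy haplotypes (8 chromosomes); SNP names are hypothetical.
    toy = {
        "snp_a": [0, 0, 1, 1, 0, 1, 1, 0],
        "snp_b": [0, 0, 1, 1, 0, 1, 1, 0],  # identical to snp_a
        "snp_c": [1, 1, 0, 0, 1, 0, 0, 1],  # perfectly anti-correlated with snp_a
        "snp_d": [0, 1, 0, 1, 0, 1, 0, 1],  # independent of the others
    }
    print(greedy_tag_snps(toy))
```

Because r² depends on allele frequencies and LD in the sample used, a tag set chosen in one population may capture less variation in another, which is part of the motivation for genotyping additional populations.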
The GWA studies described above have shown substantial early promise, and new applications of genome-wide technology to well-characterized population samples are continuing. However, important limitations of GWA studies should be kept in mind, including their lack of statistical power for identifying associations with rare sequence variants, because these are poorly represented on current genotyping platforms, and the need for very large numbers of samples (see Limitations of HapMap-based GWA studies). The benefits of collaboration across multiple GWA studies for replicating initial associations and developing common methods have been amply demonstrated by the pioneering efforts of the WTCCC study of seven complex diseases and common controls (35). Several other collaborative programs are currently in the pipeline (Table 10).
Data from many of these GWA studies are released to the scientific community through the Database of Genotype and Phenotype (dbGaP) of the National Center for Biotechnology Information (99). Study descriptions, protocols, data summaries, and other group-level data are available in the open-access portion of the dbGaP website, while individual-level data are provided through a controlled-access process consistent with the informed consent provided by study participants, as described in the recently finalized policy for sharing of data obtained in NIH-supported or -conducted GWA studies (100). This commitment to rapid data release builds on the now well-established ethic in genomic community research projects of maximizing data access.
Research to pursue initial GWA discoveries will include replication studies in the same phenotypes and populations, to ensure the robustness of the findings, and in similar but not identical phenotypes and populations, to extend the findings and increase understanding of their mechanisms and importance (39). Investigation of disease subtypes, such as estrogen receptor–positive versus –negative breast cancer, or young-onset or severely progressive forms of prostate cancer or diabetes, may be of great value in identifying the subgroups of alleles conferring the highest risk and the individuals who carry them (101). Sequencing DNA from large numbers of people for the genomic regions showing strong associations with complex traits, guided by HapMap data on LD patterns to identify limits of regions to be sequenced, will be needed to identify rare, potentially causal variants poorly tagged by existing genotyping platforms (102). The recently initiated 1,000 Genomes Project will produce light sequence coverage (an average of two sequencing reads at any place in the genome) of about 1,000 individuals that will greatly extend the catalog of human genetic variation and limit follow-up sequencing of specific genotype-phenotype association findings to the search for very rare variants. Fine-mapping of candidate regions with SNPs optimally chosen based on HapMap data to maximize the regional genomic variation captured while minimizing costs will refine association signals and narrow the list of possible functional variants. Functional studies of this smaller list of variants, using experimental approaches such as knockdown and overexpression (102) and examination of their relationship to gene expression, as recently demonstrated for asthma-associated variants in ORMDL3 (103), will help to determine the mechanisms of gene function and how they are perturbed in disease, providing insights into possible preventive or therapeutic strategies.
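The reasoning behind light sequence coverage of many individuals can be illustrated with a simple model. The Python sketch below assumes that the number of reads covering a site follows a Poisson distribution, that each read samples one chromosome at random, and that a variant counts as observed when it appears in at least two reads across the whole sample. These are simplifying assumptions introduced here, as are the function names and example frequencies; sequencing error and genotype calling are ignored.

```python
from math import exp, factorial


def poisson_pmf(k, mean):
    """Probability of exactly k events under a Poisson model."""
    return exp(-mean) * mean ** k / factorial(k)


def prob_at_least(k, mean):
    """Probability of k or more events under a Poisson model."""
    return 1.0 - sum(poisson_pmf(i, mean) for i in range(k))


def expected_allele_reads(allele_freq, n_individuals, mean_depth):
    """Expected reads carrying an allele, pooled over all individuals at one site."""
    return mean_depth * n_individuals * allele_freq


if __name__ == "__main__":
    depth, people = 2.0, 1000  # figures quoted in the text
    # Chance that a given site in one individual is covered by no reads at all.
    print("uncovered in one person:", poisson_pmf(0, depth))
    # Hypothetical allele frequencies, from common to a single copy in the sample.
    for freq in (0.05, 0.01, 0.001, 0.0005):
        mean_reads = expected_allele_reads(freq, people, depth)
        print(freq, mean_reads, prob_at_least(2, mean_reads))
```

Under this model any single individual is covered unevenly, but pooling reads across about 1,000 people gives many observations of all but the rarest alleles, consistent with the expectation that follow-up sequencing can concentrate on very rare variants.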
Finally, translation of these strategies into improved detection or targeting of high-risk individuals (61, 102) or pharmacotherapies derived directly from knowledge of gene function (82) will be needed if these efforts are ultimately to improve health and reduce disease. Much work remains to be done, but early successes in genetic risk factor discovery through large-scale GWA studies appear finally to have unlocked the door to significant improvements in health and clinical care in common complex disease based on genomic knowledge.
The International HapMap Project has been an extensive international collaborative effort in which common objectives were agreed upon and pursued in a highly focused, cooperative approach. Our understanding of haplotype patterns in Homo sapiens continues to evolve, with additional populations being added, additional variants being identified through targeted and genome-wide sequencing, and cellular phenotypes being characterized in transformed and inexhaustible lymphoblastoid cell lines. Successful GWA studies are the most visible and exciting outcome of HapMap to date, with the large number of robust and highly replicated genetic associations with common diseases providing novel and unexpected insights into the pathophysiology of disease. The HapMap has also been invaluable in developing genotyping and analytic methods, expanding our understanding of evolutionary pressures and natural selection, defining genetic relationships across populations, and providing samples for validation of variation detection methods and standardization of laboratory processes. Application of these association findings is expected to produce new advances in the prevention and treatment of common diseases.
Note added in proof. Updated information on GWA studies and SNP associations is available online (139).
The authors express their sincere appreciation to Mia Diggs, Lucia Hindorff, Heather Junkins, Darryl Leja, and Lisa McNeil for assistance in preparation of the manuscript.
Nonstandard abbreviations used: GWA, genome-wide association; LD, linkage disequilibrium; OR, odds ratio.
Conflict of interest: The authors have declared that no conflict of interest exists.
Reference information: J. Clin. Invest. 118:1590–1605 (2008). doi:10.1172/JCI34772.