Genetic epidemiology sits at the intersection of genetics and epidemiology, asking how inherited variation and environmental exposures jointly shape disease patterns in populations. Its central challenge has been to move beyond observing that diseases run in families to identifying the specific genes, estimating their effects, distinguishing causation from correlation, and ultimately predicting individual risk. Over the past half century, the field has passed through five distinct frameworks, each driven by different technologies, study designs, and analytic ambitions. Early family-based methods gave way to molecular linkage, which was then transformed by genome-wide association studies. In parallel, a separate causal-inference tradition emerged, and more recently, polygenic risk scores have tried to translate discovery into prediction. These frameworks now coexist, each addressing a different facet of the same problem.
Before molecular markers existed, genetic epidemiologists relied on patterns of disease occurrence within families to infer whether a genetic component existed and how it might be transmitted. Familial Aggregation and Segregation Analysis (roughly 1970–1990) formalized this reasoning. Familial aggregation studies used case-control or cohort designs to ask whether relatives of affected individuals were more likely to develop the disease than relatives of unaffected individuals, often summarized by a recurrence risk ratio: the risk in relatives of cases divided by the risk in a comparison group. Segregation analysis went further by fitting statistical models, typically likelihood-based, to pedigree data to test whether the observed distribution of disease was consistent with a particular mode of inheritance (dominant, recessive, additive) and to estimate parameters such as penetrance and allele frequency. This framework could not pinpoint the responsible gene, but it provided crucial evidence that a genetic component existed and that familial clustering was not purely environmental. It also laid the groundwork for the next framework by demonstrating that many common diseases showed familial patterns that did not follow simple Mendelian inheritance, hinting at a more complex genetic architecture.
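The recurrence risk ratio described above can be illustrated with a minimal sketch. The function name and all counts are hypothetical, and the denominator here is the risk among relatives of unaffected individuals; many studies use population prevalence instead.

```python
def recurrence_risk_ratio(affected_case_rel, total_case_rel,
                          affected_control_rel, total_control_rel):
    """Risk in relatives of cases divided by risk in relatives of controls.

    All arguments are counts of relatives; the numbers used below are
    made up for illustration only.
    """
    if total_case_rel <= 0 or total_control_rel <= 0:
        raise ValueError("relative counts must be positive")
    if affected_control_rel <= 0:
        raise ValueError("comparison risk must be non-zero")
    risk_case_rel = affected_case_rel / total_case_rel
    risk_control_rel = affected_control_rel / total_control_rel
    return risk_case_rel / risk_control_rel


# Hypothetical study: 30 of 500 siblings of cases are affected,
# versus 5 of 500 siblings of controls.
lambda_sib = recurrence_risk_ratio(30, 500, 5, 500)
print(f"sibling recurrence risk ratio: {lambda_sib:.1f}")
```

A ratio well above 1 indicates familial aggregation, but on its own it cannot separate shared genes from shared environment, which is the gap segregation analysis tried to fill.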
Linkage Analysis (1980–2005) exploited the availability of polymorphic DNA markers, first restriction fragment length polymorphisms and then microsatellites, to track the co-inheritance of marker alleles and disease within pedigrees. The core logic is that a marker physically close to a disease-causing variant will be inherited together with it more often than expected by chance. In parametric linkage analysis, a LOD score (logarithm of the odds) is the base-10 logarithm of the ratio between the likelihood of the pedigree data at a given recombination fraction and the likelihood under no linkage (a recombination fraction of 0.5). The method was spectacularly successful for rare, highly penetrant Mendelian disorders: it led to the identification of the genes for cystic fibrosis (1989), Huntington disease (1993), and BRCA1-linked breast cancer (1994). Yet as investigators turned to common diseases such as type 2 diabetes, hypertension, and schizophrenia, linkage repeatedly failed to produce robust signals. The reason became clear: for complex traits influenced by many small-effect variants, the co-inheritance signal within families is too weak to detect reliably. Linkage analysis was not abandoned, and it remains the method of choice for mapping rare Mendelian mutations, but it could not handle the genetic architecture that common diseases turned out to have.
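As a rough illustration of the LOD score, the sketch below covers only the simplest textbook case: phase-known, fully informative meioses in which each offspring can be scored as recombinant or non-recombinant. Real pedigree analysis is considerably more involved. The function name and the counts are hypothetical.

```python
import math


def lod_score(recombinants, meioses, theta):
    """LOD at recombination fraction theta for phase-known meioses.

    LOD(theta) = log10[ theta^r * (1 - theta)^(n - r) / 0.5^n ]
    where r is the number of recombinants among n informative meioses.
    """
    if not 0 <= recombinants <= meioses:
        raise ValueError("recombinants must be between 0 and meioses")
    if not 0 < theta <= 0.5:
        raise ValueError("theta must be in (0, 0.5]")
    non_recombinants = meioses - recombinants
    return (recombinants * math.log10(theta)
            + non_recombinants * math.log10(1 - theta)
            - meioses * math.log10(0.5))


# Hypothetical data: 2 recombinants observed in 20 informative meioses.
# Scan a grid of theta values and keep the one with the highest LOD.
thetas = [i / 100 for i in range(1, 51)]
best_theta = max(thetas, key=lambda t: lod_score(2, 20, t))
best_lod = lod_score(2, 20, best_theta)
print(f"max LOD = {best_lod:.2f} at theta = {best_theta:.2f}")
```

By a widely used convention, a maximum LOD of 3 or more is taken as evidence for linkage. At theta = 0.5 the score is zero by construction, because the numerator and denominator likelihoods coincide.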
The limitations of linkage for complex traits prompted a shift from family-based to population-based designs. Genome-Wide Association Studies (GWAS, 2005–present) compare the frequencies of hundreds of thousands or millions of single-nucleotide polymorphisms (SNPs) between unrelated cases and controls. The key assumption, known as the common-disease/common-variant hypothesis, is that genetic susceptibility to common diseases is largely due to variants that are relatively frequent (minor allele frequency >5%). GWAS does not require pedigrees; instead it relies on linkage disequilibrium, the non-random association of nearby SNPs, to tag causal variants indirectly. The first successful GWAS (for age-related macular degeneration, published in 2005) demonstrated that the approach could find robust associations. Rapid advances in genotyping technology and the formation of large international consortia soon led to hundreds of loci for dozens of diseases, now curated in resources such as the GWAS Catalog. However, the individual effect sizes were typically very small (odds ratios of 1.1–1.3), and the total heritability explained by all discovered loci fell far short of estimates from family studies, a puzzle dubbed 'missing heritability.' GWAS transformed the field from a hypothesis-driven to a data-driven discovery enterprise, but it also left a clear gap: association is not causation. A SNP may be associated with disease because it directly influences the phenotype (causal), because it is correlated with the causal variant (tagging), or because of confounding, for example by population stratification.
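The case-control comparison at a single SNP can be sketched as an allelic chi-square test on a 2x2 table of allele counts. This is a deliberately simplified illustration: real GWAS pipelines typically use regression with covariates such as ancestry principal components. The function name and the counts are hypothetical.

```python
import math


def allelic_association(case_alt, case_ref, control_alt, control_ref):
    """Odds ratio, 1-df chi-square statistic, and p-value for one SNP.

    Inputs are allele counts (two alleles per person), not genotype counts.
    """
    a, b, c, d = case_alt, case_ref, control_alt, control_ref
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive in this sketch")
    n = a + b + c + d
    odds_ratio = (a * d) / (b * c)
    # Pearson chi-square for a 2x2 table, without continuity correction.
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of a chi-square with 1 df: P(|Z| > sqrt(chi2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return odds_ratio, chi2, p_value


# Hypothetical SNP: 2,000 cases and 2,000 controls (4,000 alleles each).
odds_ratio, chi2, p_value = allelic_association(1200, 2800, 1000, 3000)
print(f"OR = {odds_ratio:.2f}, chi2 = {chi2:.1f}, p = {p_value:.1e}")

# Conventional genome-wide significance threshold, which accounts for
# roughly a million independent tests.
GENOME_WIDE_ALPHA = 5e-8
print("genome-wide significant:", p_value < GENOME_WIDE_ALPHA)
```

With these made-up counts the odds ratio falls in the modest range the text describes, and the association misses the conventional genome-wide threshold, which illustrates why very large samples are needed to detect small effects.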
Mendelian Randomization (MR, 2003–present) emerged not from the discovery-oriented genetics tradition but from the causal inference tradition within epidemiology. Its intellectual roots lie in instrumental variable analysis. The idea is to use a genetic variant as an instrument for an exposure of interest (e.g., low-density-lipoprotein cholesterol as a risk factor for coronary heart disease). Because alleles are allocated at conception according to Mendel's laws, genetic associations are not subject to many of the confounders that plague observational studies (e.g., socioeconomic status, lifestyle). If a genetic variant that influences the exposure is robustly associated with the outcome, that provides evidence that the exposure is causally related to the outcome, under three key assumptions: relevance (the variant is associated with the exposure), independence (the variant shares no common cause with the outcome), and exclusion restriction (the variant affects the outcome only through the exposure). The first widely cited MR studies appeared around 2003, before GWAS became dominant; indeed, MR initially used candidate gene variants. After GWAS began providing many reliable SNP–exposure associations, MR gained new power as a post-GWAS causal inference tool. MR and GWAS thus coexist with complementary goals: GWAS discovers associations, MR tests whether those associations reflect causal pathways. The two frameworks are not opposed, and they often appear in the same paper, but they originate from different intellectual traditions (genetic discovery vs. epidemiologic causal inference) and appeal to different audiences (geneticists vs. epidemiologists). A major challenge for MR is pleiotropy: a variant may affect the outcome through pathways other than the exposure of interest, which violates the exclusion restriction. Methods such as MR-Egger and weighted median estimators attempt to give valid estimates even when some instruments are pleiotropic, but the debate over pleiotropy remains active.
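The instrumental-variable logic can be made concrete with a small sketch using summary statistics. It computes per-variant Wald ratios (SNP–outcome effect divided by SNP–exposure effect) and combines them with a fixed-effect inverse-variance-weighted (IVW) estimate using first-order weights. This is one common estimator, not the pleiotropy-robust methods named above, and all effect sizes below are invented for illustration.

```python
import math


def wald_ratios(beta_exposure, beta_outcome):
    """Per-variant causal estimates: outcome effect / exposure effect."""
    return [by / bx for bx, by in zip(beta_exposure, beta_outcome)]


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW estimate and its standard error.

    Uses first-order weights, i.e. it ignores uncertainty in the
    SNP-exposure effects, and assumes all instruments are valid.
    """
    if not (len(beta_exposure) == len(beta_outcome) == len(se_outcome)):
        raise ValueError("inputs must have equal length")
    numerator = sum(bx * by / se ** 2
                    for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome))
    denominator = sum(bx ** 2 / se ** 2
                      for bx, se in zip(beta_exposure, se_outcome))
    return numerator / denominator, math.sqrt(1 / denominator)


# Hypothetical summary statistics for three independent variants.
bx = [0.10, 0.20, 0.15]     # SNP -> exposure effects
by = [0.05, 0.11, 0.07]     # SNP -> outcome effects
se_y = [0.01, 0.02, 0.015]  # standard errors of the SNP -> outcome effects

ratios = wald_ratios(bx, by)
estimate, std_err = ivw_estimate(bx, by, se_y)
print("Wald ratios:", [round(r, 3) for r in ratios])
print(f"IVW estimate = {estimate:.3f} (SE {std_err:.3f})")
```

If the individual ratios disagree much more than their standard errors would suggest, that heterogeneity is one warning sign of pleiotropy.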
Polygenic Risk Scores (PRS, 2009–present) are a direct translational extension of GWAS. The idea is to combine the effects of many SNPs across the genome—typically weighted by their GWAS effect sizes—into a single numeric score that predicts an individual's genetic liability to a disease. PRS does not identify new genes; it aggregates the cumulative effect of many small associations. The first demonstration (for schizophrenia, 2009) showed that a score derived from a large GWAS could significantly discriminate cases from controls, although the predictive power was modest. Since then, PRS has been developed for dozens of traits, and its accuracy improves as GWAS sample sizes grow. PRS depends entirely on the summary statistics from GWAS and therefore inherits all of GWAS's limitations: it captures only common variants, its predictive power is constrained by the heritability explained, and it is highly sensitive to the ancestry of the discovery sample. A PRS derived from European GWAS performs poorly in non-European populations, raising ethical and equity concerns. PRS is increasingly being explored for clinical risk stratification, early screening, and lifestyle recommendations, but there is active debate about whether current PRS have sufficient accuracy to be useful in routine care. The framework has also revived interest in familial aggregation: because PRS captures the polygenic component of family risk, it partly explains why relatives of patients are at higher risk.
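The weighted sum at the heart of a PRS can be sketched in a few lines. The SNP identifiers, weights, and genotypes below are hypothetical; real scores involve thousands to millions of variants plus steps such as linkage-disequilibrium pruning or shrinkage that are omitted here.

```python
def polygenic_score(dosages, weights):
    """Sum of (effect-allele dosage x GWAS weight) over the scored SNPs.

    dosages: SNP id -> number of effect alleles carried (0, 1, or 2).
    weights: SNP id -> per-allele effect size, e.g. a log odds ratio.
    """
    missing = [snp for snp in weights if snp not in dosages]
    if missing:
        raise ValueError(f"genotype missing for: {missing}")
    return sum(weights[snp] * dosages[snp] for snp in weights)


# Hypothetical GWAS weights (log odds ratios) for three SNPs.
gwas_weights = {"snp_a": 0.10, "snp_b": -0.05, "snp_c": 0.20}

# Hypothetical genotypes for two individuals.
person_1 = {"snp_a": 2, "snp_b": 1, "snp_c": 0}
person_2 = {"snp_a": 0, "snp_b": 2, "snp_c": 2}

score_1 = polygenic_score(person_1, gwas_weights)
score_2 = polygenic_score(person_2, gwas_weights)
print(f"person 1: {score_1:.2f}, person 2: {score_2:.2f}")
```

A raw score is only meaningful relative to a reference distribution, so in practice scores are standardized within a population. This is one place the ancestry dependence noted above enters, since both the weights and the reference distribution come from a particular sample.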
Today, GWAS, MR, and PRS are the dominant frameworks, each with a distinct role. GWAS continues to expand in scale (now with biobanks such as UK Biobank and FinnGen) and has recently begun to incorporate exome and whole-genome sequencing, challenging the common-disease/common-variant assumption. MR provides a causal inference toolkit that is widely used to triangulate evidence from observational studies and randomized trials. PRS is the closest genetic epidemiology has come to clinical prediction, but its utility is constrained by ancestry bias and modest predictive accuracy. Researchers across the frameworks agree that most common diseases have a substantial polygenic component, but they disagree on the relative importance of rare variants: the 'common disease, rare variant' hypothesis holds that a significant portion of missing heritability lies in rare (MAF < 0.5%) variants with larger effects, which current GWAS arrays do not capture well. Whole-exome and whole-genome sequencing studies are now testing this hypothesis. There is also disagreement about whether MR findings can be trusted in the face of widespread pleiotropy, and whether PRS should be deployed clinically before their validity across diverse ancestries is established. These debates do not threaten the frameworks but push them to refine their methods and integrate data across study designs. The coexistence of discovery, causal, and predictive frameworks reflects the maturity of a field that has learned to use multiple tools to answer an inherently multi-part question: what is the inherited architecture of disease, what causes it, and can we predict it?