Epidemiologists have long faced a fundamental problem: how to distinguish genuine causes of disease from mere associations in observational data. Unlike randomized experiments, where treatment assignment is controlled, observational studies are plagued by confounding, selection bias, and measurement error. The history of causal inference in epidemiology is the story of how the field moved from informal heuristics to a suite of formal frameworks, each designed to make causal claims more rigorous and transparent.
Before the 1960s, epidemiologists relied on general scientific judgment to decide whether an association was causal. The turning point came in 1965, when Sir Austin Bradford Hill proposed a set of nine viewpoints—strength of association, consistency, specificity, temporality, biological gradient, plausibility, coherence, experiment, and analogy—to guide causal reasoning. The Bradford Hill Viewpoints were never intended as a checklist; Hill himself warned that none provided hard proof. Instead, they offered a shared vocabulary for debating evidence. For decades, these viewpoints served as the field's primary framework for causal assessment, especially in chronic disease epidemiology. Their strength lay in their flexibility, but their weakness was equally clear: they provided no formal definition of causation, no method for quantifying bias, and no way to distinguish direct from indirect effects. As the field encountered more complex questions—time-varying exposures, unmeasured confounders, and mediation pathways—the need for a more precise language became urgent.
The 1970s brought a conceptual breakthrough: the formal definition of causation in terms of counterfactuals. The Potential Outcomes Framework, introduced by Donald Rubin in 1974, defined the causal effect of a treatment on an individual as the difference between the outcome if treated and the outcome if untreated. Since only one of these potential outcomes is ever observed, the framework shifted the problem to estimating the average treatment effect in a population under assumptions of exchangeability, positivity, and consistency. This was a radical departure from heuristic reasoning: causation was now a mathematical quantity to be identified, not a judgment to be weighed.
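The gap between an individual-level definition and what the data reveal can be made concrete with a toy example. The sketch below assumes a hypothetical population of eight people for whom both potential outcomes are known, which never happens in practice; the names `PEOPLE`, `true_ate`, and `naive_diff` are illustrative only.

```python
# Hypothetical population of eight people. Each tuple holds
# (L, Y0, Y1, A): a prognostic factor, the outcome if untreated,
# the outcome if treated, and the treatment actually received.
# In real data only one of Y0 or Y1 is ever observed per person.
PEOPLE = [
    (0, 0, 0, 0),
    (0, 0, 0, 0),
    (0, 1, 0, 0),
    (0, 0, 0, 1),
    (1, 1, 1, 1),
    (1, 1, 0, 1),
    (1, 1, 1, 1),
    (1, 1, 1, 0),
]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Average treatment effect: needs both potential outcomes for everyone.
true_ate = mean(y1 - y0 for _, y0, y1, _ in PEOPLE)

# Consistency: the observed outcome equals the potential outcome
# under the treatment actually received.
observed = [(a, y1 if a == 1 else y0) for _, y0, y1, a in PEOPLE]

# Naive contrast: compare observed outcomes of treated and untreated people.
naive_diff = (mean(y for a, y in observed if a == 1)
              - mean(y for a, y in observed if a == 0))

print("true ATE:", true_ate)
print("naive difference:", naive_diff)
```

In this made-up population the treatment lowers risk by 25 percentage points, yet the naive contrast is zero, because treated people were drawn mostly from the high-risk group with L = 1. Exchangeability fails, so the observed difference does not identify the average treatment effect.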
Almost simultaneously, Kenneth Rothman proposed the Sufficient-Component Cause Model in 1976, often visualized as "causal pies." In this model, a disease arises when a sufficient set of component causes is present; each component is necessary only within that specific pie. Unlike the Potential Outcomes Framework, which is inherently probabilistic and focuses on average effects, Rothman's model is deterministic and emphasizes the interaction of multiple causes at the individual level. The two frameworks coexisted uneasily: the Potential Outcomes Framework excelled at defining effects and guiding estimation, while the Sufficient-Component Cause Model captured the intuition that causation is often multifactorial and that removing a single component can prevent disease. This tension between probabilistic and deterministic views of causation remains a live philosophical undercurrent in the field.
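The deterministic logic of causal pies is simple enough to write down directly. The following is a minimal sketch with two hypothetical sufficient causes; the component labels and the function names are assumptions made for illustration, not part of Rothman's presentation.

```python
# Hypothetical sufficient causes ("pies"); each is a set of component causes.
SUFFICIENT_CAUSES = [
    frozenset({"A", "B", "C"}),
    frozenset({"A", "D"}),
]


def develops_disease(components):
    """Disease occurs if every component of at least one pie is present."""
    present = set(components)
    return any(pie <= present for pie in SUFFICIENT_CAUSES)


def necessary_components():
    """Components that appear in every sufficient cause."""
    return frozenset.intersection(*SUFFICIENT_CAUSES)


print(develops_disease({"A", "B", "C"}))  # first pie is complete
print(develops_disease({"A", "B"}))       # removing C leaves no complete pie
print(develops_disease({"A", "B", "D"}))  # second pie is complete instead
print(sorted(necessary_components()))
```

Removing component C prevents disease for someone who carries only the first pie, but not for someone who also carries D. Only A, which appears in every pie here, is necessary in the strict sense.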
With a formal definition of causation in hand, the next challenge was to identify causal effects from observational data where treatment assignment is not random. In 1983, Paul Rosenbaum and Donald Rubin introduced Propensity Score Methods, showing that conditioning on the probability of receiving treatment given observed covariates balances those covariates between treatment and control groups, mimicking randomization with respect to what has been measured. Propensity scores became a workhorse for confounding adjustment in cross-sectional and cohort studies, especially when the number of covariates is large relative to the number of outcome events.
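One common way to use a propensity score is inverse probability weighting. The sketch below assumes a hypothetical cohort with a single binary confounder L and estimates the propensity score as the treated fraction within each level of L; applied work would typically fit a regression model over many covariates instead. All counts and names are illustrative.

```python
# Hypothetical cohort as (L, A, Y, count): confounder, treatment, outcome, people.
CELLS = [
    (0, 1, 1, 2), (0, 1, 0, 18),
    (0, 0, 1, 16), (0, 0, 0, 64),
    (1, 1, 1, 40), (1, 1, 0, 40),
    (1, 0, 1, 12), (1, 0, 0, 8),
]

N = sum(n for *_, n in CELLS)


def propensity(l):
    """Estimated P(A=1 | L=l): the treated fraction within stratum l."""
    treated = sum(n for li, a, _, n in CELLS if li == l and a == 1)
    total = sum(n for li, _, _, n in CELLS if li == l)
    return treated / total


def crude_risk(a):
    """Unadjusted risk P(Y=1 | A=a)."""
    events = sum(n for _, ai, y, n in CELLS if ai == a and y == 1)
    total = sum(n for _, ai, _, n in CELLS if ai == a)
    return events / total


def ipw_risk(a):
    """Weighted estimate of the risk had everyone received treatment a."""
    total = 0.0
    for l, ai, y, n in CELLS:
        if ai != a or y != 1:
            continue
        prob_received = propensity(l) if a == 1 else 1 - propensity(l)
        total += n / prob_received  # each event counts 1 / P(own treatment | L)
    return total / N


naive_rd = crude_risk(1) - crude_risk(0)
ipw_rd = ipw_risk(1) - ipw_risk(0)
print("crude risk difference:", round(naive_rd, 3))
print("IPW risk difference:", round(ipw_rd, 3))
```

In these made-up data the crude risk difference is positive (about 0.14) because treated people are concentrated in the high-risk stratum, while the weighted estimate is about -0.10, matching the risk difference within each stratum of L. The correction is only as good as the assumption that L captures all confounding.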
Propensity score methods, like all adjustment methods, assume that every confounder is measured, and in their standard form they address a treatment assigned at a single point in time. For longitudinal studies where exposures and confounders change over time, and where past treatment affects future confounders, standard adjustment can introduce bias: adjusting for a confounder that earlier treatment has already influenced blocks part of that treatment's effect and can open non-causal paths. Beginning in 1986, James Robins developed G-methods (G-computation, inverse probability weighting of marginal structural models, and G-estimation of structural nested models) specifically to handle time-varying confounding affected by prior treatment. G-methods extended the counterfactual logic to complex longitudinal data, a domain where propensity scores alone were insufficient. Today, both approaches remain active: propensity scores are simpler and widely used for point exposures, while G-methods are the standard for time-varying treatments.
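The g-computation idea can be sketched for two treatment times. The example assumes a hypothetical summary table with treatment A0, a later covariate L1 that is affected by A0, and a second treatment A1 that depends on L1. It further assumes that A0 is unconfounded and that A1 is unconfounded given A0 and L1. The numbers and function names are illustrative only.

```python
# Hypothetical summary of a two-period study, one row per history:
# (A0, L1, A1, number of people, mean outcome).
ROWS = [
    (0, 0, 0, 2400, 84.0),
    (0, 0, 1, 1600, 84.0),
    (0, 1, 0, 2400, 52.0),
    (0, 1, 1, 9600, 52.0),
    (1, 0, 0, 4800, 76.0),
    (1, 0, 1, 3200, 76.0),
    (1, 1, 0, 1600, 44.0),
    (1, 1, 1, 6400, 44.0),
]


def prob_l1(l1, a0):
    """P(L1 = l1 | A0 = a0), estimated from the counts."""
    matching = sum(n for a, l, _, n, _ in ROWS if a == a0 and l == l1)
    total = sum(n for a, _, _, n, _ in ROWS if a == a0)
    return matching / total


def mean_outcome(a0, l1, a1):
    """E[Y | A0 = a0, L1 = l1, A1 = a1], read off the table."""
    for a, l, b, _, y in ROWS:
        if (a, l, b) == (a0, l1, a1):
            return y
    raise KeyError((a0, l1, a1))


def g_formula(a0, a1):
    """Mean outcome under the strategy (a0, a1), standardized over L1."""
    return sum(mean_outcome(a0, l1, a1) * prob_l1(l1, a0) for l1 in (0, 1))


def crude_mean(a0, a1):
    """E[Y | A0 = a0, A1 = a1] with no adjustment."""
    rows = [(n, y) for a, _, b, n, y in ROWS if (a, b) == (a0, a1)]
    return sum(n * y for n, y in rows) / sum(n for n, _ in rows)


g_effect = g_formula(1, 1) - g_formula(0, 0)
crude_effect = crude_mean(1, 1) - crude_mean(0, 0)
print("g-formula, always vs never treated:", g_effect)
print("crude, always vs never treated:", round(crude_effect, 2))
```

With these made-up numbers the g-formula returns the same mean outcome (60) under every strategy, so the estimated effect of always versus never treating is zero, whereas the crude comparison suggests a difference of roughly -13.3. The key step is that L1 is averaged over its distribution given the earlier treatment rather than held fixed.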
Causal inference is not only about whether a treatment causes an outcome, but also how. In 1992, James Robins and Sander Greenland gave counterfactual definitions and identification conditions for direct and indirect effects, laying the groundwork for what Tyler VanderWeele and others later developed into Causal Mediation and Interaction Analysis. This framework decomposes a total effect into a part that operates through an intermediate variable (the mediator) and a part that does not. It requires careful assumptions about no unmeasured confounding of the exposure-outcome, exposure-mediator, and mediator-outcome relationships, and it introduced concepts like controlled direct effects and natural direct and indirect effects. Mediation analysis gave epidemiologists tools to study mechanisms, but it also revealed how sensitive conclusions are to assumptions about unmeasured confounders, a theme that recurs throughout the field.
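The decomposition can be written with nested counterfactuals. The notation here is an assumption introduced for illustration: for a binary exposure, $Y_{a m}$ is the outcome under exposure $a$ and mediator value $m$, and $M_a$ is the value the mediator would take under exposure $a$.

```latex
\begin{align*}
\text{TE} &= E[Y_{1 M_1}] - E[Y_{0 M_0}] \\
          &= \underbrace{E[Y_{1 M_1}] - E[Y_{1 M_0}]}_{\text{natural indirect effect}}
           + \underbrace{E[Y_{1 M_0}] - E[Y_{0 M_0}]}_{\text{natural direct effect}} \\
\text{CDE}(m) &= E[Y_{1 m}] - E[Y_{0 m}]
\end{align*}
```

This is one of two possible ways to split the total effect; the other holds exposure at 0 when defining the indirect effect. The controlled direct effect fixes the mediator at a chosen value $m$ for everyone, so it generally does not pair with an indirect effect that sums to the total.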
A major unification came with the adoption of Causal Diagrams and Directed Acyclic Graphs (DAGs) in epidemiology, popularized by Sander Greenland, Judea Pearl, and James Robins in 1999. DAGs provided a visual language for encoding causal assumptions: arrows represent direct causal effects, and the absence of an arrow represents the assumption of no direct effect. Using graphical criteria such as the back-door criterion and the front-door criterion, researchers could identify which variables to condition on to block confounding and which to avoid conditioning on to prevent collider bias. DAGs did not replace the Potential Outcomes Framework; instead, they gave it a graphical interface that made assumptions explicit and debatable. They became a common language bridging different methodological camps.
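Collider bias is easy to demonstrate by exact enumeration. The sketch assumes a hypothetical three-variable diagram X -> C <- Y in which X and Y are independent fair coin flips and C equals 1 whenever either is 1; the structure and names are illustrative.

```python
from itertools import product

# Hypothetical structure X -> C <- Y. Each cell is (x, y, c, probability).
CELLS = [(x, y, int(x or y), 0.25) for x, y in product((0, 1), repeat=2)]


def prob_y_given(x, c=None):
    """P(Y=1 | X=x), optionally also conditioning on C=c."""
    keep = [(yi, p) for xi, yi, ci, p in CELLS
            if xi == x and (c is None or ci == c)]
    return sum(p for yi, p in keep if yi == 1) / sum(p for _, p in keep)


marginal_diff = prob_y_given(1) - prob_y_given(0)
collider_diff = prob_y_given(1, c=1) - prob_y_given(0, c=1)
print("association ignoring C:", marginal_diff)
print("association within C = 1:", collider_diff)
```

X and Y are unassociated in the full population, but restricting to C = 1 induces a strong negative association even though neither variable affects the other. This is the pattern a DAG warns about when a common effect is adjusted for or used as a selection criterion.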
Building on DAGs, Judea Pearl's Structural Causal Models (SCMs), introduced around 2000, integrated graphical models with a formal calculus of interventions (the do-operator) and counterfactuals. SCMs offered a unified mathematical framework for reasoning about causation, including mediation, confounding, and instrumental variables. This sparked a productive rivalry with the Potential Outcomes tradition. The two camps agreed on many identification results but differed in emphasis: SCMs prioritized graphical and nonparametric identification, while the Potential Outcomes tradition focused on estimation strategies and the role of the assignment mechanism. Over time, the boundaries blurred—many researchers now draw on both traditions, using DAGs for identification and potential outcomes for estimation.
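The difference between observing and intervening can be sketched with a small structural causal model. The model below is hypothetical: three binary variables, structural equations chosen for illustration, and exogenous terms with assumed probabilities. An intervention do(X = x) is implemented by replacing the equation for X with a constant.

```python
from itertools import product

# Hypothetical structural causal model with binary variables:
#   Z := U_z                      (common cause of X and Y)
#   X := Z XOR U_x                (X usually follows Z)
#   Y := 1 if X + Z + U_y >= 2    (Y depends on both X and Z)
P_U = {"z": 0.5, "x": 0.2, "y": 0.5}  # P(U = 1) for each exogenous term


def bern(p, value):
    """Probability that a Bernoulli(p) variable takes `value`."""
    return p if value == 1 else 1 - p


def worlds(do_x=None):
    """Yield (x, y, probability) over all settings of the exogenous terms.

    Passing do_x replaces the structural equation for X with X := do_x.
    """
    for uz, ux, uy in product((0, 1), repeat=3):
        prob = bern(P_U["z"], uz) * bern(P_U["x"], ux) * bern(P_U["y"], uy)
        z = uz
        x = (z ^ ux) if do_x is None else do_x
        y = int(x + z + uy >= 2)
        yield x, y, prob


def p_y_given_x(x):
    """Observational P(Y=1 | X=x)."""
    rows = [(y, p) for xi, y, p in worlds() if xi == x]
    return sum(p for y, p in rows if y == 1) / sum(p for _, p in rows)


def p_y_do_x(x):
    """Interventional P(Y=1 | do(X=x))."""
    return sum(p for _, y, p in worlds(do_x=x) if y == 1)


observational_diff = p_y_given_x(1) - p_y_given_x(0)
interventional_diff = p_y_do_x(1) - p_y_do_x(0)
print("P(Y=1|X=1) - P(Y=1|X=0):", round(observational_diff, 3))
print("P(Y=1|do(X=1)) - P(Y=1|do(X=0)):", round(interventional_diff, 3))
```

In this toy model the observational contrast is about 0.8 while the interventional contrast is 0.5. The gap is the confounding contributed by Z, which the intervention removes by cutting the arrow from Z into X.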
Despite advances in adjustment, unmeasured confounding remains the Achilles' heel of observational studies. Instrumental Variable and Natural Experiment Methods, which entered epidemiology around 2000, offered a way to estimate causal effects even when confounders are unmeasured, provided a valid instrument exists: a variable that affects the exposure, affects the outcome only through the exposure, and is not confounded with the outcome. Natural experiments—such as policy changes, weather shocks, or genetic inheritance—provide instruments that mimic randomization. These methods expanded the epidemiologist's toolkit beyond adjustment, but they come with strong assumptions (relevance, exclusion restriction, exchangeability) that are often difficult to verify.
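For a binary instrument, the simplest instrumental variable estimator is the Wald ratio: the instrument's effect on the outcome divided by its effect on the exposure. The sketch below assumes a hypothetical population with an unmeasured confounder U, an instrument Z that is independent of U, and a constant effect of the exposure on the outcome. All coefficients and names are made up for illustration.

```python
from itertools import product

# Hypothetical population: Z is a binary instrument, U an unmeasured
# confounder, A the exposure. Z and U are independent fair coin flips.
TRUE_EFFECT = 2.0


def p_exposed(z, u):
    """P(A=1 | Z=z, U=u): both the instrument and U raise exposure."""
    return 0.2 + 0.4 * z + 0.3 * u


def mean_y(a, u):
    """E[Y | A=a, U=u]. Z is absent: it affects Y only through A."""
    return 1.0 + TRUE_EFFECT * a + 3.0 * u


# Enumerate every (z, u, a) cell with its probability and mean outcome.
CELLS = []
for z, u, a in product((0, 1), repeat=3):
    p_a = p_exposed(z, u) if a == 1 else 1 - p_exposed(z, u)
    CELLS.append({"z": z, "a": a, "p": 0.25 * p_a, "y": mean_y(a, u)})


def cond_mean(field, given, value):
    """Mean of `field` among cells where `given` equals `value`."""
    rows = [c for c in CELLS if c[given] == value]
    return sum(c["p"] * c[field] for c in rows) / sum(c["p"] for c in rows)


naive = cond_mean("y", "a", 1) - cond_mean("y", "a", 0)
wald = ((cond_mean("y", "z", 1) - cond_mean("y", "z", 0))
        / (cond_mean("a", "z", 1) - cond_mean("a", "z", 0)))
print("naive exposed vs unexposed:", round(naive, 3))
print("Wald IV estimate:", round(wald, 3))
```

Here the naive contrast overstates the effect (about 2.91 against a true value of 2.0) because U raises both exposure and outcome, while the Wald ratio recovers 2.0 without ever using U. That recovery depends on the three instrument conditions holding by construction, and on the effect being the same for everyone.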
A particularly influential application of instrumental variable logic is Mendelian Randomization, proposed in 2003 by George Davey Smith and Shah Ebrahim. Mendelian randomization uses genetic variants as instruments for modifiable exposures, exploiting the random assortment of genes at conception to mimic a randomized trial. It is a specialized case of instrumental variable analysis, but its reliance on genetic data introduced new challenges—population stratification, pleiotropy, and the need for large sample sizes. Mendelian randomization has become a major subfield in its own right, especially for studying the causal effects of biomarkers, lifestyle factors, and other exposures that are difficult to randomize.
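With summary statistics from genetic association studies, a basic Mendelian randomization estimate takes the ratio of each variant's association with the outcome to its association with the exposure, then pools the ratios with inverse-variance weights. The sketch below uses made-up numbers for three variants and assumes the variants are independent, are valid instruments, and have exposure associations estimated without error.

```python
# Hypothetical per-variant summary statistics:
# (effect on exposure, effect on outcome, standard error of outcome effect).
VARIANTS = [
    (0.10, 0.050, 0.010),
    (0.20, 0.110, 0.020),
    (0.05, 0.020, 0.010),
]


def wald_ratio(beta_x, beta_y):
    """Single-variant estimate: outcome association over exposure association."""
    return beta_y / beta_x


def ivw_estimate(variants):
    """Inverse-variance weighted combination of the per-variant ratios."""
    numerator = sum(bx * by / se ** 2 for bx, by, se in variants)
    denominator = sum(bx ** 2 / se ** 2 for bx, _, se in variants)
    return numerator / denominator


ratios = [wald_ratio(bx, by) for bx, by, _ in VARIANTS]
ivw = ivw_estimate(VARIANTS)
print("per-variant ratios:", [round(r, 3) for r in ratios])
print("IVW estimate:", round(ivw, 3))
```

Variants with stronger exposure associations and more precise outcome associations receive more weight. Disagreement among the per-variant ratios, as in this toy input, is one informal signal of possible pleiotropy.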
As datasets grew larger and more complex, traditional parametric models became limiting. Targeted Learning, developed by Mark van der Laan and colleagues from the mid-2000s onward, provided a framework that combines machine learning with causal inference. Its core idea is to target the estimation of a specific causal parameter (e.g., the average treatment effect) rather than modeling the entire data-generating distribution. Targeted maximum likelihood estimation (TMLE) uses flexible machine learning algorithms for the nuisance functions (propensity scores and outcome regressions) and then updates the initial outcome fit in a targeting step, which preserves asymptotic properties like double robustness and efficiency. Targeted Learning represents a shift from model-based to algorithm-based estimation, and it has been applied in HIV research, vaccine trials, and environmental epidemiology.
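Double robustness can be illustrated with a simpler relative of TMLE, the augmented inverse probability weighted (AIPW) estimator. The sketch below is not TMLE itself: it has no targeting step and uses hand-specified nuisance functions on a hypothetical cohort rather than machine learning. It shows that the estimate stays on target when either the propensity model or the outcome model is wrong, but not when both are.

```python
# Hypothetical cohort as (L, A, Y, count), with one binary confounder L.
CELLS = [
    (0, 1, 1, 2), (0, 1, 0, 18), (0, 0, 1, 16), (0, 0, 0, 64),
    (1, 1, 1, 40), (1, 1, 0, 40), (1, 0, 1, 12), (1, 0, 0, 8),
]
N = sum(n for *_, n in CELLS)


def fraction(numer, denom):
    """Share of people satisfying `numer` among those satisfying `denom`."""
    return (sum(n for l, a, y, n in CELLS if denom(l, a, y) and numer(l, a, y))
            / sum(n for l, a, y, n in CELLS if denom(l, a, y)))


# Correctly specified nuisance functions (saturated in L).
def e_good(l):
    return fraction(lambda li, a, y: a == 1, lambda li, a, y: li == l)


def m_good(a, l):
    return fraction(lambda li, ai, y: y == 1,
                    lambda li, ai, y: li == l and ai == a)


# Deliberately misspecified versions that ignore L.
def e_bad(l):
    return 0.5


def m_bad(a, l):
    return fraction(lambda li, ai, y: y == 1, lambda li, ai, y: True)


def aipw(e, m):
    """Augmented IPW estimate of the average treatment effect."""
    total = 0.0
    for l, a, y, n in CELLS:
        correction = (a * (y - m(1, l)) / e(l)
                      - (1 - a) * (y - m(0, l)) / (1 - e(l)))
        total += n * (m(1, l) - m(0, l) + correction)
    return total / N


for label, e, m in [("both right", e_good, m_good),
                    ("outcome model wrong", e_good, m_bad),
                    ("propensity model wrong", e_bad, m_good),
                    ("both wrong", e_bad, m_bad)]:
    print(label, round(aipw(e, m), 3))
```

In these made-up data the first three configurations all return about -0.10, the risk difference within each stratum of L, while the last returns about 0.14, the confounded crude difference.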
The most recent major framework, Target Trial Emulation, was formalized by Miguel Hernán and James Robins in 2016. It addresses a practical problem: how to design an observational study so that it mimics a randomized trial as closely as possible. The idea is to specify the protocol of a hypothetical target trial—including eligibility criteria, treatment strategies, assignment procedures, follow-up, and outcome definitions—and then emulate each component using observational data. Target Trial Emulation does not replace earlier frameworks; rather, it integrates them. It uses the Potential Outcomes Framework to define the causal question, DAGs to identify confounders, propensity scores or G-methods to adjust for confounding, and instrumental variable methods when unmeasured confounding is suspected. By forcing researchers to articulate their assumptions explicitly, it reduces the risk of ad hoc analyses and increases the credibility of causal claims from observational data.
Today, causal inference in epidemiology is a pluralistic field. The leading frameworks—Potential Outcomes, Structural Causal Models, G-methods, Targeted Learning, and Target Trial Emulation—are not in competition but serve different roles. They agree on the fundamental logic of counterfactual causation and the need for explicit identification assumptions. They disagree on which tools are most practical: SCM advocates emphasize graphical identification and nonparametric reasoning; Potential Outcomes advocates emphasize estimation and the assignment mechanism; Targeted Learning advocates emphasize flexible estimation and machine learning; and Target Trial Emulation advocates emphasize design transparency. This diversity is a strength: each framework brings a different lens, and the best research often combines insights from several. The central challenge—distinguishing causation from association in imperfect observational data—remains, but the field now has a rich and rigorous set of tools to meet it.