Every empirical study in economics that asks whether a policy, program, or event caused an outcome faces the same fundamental problem: the researcher can never observe what would have happened to the same unit in the same situation without the treatment. This missing counterfactual is the central tension of causal inference. Over the past six decades, econometricians have developed a family of design-based frameworks that address this problem by exploiting different sources of variation in observational data. Each framework makes a distinct set of assumptions about how the missing counterfactual can be reconstructed, and each has its own strengths and limitations.
In 1974, Donald Rubin introduced a way of thinking about causal questions that would eventually become the conceptual backbone of the entire design-based tradition. The Potential Outcomes Framework reframes every causal claim as a comparison between two hypothetical states: the outcome if a unit receives treatment and the outcome if the same unit does not. Since only one of these is ever observed, the framework makes explicit what must be assumed to treat the observed comparison as a valid estimate of the causal effect. The key assumptions—stable unit treatment value (SUTVA), which rules out interference between units, and ignorability or unconfoundedness, which requires that treatment assignment be independent of potential outcomes conditional on observed covariates—are not always plausible, but stating them forces researchers to be transparent about what their identification strategy requires. Before this framework, causal language in econometrics was often embedded in structural equation models without a clear separation between the statistical assumptions and the economic theory. The Potential Outcomes Framework did not replace structural modeling; instead, it provided a notation and a set of logical criteria that made it possible to compare different identification strategies on common ground.
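The comparison described above can be written compactly. The notation here is an assumption added for illustration and does not appear in the text: $D_i$ is a binary treatment indicator, and $Y_i(1)$ and $Y_i(0)$ are unit $i$'s potential outcomes with and without treatment. The third line shows why a raw treated-versus-untreated comparison is not, in general, a causal effect: it mixes the effect on the treated with a selection term that vanishes only under an assumption such as ignorability.

```latex
\begin{align*}
Y_i &= D_i\,Y_i(1) + (1 - D_i)\,Y_i(0), \\
\tau_{\mathrm{ATE}} &= \mathbb{E}[Y_i(1) - Y_i(0)], \\
\mathbb{E}[Y_i \mid D_i = 1] - \mathbb{E}[Y_i \mid D_i = 0]
  &= \underbrace{\mathbb{E}[Y_i(1) - Y_i(0) \mid D_i = 1]}_{\text{effect on the treated}}
   + \underbrace{\mathbb{E}[Y_i(0) \mid D_i = 1] - \mathbb{E}[Y_i(0) \mid D_i = 0]}_{\text{selection bias}}.
\end{align*}
```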
Long before the Potential Outcomes Framework was formalized, a different logic was already in use. In 1960, Donald Thistlethwaite and Donald Campbell proposed Regression Discontinuity Design (RDD) as a method for evaluating programs where treatment is assigned by a cutoff score on a continuous variable. For example, students who score above a certain threshold on a test receive a scholarship; those just below do not. The core insight is that units very close to the cutoff are essentially comparable, so the discontinuity in outcomes at the threshold can be interpreted as the causal effect of the treatment. RDD was initially developed in psychology and education research, but it was later reinterpreted through the lens of potential outcomes: near the threshold the cutoff acts like a local randomization, and the identifying assumption is that the relationship between the assignment variable and the potential outcomes is smooth around the threshold, an assumption that fails if units can precisely manipulate their score to land on one side. Compared to strategies that rest on unconfoundedness, which require conditioning on all confounders, RDD exploits a known institutional rule to generate quasi-random variation. This makes RDD one of the most credible designs in the causal inference toolkit, though its results generalize only to units near the cutoff.
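The cutoff comparison can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a recommended estimator: the helper names `fit_line` and `sharp_rdd`, the bandwidth of 10 points, and the noise-free scholarship data are all made up here. It fits a separate straight line on each side of the cutoff and reads the effect off as the gap between the two fitted values at the threshold.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def sharp_rdd(running, outcome, cutoff, bandwidth):
    """Gap between the two fitted lines at the cutoff (sharp design).

    The running variable is centered at the cutoff, so each intercept
    is the fitted outcome exactly at the threshold.
    """
    left = [(r - cutoff, y) for r, y in zip(running, outcome)
            if cutoff - bandwidth <= r < cutoff]
    right = [(r - cutoff, y) for r, y in zip(running, outcome)
             if cutoff <= r <= cutoff + bandwidth]
    intercept_left, _ = fit_line(*zip(*left))
    intercept_right, _ = fit_line(*zip(*right))
    return intercept_right - intercept_left


# Hypothetical data: test scores 40..60, scholarship awarded at 50 or above.
# The outcome rises smoothly with the score and jumps by 4 at the cutoff.
scores = list(range(40, 61))
outcomes = [10 + 0.5 * s + (4 if s >= 50 else 0) for s in scores]

rdd_effect = sharp_rdd(scores, outcomes, cutoff=50, bandwidth=10)
print(round(rdd_effect, 6))
```

Because the made-up data contain no noise, the sketch recovers the built-in jump; with real data the choice of bandwidth and functional form matters a great deal, and the estimate applies only to units near the cutoff.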
When no cutoff or instrument is available, researchers must rely on the assumption that all confounders are measured. This is the logic behind Selection-on-Observables and Matching, which took a major step forward with Paul Rosenbaum and Donald Rubin's 1983 paper on the propensity score. The propensity score—the probability of receiving treatment given observed covariates—allows researchers to match treated and untreated units on a single scalar rather than on many covariates individually. The identifying assumption is that treatment assignment is ignorable given the covariates, meaning that within cells defined by the propensity score, the treatment is as good as randomly assigned. This framework differs from RDD in the type of variation it exploits: RDD uses a known discontinuity, while matching relies on the richness of the covariate set to eliminate confounding. The cost is that any unmeasured confounder can invalidate the results. Matching coexists with RDD as a complementary strategy; when a credible cutoff exists, RDD is usually preferred, but when no such rule is available, matching offers a systematic way to adjust for observable differences.
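A small Python sketch may make the role of the propensity score concrete. Everything in it is an illustrative assumption: the function names, the single binary covariate, and the sixteen made-up records. It estimates the propensity score as the treated share within each covariate cell, groups units by that score, and averages the within-group treated-minus-untreated gaps. It uses stratification on the score rather than one-to-one matching, which is a simpler relative of the same idea.

```python
from collections import defaultdict


def propensity_scores(rows):
    """Estimate e(x) = P(D = 1 | X = x) as the treated share in each cell."""
    counts = defaultdict(lambda: [0, 0])  # x -> [number treated, number total]
    for x, d, _ in rows:
        counts[x][0] += d
        counts[x][1] += 1
    return {x: treated / total for x, (treated, total) in counts.items()}


def stratified_ate(rows):
    """Size-weighted average of treated-minus-untreated gaps within score strata.

    Requires overlap: every stratum must contain both treated and
    untreated units, otherwise a group mean is undefined.
    """
    scores = propensity_scores(rows)
    strata = defaultdict(lambda: {0: [], 1: []})
    for x, d, y in rows:
        strata[scores[x]][d].append(y)
    n = len(rows)
    ate = 0.0
    for groups in strata.values():
        gap = sum(groups[1]) / len(groups[1]) - sum(groups[0]) / len(groups[0])
        ate += gap * (len(groups[0]) + len(groups[1])) / n
    return ate


# Hypothetical records (x, d, y): outcome is 1 + 3x + 2d, so the true effect is 2.
# Units with x = 1 are treated more often, which confounds the raw comparison.
rows = ([(0, 1, 3.0)] * 2 + [(0, 0, 1.0)] * 6 +
        [(1, 1, 6.0)] * 6 + [(1, 0, 4.0)] * 2)

treated = [y for _, d, y in rows if d == 1]
untreated = [y for _, d, y in rows if d == 0]
naive_gap = sum(treated) / len(treated) - sum(untreated) / len(untreated)
adjusted = stratified_ate(rows)
print(naive_gap, adjusted)
```

In this constructed example the raw gap overstates the effect because treated units have higher values of the covariate, while the stratified estimate recovers the built-in effect. The adjustment works only because the confounder is observed; an unmeasured one would bias both numbers.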
A different approach to unobserved confounders emerged in the mid-1980s. In 1985, Orley Ashenfelter and David Card used longitudinal data on earnings to evaluate a training program by comparing the change in earnings for trainees to the change for a comparison group. This Difference-in-Differences (DiD) design builds on the logic of selection on observables but relaxes a key assumption: instead of requiring that all confounders be measured, DiD tolerates unobserved confounders as long as they are either fixed over time within each group, and so difference out, or are common shocks that move both groups by the same amount. The identifying assumption is the parallel-trends condition: in the absence of treatment, the average outcomes for the treated and comparison groups would have followed the same path over time. DiD thus exploits temporal variation that matching alone cannot capture. The framework was soon extended to natural experiments, events that assign treatment in a way that mimics randomization, such as policy changes or natural disasters. David Card's 1990 study of the Mariel Boatlift, which examined the effect of a sudden influx of Cuban immigrants on the Miami labor market, became a landmark example of the natural-experiment approach. DiD remains one of the most widely used causal designs in applied economics, though its credibility depends heavily on the plausibility of the parallel-trends assumption, which has been the subject of intense methodological scrutiny in recent years.
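The two-group, two-period version of this comparison reduces to simple arithmetic, sketched below in Python. The function name `did_estimate` and the earnings figures are hypothetical and chosen only so the calculation is easy to follow; they are not taken from the 1985 study.

```python
def mean(values):
    return sum(values) / len(values)


def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Change in the treated group minus change in the comparison group."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical earnings (in thousands). The groups start at different levels,
# which is allowed: only the trends need to be parallel absent treatment.
trainees_before = [10, 12, 14]
trainees_after = [15, 17, 19]
comparison_before = [20, 22, 24]
comparison_after = [22, 24, 26]

did_effect = did_estimate(trainees_before, trainees_after,
                          comparison_before, comparison_after)
print(did_effect)  # trainees gain 5, comparison group gains 2, so 3.0
```

The comparison group's change of 2 stands in for what would have happened to trainees without the program. If that stand-in is wrong, that is, if trends would not have been parallel, the estimate inherits the error one for one.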
Instrumental variables (IV) have a long history in econometrics, but their interpretation was transformed in the 1990s. In 1994, Guido Imbens and Joshua Angrist introduced the Local Average Treatment Effect (LATE) framework, which reinterprets the IV estimand as the average treatment effect for the subpopulation of compliers, that is, those whose treatment status is changed by the instrument. This was a departure from earlier practice, which often assumed that the IV estimand captured the average treatment effect for the entire population. The LATE framework made explicit that IV estimates are local to the compliers and do not necessarily generalize to never-takers or always-takers. This narrowing of the interpretation was not a rejection of IV but a clarification of what IV actually identifies under heterogeneous treatment effects. The LATE framework also clarified the assumptions required for IV: the instrument must be randomly assigned or as good as random, it must affect the treatment, it must affect the outcome only through the treatment (exclusion restriction), and there must be no defiers (monotonicity). Compared to DiD, which exploits time variation, IV exploits variation in an external instrument that shifts treatment take-up. The LATE framework has become the standard way to teach and apply IV in causal inference, and it remains an active area of research, particularly regarding the interpretation and external validity of LATE estimates.
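With a binary instrument and a binary treatment, the IV estimand takes the form of a ratio often called the Wald estimator: the instrument's effect on the outcome divided by its effect on treatment take-up. The Python sketch below illustrates this under assumptions added here: the function name `wald_estimate` and the small population of always-takers, compliers, and never-takers are invented, with a complier effect of 3 built in and no defiers.

```python
def mean(values):
    return sum(values) / len(values)


def wald_estimate(rows):
    """Rows are (z, d, y): binary instrument, binary treatment, outcome.

    Returns the reduced-form gap in outcomes divided by the first-stage
    gap in treatment rates between the two instrument arms.
    """
    y_z1 = mean([y for z, d, y in rows if z == 1])
    y_z0 = mean([y for z, d, y in rows if z == 0])
    d_z1 = mean([d for z, d, y in rows if z == 1])
    d_z0 = mean([d for z, d, y in rows if z == 0])
    return (y_z1 - y_z0) / (d_z1 - d_z0)


# Hypothetical population, ten units per instrument arm:
#   2 always-takers (treated regardless, outcome 10),
#   5 compliers (outcome 4 untreated, 7 treated, so their effect is 3),
#   3 never-takers (untreated regardless, outcome 2).
arm_z1 = [(1, 1, 10.0)] * 2 + [(1, 1, 7.0)] * 5 + [(1, 0, 2.0)] * 3
arm_z0 = [(0, 1, 10.0)] * 2 + [(0, 0, 4.0)] * 5 + [(0, 0, 2.0)] * 3

late = wald_estimate(arm_z1 + arm_z0)
print(round(late, 6))
```

Always-takers and never-takers contribute identically to both arms, so they cancel out of the numerator. What remains is driven entirely by compliers, which is why the ratio recovers their effect and says nothing about the other two groups.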
When a single unit receives a treatment (a state passes a law, a country experiences a conflict), neither DiD nor matching provides a natural comparison group. In 2003, Alberto Abadie and Javier Gardeazabal introduced Synthetic Control Methods (SCM) to address this problem. SCM constructs a counterfactual for the treated unit as a weighted average of untreated units (the donor pool), where the weights are nonnegative, sum to one, and are chosen so that the synthetic control matches the treated unit's pre-treatment outcomes and covariates as closely as possible. The identifying assumption is that the weighted combination of donor units provides a credible estimate of the counterfactual path in the absence of treatment. SCM extends the logic of DiD to settings with a single treated unit and a small number of comparison units, but it imposes additional data requirements: the researcher needs a long pre-treatment period and a donor pool of units that are unaffected by the treatment. The 2003 study of the economic costs of conflict in the Basque Country became a canonical application. SCM has since been refined with placebo tests and inference procedures, and it is now a standard tool in comparative case-study research.
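The weighting step can be illustrated with a deliberately crude Python sketch. It is not the optimization used in the SCM literature: it assumes weights that are nonnegative and sum to one, matches on pre-treatment outcomes only (no covariates), and searches a coarse grid instead of solving a constrained minimization. The function name `synthetic_weights`, the three donors, and all numbers are made up for illustration.

```python
from itertools import product


def synthetic_weights(treated_pre, donors_pre, steps=10):
    """Grid-search nonnegative weights summing to one that minimize the
    squared gap between the treated unit and the weighted donors over
    the pre-treatment periods."""
    best_weights, best_loss = None, float("inf")
    for combo in product(range(steps + 1), repeat=len(donors_pre)):
        if sum(combo) != steps:
            continue  # keep only weight vectors that sum to one
        weights = [c / steps for c in combo]
        loss = sum(
            (y - sum(w * donor[t] for w, donor in zip(weights, donors_pre))) ** 2
            for t, y in enumerate(treated_pre)
        )
        if loss < best_loss:
            best_weights, best_loss = weights, loss
    return best_weights


# Hypothetical outcome paths: four pre-treatment and two post-treatment periods.
donors_pre = [[1, 2, 3, 4], [3, 4, 5, 6], [10, 9, 8, 7]]
donors_post = [[5, 6], [7, 8], [6, 5]]
treated_pre = [2, 3, 4, 5]
treated_post = [4, 5]

weights = synthetic_weights(treated_pre, donors_pre)
synthetic_post = [
    sum(w * donor[t] for w, donor in zip(weights, donors_post))
    for t in range(len(treated_post))
]
gaps = [actual - synth for actual, synth in zip(treated_post, synthetic_post)]
print(weights, gaps)
```

In this constructed case an equal mix of the first two donors reproduces the treated unit's pre-treatment path exactly, and the post-treatment gap between the actual and synthetic paths is the estimated effect. A grid search scales badly with the number of donors, so a real application would use a constrained optimizer and would check the fit with placebo tests.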
All six frameworks remain active today, and they are often used in combination. The Potential Outcomes Framework provides the common language for stating assumptions, while RDD, matching, DiD, IV, and SCM offer specific identification strategies suited to different data structures and institutional settings. There is broad agreement that credible causal inference requires a clear statement of the identifying assumptions and a sensitivity analysis to assess how violations might affect the conclusions. The leading disagreement is between the design-based tradition, which prioritizes transparent assumptions and local identification, and the structural estimation tradition, which builds explicit economic models to recover deeper parameters. Within the design-based camp itself, there is ongoing debate about the scope of LATE estimates, the validity of parallel-trends assumptions in DiD, and the conditions under which synthetic controls provide reliable inference. These debates are productive: they push researchers to be more precise about what their designs can and cannot identify, and they continue to generate new methods that refine the tools available for learning about cause and effect from observational data.