Before the early 1990s, anyone trying to compare health across countries or over time faced a frustrating obstacle. Data on causes of death, illness, and disability were collected using different definitions, different age groupings, and different diagnostic criteria. A death from a heart attack in one country might be recorded as a death from old age in another. A child disabled by a parasitic infection in a low-income setting might simply not appear in any health statistic at all. The central problem that gave rise to the subfield of global health metrics and evaluation was this lack of comparability. How could policymakers, researchers, and international agencies make rational decisions about health priorities when the basic facts about who was sick and dying were not known with any confidence? The answer required building a new science of measurement—one that could produce consistent, comparable estimates of health loss across populations, and then evaluate whether the interventions designed to reduce that loss actually worked.
The first major framework to address the comparability problem was the Global Burden of Disease (GBD) Framework, launched in the early 1990s by a collaboration led by the World Bank and the World Health Organization. Its core innovation was the Disability-Adjusted Life Year, or DALY. The DALY adds years of life lost to premature death to years lived with disability, with the disability years weighted by the severity of the condition. This single metric made it possible, for the first time, to compare the health impact of a fatal infectious disease like tuberculosis with that of a non-fatal but disabling condition like depression or blindness. The GBD Framework was fundamentally descriptive: it aimed to measure the size of health problems, not to explain their causes or to test whether specific programs had reduced them. Its power lay in creating a common language for health loss that cut across diseases, countries, and time periods. By producing consistent estimates for every major cause of health loss in every country, the GBD Framework revealed that mental health conditions, injuries, and non-communicable diseases were far larger contributors to the global burden than most policymakers had assumed. This descriptive picture became the foundation for priority-setting in global health, but it also created a new demand: if the burden was now visible, how could anyone know whether the money spent on fighting it was making a difference?
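The arithmetic behind the DALY can be illustrated with a minimal sketch. It assumes the simplest form of the metric, with no age weighting or discounting, and every name and number in it is hypothetical and chosen only for illustration; it is not the GBD estimation procedure itself.

```python
from dataclasses import dataclass


@dataclass
class CauseInputs:
    """Hypothetical one-year inputs for a single cause in a single population."""
    deaths: float                       # deaths attributed to the cause
    years_lost_per_death: float         # average years of life lost per death
    years_lived_with_condition: float   # person-years lived with the condition
    disability_weight: float            # 0 = full health, 1 = equivalent to death


def years_of_life_lost(c: CauseInputs) -> float:
    """Fatal component: deaths times average years lost per death."""
    return c.deaths * c.years_lost_per_death


def years_lived_with_disability(c: CauseInputs) -> float:
    """Non-fatal component: person-years with the condition, weighted by severity."""
    return c.years_lived_with_condition * c.disability_weight


def dalys(c: CauseInputs) -> float:
    """DALYs are the sum of the fatal and non-fatal components."""
    return years_of_life_lost(c) + years_lived_with_disability(c)


# Illustrative (made-up) inputs: a mostly fatal cause and a purely disabling one.
fatal_cause = CauseInputs(deaths=1000, years_lost_per_death=30,
                          years_lived_with_condition=2000, disability_weight=0.25)
nonfatal_cause = CauseInputs(deaths=0, years_lost_per_death=0,
                             years_lived_with_condition=100000, disability_weight=0.5)

print("Fatal cause DALYs:", dalys(fatal_cause))
print("Non-fatal cause DALYs:", dalys(nonfatal_cause))
```

With these made-up inputs, the condition that kills no one carries the larger burden, which is the kind of comparison a mortality count alone cannot express.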
The GBD Framework’s estimates were only as good as the data feeding into them, and those data were often sparse, biased, or entirely absent. This reality gave rise to the Data Quality and Comparability Paradigm, a framework that treats the production of reliable, comparable health data as a distinct intellectual and practical challenge. Unlike the GBD Framework, which focused on modeling health loss, this paradigm focuses on the inputs: how to improve vital registration systems, how to standardize survey instruments, how to adjust for known biases in hospital records, and how to assess the uncertainty around every estimate. The Data Quality and Comparability Paradigm is not a single method but a set of ongoing debates and technical standards. One central tension within it is between primary data collection and statistical modeling. Some researchers argue that the priority should be strengthening country-level data systems (civil registration, health facility information systems, and household surveys) so that estimates are grounded in observed facts. Others, particularly those working within the GBD Framework, argue that modeling is necessary to fill gaps and correct biases, and that waiting for perfect primary data would mean making decisions in the dark for decades. This tension, often framed as one between data sovereignty and centralized modeling, runs through the entire subfield. The Data Quality and Comparability Paradigm is not a competitor to the GBD Framework; it is the infrastructure on which the GBD Framework depends, and the source of many of the criticisms directed at it.
While the GBD Framework answered the question "How much health is lost?", a different set of questions soon emerged: "Did a specific program or policy reduce that loss?" and "How much of the change was caused by the intervention rather than by other factors?" Answering those questions required a shift from descriptive modeling to causal inference, and that shift defined the Evaluation and Impact Assessment Framework. Emerging alongside the GBD Framework in the 1990s, this framework drew on methods from economics, epidemiology, and biostatistics—randomized controlled trials, difference-in-differences, instrumental variables, and regression discontinuity designs—to isolate the causal effect of health programs. The Evaluation and Impact Assessment Framework coexists with the GBD Framework in a relationship of productive tension. The GBD Framework provides the overall picture of health loss; the Evaluation Framework tests whether specific investments actually change that picture. But the two frameworks operate on different logics. The GBD Framework is comfortable with complex statistical models that borrow strength across countries and time to produce a single best estimate. The Evaluation Framework prioritizes internal validity, often at the cost of generalizability, and is suspicious of models that cannot be validated against a clear counterfactual. This disagreement is not resolved; it is a living feature of the subfield. Researchers who focus on impact evaluation often argue that the GBD Framework’s estimates are too uncertain to guide resource allocation, while GBD researchers counter that evaluation studies are too narrow and too rare to inform the big-picture decisions that global health requires.
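Of the designs listed above, difference-in-differences is the simplest to show in a few lines. The sketch below is a bare two-group, two-period version that assumes the treated and comparison groups would have followed parallel trends without the program. The district mortality figures are invented for illustration, and the function name is hypothetical.

```python
from statistics import mean


def difference_in_differences(treated_before, treated_after,
                              control_before, control_after):
    """Two-group, two-period difference-in-differences estimate.

    Subtracts the change in the comparison group from the change in the
    treated group. The result can be read as the program effect only if
    both groups would have followed parallel trends without the program.
    """
    treated_change = mean(treated_after) - mean(treated_before)
    control_change = mean(control_after) - mean(control_before)
    return treated_change - control_change


# Invented under-five mortality rates (deaths per 1,000 live births) by district.
treated_before = [80.0, 90.0, 100.0]   # program districts, before rollout
treated_after = [60.0, 70.0, 80.0]     # program districts, after rollout
control_before = [85.0, 95.0, 105.0]   # comparison districts, same periods
control_after = [80.0, 90.0, 100.0]

effect = difference_in_differences(treated_before, treated_after,
                                   control_before, control_after)
print("Estimated program effect (deaths per 1,000):", effect)
```

In this invented example, mortality falls in both groups, so a simple before-and-after comparison in the program districts would overstate the effect; subtracting the comparison group's change removes the part of the decline that would have happened anyway, provided the parallel-trends assumption holds.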
By the early 2000s, a third kind of question had become urgent: not just how much health was lost, or whether a specific program worked, but whether the entire health system was functioning well. The Health Systems Performance Assessment Framework emerged to address this question. It broadened the focus from disease-specific outcomes to system-level attributes such as access, equity, efficiency, responsiveness, and financial protection. This framework was closely tied to the movement for Universal Health Coverage (UHC), which required metrics that could track whether people could obtain needed services without suffering financial hardship. The Health Systems Performance Assessment Framework differs from the GBD Framework in its object of measurement: the GBD Framework measures health outcomes (deaths and disabilities), while the Health Systems Framework measures system processes and outputs (coverage of services, out-of-pocket spending, waiting times, distribution of resources). The two frameworks overlap, however, in their reliance on comparable data. A health system cannot be assessed without knowing the burden of disease it is meant to address, and the GBD Framework provides that baseline. Conversely, the GBD Framework’s estimates of health loss are shaped by the performance of health systems, so understanding system performance helps explain why burden patterns differ across countries. The Health Systems Performance Assessment Framework also shares a tension with the Evaluation and Impact Assessment Framework: system-level metrics are often too aggregated to reveal whether a specific reform caused an observed improvement, and evaluation purists argue that system-level comparisons are vulnerable to confounding.
The ambition of the GBD Framework, to produce consistent estimates for every cause of health loss in every country and every year, required an organizational and computational apparatus far beyond what any single research group had previously assembled. That apparatus became the Health Metrics and Evaluation Infrastructure, most visibly embodied by the Institute for Health Metrics and Evaluation (IHME), founded in 2007 at the University of Washington. The Infrastructure framework is not a set of methods or metrics in itself; it is the institutional and technical platform that makes the GBD Framework’s production cycle possible. IHME developed standardized data-processing pipelines, Bayesian statistical models, and a massive repository of input data from vital registration, surveys, and disease registries. It also created a regular publication cycle of GBD updates that gave the estimates a rhythm and a public presence. The Health Metrics and Evaluation Infrastructure transformed the GBD Framework from an occasional academic exercise into a continuous, globally visible enterprise. But it also intensified the tensions already present in the subfield. The Infrastructure’s centralized modeling approach, in which a single team in Seattle produces estimates for every country, has been criticized by advocates of the Data Quality and Comparability Paradigm, who argue that it discourages investment in local data systems. Country health officials sometimes find that IHME’s estimates differ from their own administrative data, leading to disputes about which numbers should guide policy. The Infrastructure framework is thus both the subfield’s greatest strength, because it produces estimates that no single country could produce alone, and a source of ongoing disagreement about who should control the production of health statistics.
Today, all five frameworks remain active, and the subfield is defined less by a single dominant approach than by a division of labor among them. The GBD Framework and the Health Metrics and Evaluation Infrastructure together provide the most widely used global estimates of health loss, and the DALY has become a standard metric in health priority-setting. The Data Quality and Comparability Paradigm has gained institutional recognition, with major initiatives like the Global Financing Facility and the Health Data Collaborative pushing for stronger country data systems. The Evaluation and Impact Assessment Framework has become a required component of major global health funding, with organizations like the Global Fund and the World Bank demanding rigorous evidence of impact. The Health Systems Performance Assessment Framework has been institutionalized in the UHC monitoring framework led by the World Health Organization and the World Bank.
The leading frameworks agree that health metrics must be transparent, that uncertainty should be quantified, and that data quality is a prerequisite for any meaningful analysis. There is broad consensus that the DALY, despite its limitations, is the best available common unit for comparing health loss across conditions. Three major disagreements remain. First, the tension between descriptive modeling and causal evaluation: should global health resources be allocated based on the size of the burden (GBD) or on the proven effectiveness of interventions (Impact Evaluation)? Second, the tension between centralized and decentralized data production: should estimates be produced by a single global institution with sophisticated models, or should they be built from the ground up by countries with their own data systems? Third, the tension between disease-specific and system-level metrics: should the field focus on tracking outcomes for individual conditions, or on measuring the overall performance of health systems? These disagreements are not signs of weakness; they are the productive frictions that drive the subfield forward, forcing researchers and policymakers to make their assumptions explicit and to defend their choices with evidence.