Epidemiologic modeling has always been pulled between two ambitions: to understand the mechanisms that drive disease through populations and to extract reliable patterns from messy, incomplete data. The first ambition demands equations that mimic biological and social processes; the second demands statistical tools that let the data speak without imposing a rigid mechanistic story. Over the past century, modelers have built and rebuilt frameworks for balancing these tasks, each framework shaped by the diseases that pressed on it and the computational resources available at the time. The result is not a single victorious paradigm but a practical pluralism in which different frameworks coexist, compete, and sometimes combine.
The field's founding framework emerged from a concrete problem: how could a few simple equations capture the rise and fall of an epidemic? In 1927, William Ogilvy Kermack and Anderson Gray McKendrick proposed a model that divided a population into compartments, Susceptible, Infectious, and Recovered (SIR), and wrote differential equations for the flow between them. The model was deterministic: given the same starting conditions, it always produced the same trajectory. Its great achievement was to reveal threshold phenomena, now usually expressed through the basic reproduction number R₀, the average number of secondary cases generated by a single infectious person in a fully susceptible population. When R₀ exceeds 1, an epidemic can take off; when it falls below 1, the outbreak fades. This mechanistic logic gave public health a powerful conceptual lever: interventions that reduce R₀ below 1 could, in principle, stop an epidemic.
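The threshold behavior can be seen in a few lines of code. The sketch below assumes the common constant-rate form of the SIR equations, with a transmission rate `beta` and a recovery rate `gamma`, for which R₀ = beta / gamma. The parameter values, the population size, and the forward Euler integration are illustrative choices, not details taken from the 1927 paper.

```python
def simulate_sir(beta, gamma, s0, i0, days=200, dt=0.1):
    """Integrate the SIR equations with forward Euler steps.

    dS/dt = -beta*S*I/N
    dI/dt =  beta*S*I/N - gamma*I
    dR/dt =  gamma*I
    """
    s, i, r = float(s0), float(i0), 0.0
    n = s + i + r
    peak_i = i
    for _ in range(int(days / dt)):
        new_infections = beta * s * i / n * dt
        new_recoveries = gamma * i * dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        peak_i = max(peak_i, i)
    return {"S": s, "I": i, "R": r, "peak_I": peak_i}


def basic_reproduction_number(beta, gamma):
    """R0 for the constant-rate SIR model (an assumption of this sketch)."""
    return beta / gamma


# Hypothetical parameter sets on either side of the threshold.
above = simulate_sir(beta=0.5, gamma=0.25, s0=9990, i0=10)  # R0 = 2.0
below = simulate_sir(beta=0.2, gamma=0.25, s0=9990, i0=10)  # R0 = 0.8

for label, run in (("R0 = 2.0", above), ("R0 = 0.8", below)):
    print(f"{label}: peak infectious = {run['peak_I']:.0f}, "
          f"ever infected = {run['I'] + run['R']:.0f}")
```

With R₀ above 1 the infectious compartment grows well beyond its starting size before declining; with R₀ below 1 it never exceeds the initial ten cases.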
Yet the framework's strength was also its limitation. Deterministic compartmental models assumed that populations were well-mixed, that individuals within a compartment were identical, and that chance played no role. These assumptions worked well for large, homogeneous populations and fast-spreading pathogens, but they broke down for small populations, rare diseases, or settings where individual variation mattered. The framework did not disappear—it remains a workhorse for pandemic planning and basic teaching—but its simplifying commitments created pressure for alternatives.
By the 1950s, two very different responses to the deterministic paradigm had taken shape. Each addressed a distinct limitation, and together they opened a lasting divide between mechanistic and empirical modeling traditions.
Stochastic models kept the mechanistic core of compartmental thinking but added randomness. Instead of predicting a single epidemic curve, they treated transitions between compartments as probabilistic events. This allowed modelers to capture the role of chance in small populations, the extinction of outbreaks before they take off, and the variability between repeated realizations of the same epidemic process. Stochastic models did not replace deterministic ones; they coexisted with them, often used for the same diseases but with an explicit acknowledgment of uncertainty. The relationship was one of refinement rather than replacement: stochastic models preserved the mechanistic logic of compartments while relaxing the assumption of perfect predictability.
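One way to make this concrete is an event-by-event simulation of the same SIR process. The sketch below uses the Gillespie algorithm, a standard technique chosen here for illustration rather than one named in the text; the function name, parameter values, and population size are hypothetical.

```python
import random


def gillespie_sir(beta, gamma, s0, i0, rng):
    """Simulate one stochastic SIR outbreak; return (duration, final size)."""
    s, i, r = s0, i0, 0
    n = s0 + i0
    t = 0.0
    while i > 0:
        infection_rate = beta * s * i / n
        recovery_rate = gamma * i
        total_rate = infection_rate + recovery_rate
        # Waiting time to the next event is exponential in the total rate.
        t += rng.expovariate(total_rate)
        # Pick which event happened in proportion to its rate.
        if rng.random() < infection_rate / total_rate:
            s -= 1
            i += 1
        else:
            i -= 1
            r += 1
    return t, r


rng = random.Random(2024)  # fixed seed so the replicates are reproducible
outbreaks = [gillespie_sir(beta=0.5, gamma=0.25, s0=199, i0=1, rng=rng)
             for _ in range(500)]
final_sizes = [size for _, size in outbreaks]
minor = sum(size < 10 for size in final_sizes)
print(f"{minor} of {len(final_sizes)} runs ended with fewer than 10 cases")
```

Although every replicate uses identical parameters with R₀ = 2, some outbreaks die out after a handful of cases while others sweep through most of the population, which is exactly the variability a deterministic curve cannot show.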
Regression-based empirical models took a fundamentally different path. Instead of starting with a mechanistic story about transmission, they began with data—case counts, risk factors, environmental variables—and used statistical regression to identify associations. The goal was not to simulate the disease process but to estimate the strength of relationships and make predictions. This framework emerged from the broader tradition of statistical epidemiology and was especially useful for chronic diseases, where the mechanistic details of transmission were less relevant than the identification of risk factors. Regression models did not require modelers to specify how the disease moved; they required only that the data be structured enough to fit a linear or logistic equation. The contrast with deterministic compartmental models could hardly be sharper: one tradition built models from first principles about mechanisms, the other built models from patterns in data.
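As a minimal illustration of the regression approach, the sketch below fits a logistic model with a single binary risk factor by Newton's method. The counts are invented for the example, and the function name `fit_logistic` is hypothetical. With one binary predictor, the exponentiated slope should match the odds ratio computed directly from the two-by-two table.

```python
import math


def fit_logistic(xs, ys, iterations=25):
    """Fit P(y=1) = 1 / (1 + exp(-(b0 + b1*x))) by Newton's method."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0            # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0    # entries of X^T W X (negative Hessian)
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton update: solve the 2x2 system explicitly.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Hypothetical data: 100 exposed people (30 cases), 100 unexposed (10 cases).
xs = [1] * 100 + [0] * 100
ys = [1] * 30 + [0] * 70 + [1] * 10 + [0] * 90

b0, b1 = fit_logistic(xs, ys)
odds_ratio = (30 * 90) / (70 * 10)
print(f"exp(b1) = {math.exp(b1):.3f}, table odds ratio = {odds_ratio:.3f}")
```

Nothing in the fit describes how the disease moves between people; the model only quantifies how strongly the exposure is associated with the outcome.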
By the 1990s, both the mechanistic and empirical traditions had grown more sophisticated, but each faced a new challenge: how to handle heterogeneity. Populations are not uniform, and data come from multiple levels—individuals, households, neighborhoods, regions. Two frameworks emerged to address these complexities, each extending one side of the earlier divide.
Agent-based models (ABMs) pushed the mechanistic tradition toward extreme granularity. Instead of dividing a population into a few compartments, ABMs simulated each individual as a unique agent with its own attributes, behaviors, and contacts. The modeler specified rules for how agents interacted and how infection spread, then let the simulation run to see what patterns emerged. ABMs could represent heterogeneity in age, mobility, social networks, and intervention compliance in ways that compartmental models could not. Their cost was computational intensity and the difficulty of calibrating so many parameters. ABMs did not reject the mechanistic ambition; they fulfilled it more completely by building the population from the ground up.
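A toy version of this idea fits in a short script. In the sketch below, every attribute, rule, and number is a hypothetical choice made for illustration: agents differ in daily contact counts, some comply with an intervention that halves their contacts, and contacts are drawn uniformly at random from the population.

```python
import random
from dataclasses import dataclass


@dataclass
class Agent:
    state: str             # "S", "I", or "R"
    contacts_per_day: int  # heterogeneous contact behavior
    complies: bool         # compliant agents halve their contacts


def step(agents, p_transmit, p_recover, rng):
    """Advance the simulation by one day using simple per-agent rules."""
    newly_infected, newly_recovered = [], []
    for idx, agent in enumerate(agents):
        if agent.state != "I":
            continue
        n_contacts = agent.contacts_per_day
        if agent.complies:
            n_contacts //= 2
        for _ in range(n_contacts):
            other = rng.randrange(len(agents))
            if agents[other].state == "S" and rng.random() < p_transmit:
                newly_infected.append(other)
        if rng.random() < p_recover:
            newly_recovered.append(idx)
    # Apply state changes after the loop so updates within a day are synchronous.
    for idx in newly_infected:
        agents[idx].state = "I"
    for idx in newly_recovered:
        agents[idx].state = "R"


rng = random.Random(0)
agents = [Agent("S", rng.choice([2, 5, 20]), rng.random() < 0.3)
          for _ in range(500)]
agents[0].state = "I"  # seed a single infection

history = []
for day in range(100):
    step(agents, p_transmit=0.05, p_recover=0.2, rng=rng)
    history.append(sum(a.state == "I" for a in agents))

ever_infected = sum(a.state != "S" for a in agents)
print(f"peak infectious: {max(history)}, ever infected: {ever_infected}")
```

Even this small example has five tunable quantities (three contact levels, a compliance share, and two probabilities), which hints at why calibration becomes difficult as agents grow more detailed.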
Bayesian hierarchical models took the empirical tradition in a different direction. They used Bayesian statistics to combine data from multiple levels—individual outcomes nested within clinics, clinics within regions—while explicitly modeling uncertainty at each level. The hierarchical structure allowed information to be shared across groups, improving estimates for small or sparse subgroups. Bayesian models did not require a mechanistic transmission story; they required a careful specification of prior distributions and a computational engine (often Markov chain Monte Carlo) to sample from the posterior. They extended the regression-based tradition by making it more flexible and more honest about uncertainty, but they remained firmly in the empirical camp: the goal was inference from data, not simulation of mechanisms.
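The sharing of information across groups can be sketched without a sampler. The example below is a deliberately simplified stand-in, not a full hierarchical model: it assumes a shared Beta prior whose mean is fixed at the pooled rate and whose strength is chosen by hand, whereas a full treatment would place priors on those hyperparameters and sample them, for instance with MCMC. The clinic counts and the function name `partial_pooling` are hypothetical.

```python
def partial_pooling(cases, totals, prior_mean, prior_strength):
    """Posterior mean rate per group under a shared Beta prior.

    The prior is Beta(a, b) with a = prior_mean * prior_strength and
    b = (1 - prior_mean) * prior_strength; with binomial data the
    posterior mean for a group is (a + cases) / (a + b + total).
    """
    a = prior_mean * prior_strength
    b = (1.0 - prior_mean) * prior_strength
    return [(a + c) / (a + b + n) for c, n in zip(cases, totals)]


# Hypothetical clinics: two small ones and one large one.
cases = [3, 40, 0]
totals = [5, 400, 8]

pooled_rate = sum(cases) / sum(totals)
raw = [c / n for c, n in zip(cases, totals)]
shrunk = partial_pooling(cases, totals, prior_mean=pooled_rate,
                         prior_strength=20)

for n, r, s in zip(totals, raw, shrunk):
    print(f"n = {n:3d}: raw rate = {r:.3f}, partially pooled = {s:.3f}")
```

The estimates for the two small clinics move substantially toward the pooled rate, while the large clinic's estimate barely changes, which is the behavior that makes hierarchical models useful for sparse subgroups.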
The 2000s brought two more frameworks that pushed the mechanistic-empirical tension in opposite directions.
Network models focused on the topology of contacts. Instead of assuming random mixing or simulating every agent attribute, they represented the population as a graph whose edges captured who could infect whom. The structure of the graph—whether it was scale-free, small-world, or clustered—determined how quickly and widely a pathogen could spread. Network models were mechanistic in spirit: they specified a structural mechanism (contact topology) that shaped transmission. But they differed from compartmental models by foregrounding heterogeneity in connectivity, and they differed from ABMs by abstracting away individual attributes to focus on relational patterns. Network models revealed that highly connected individuals ("superspreaders") could drive outbreaks even when average transmission was low, a finding that had direct implications for targeted interventions.
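The effect of a few highly connected individuals can be illustrated with a standard result for random networks defined only by their degree sequence (the configuration model): the reproduction number is T(⟨k²⟩ − ⟨k⟩)/⟨k⟩, where T is the per-contact transmission probability. The configuration-model assumption, the value of T, and both degree sequences below are choices made for this sketch.

```python
def network_r0(degrees, transmissibility):
    """Reproduction number on a configuration-model network.

    Uses R0 = T * (<k^2> - <k>) / <k>, where the ratio is the mean
    number of further contacts of a node reached by following an edge.
    """
    mean_k = sum(degrees) / len(degrees)
    mean_k2 = sum(k * k for k in degrees) / len(degrees)
    return transmissibility * (mean_k2 - mean_k) / mean_k


# Two hypothetical populations of 100 people with the same mean degree (4).
homogeneous = [4] * 100
with_hubs = [2] * 95 + [42] * 5

T = 0.2  # assumed per-contact transmission probability
for label, degrees in (("homogeneous", homogeneous), ("with hubs", with_hubs)):
    mean_degree = sum(degrees) / len(degrees)
    print(f"{label}: mean degree = {mean_degree:.1f}, "
          f"R0 = {network_r0(degrees, T):.2f}")
```

Both populations have the same average number of contacts, yet only the one with hubs sits above the epidemic threshold, because the variance of the degree distribution enters the formula through ⟨k²⟩.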
Machine learning approaches took the empirical tradition to its logical extreme. Instead of specifying a regression equation or a hierarchical structure, they used algorithms—random forests, support vector machines, neural networks—to learn patterns directly from data, often with minimal assumptions about the underlying process. Machine learning excelled at prediction, especially when data were high-dimensional and relationships were nonlinear. But it came with a cost: the models were often opaque, making it difficult to explain why a prediction was made or to infer causal mechanisms. Machine learning did not replace regression-based models; it coexisted with them, often used for early warning systems, outbreak detection from social media, or risk stratification, while regression remained preferred when interpretability was paramount.
Today, no single framework dominates epidemiologic modeling. The field has settled into a practical division of labor shaped by the question at hand. Deterministic compartmental models remain the first tool for rapid pandemic assessment because they are transparent, fast, and grounded in a mechanistic logic that policymakers can grasp. Stochastic models are preferred when uncertainty and chance events matter, such as in the early phase of an outbreak or for small populations. Agent-based models are used when heterogeneity in behavior, mobility, or contact patterns is central to the question. Bayesian hierarchical models are the workhorse for analyzing surveillance data with complex spatial or temporal structure. Network models guide interventions targeting superspreaders or vulnerable groups. Machine learning approaches power real-time prediction and anomaly detection from diverse data streams.
What the leading frameworks agree on is that no single model can answer every question. The most influential work today often combines frameworks: a compartmental model may be embedded in a Bayesian hierarchical structure to estimate parameters from data, or an agent-based model may be calibrated using machine learning to match observed patterns. This hybrid approach acknowledges that the mechanistic-empirical tension is not a problem to be solved but a productive force to be managed.
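A stripped-down version of the first combination is sketched below: a discrete-time SIR model supplies the expected daily case counts, and a Bayesian grid approximation estimates the transmission rate from observed counts. The sketch is single-level rather than hierarchical, and it assumes a Poisson observation model, a flat prior over the grid, a known recovery rate, and synthetic data generated from the model itself; all names and values are hypothetical.

```python
import math


def daily_incidence(beta, gamma, n, i0, days):
    """Expected new infections per day from a discrete-time SIR model."""
    s, i = float(n - i0), float(i0)
    incidence = []
    for _ in range(days):
        new_infections = beta * s * i / n
        s -= new_infections
        i += new_infections - gamma * i
        incidence.append(new_infections)
    return incidence


def posterior_over_beta(observed, betas, gamma, n, i0):
    """Grid-approximate posterior for beta under a flat prior."""
    log_post = []
    for beta in betas:
        expected = daily_incidence(beta, gamma, n, i0, len(observed))
        # Poisson log-likelihood, dropping terms that do not depend on beta.
        log_post.append(sum(y * math.log(mu) - mu
                            for y, mu in zip(observed, expected)))
    peak = max(log_post)
    weights = [math.exp(lp - peak) for lp in log_post]  # avoid overflow
    total = sum(weights)
    return [w / total for w in weights]


# Synthetic "surveillance" data: model output at beta = 0.5, rounded to counts.
true_beta, gamma, n, i0 = 0.5, 0.25, 10000, 10
observed = [round(x) for x in daily_incidence(true_beta, gamma, n, i0, 20)]

betas = [0.30 + 0.02 * k for k in range(21)]  # grid from 0.30 to 0.70
posterior = posterior_over_beta(observed, betas, gamma, n, i0)
best_beta = betas[posterior.index(max(posterior))]
print(f"posterior mode for beta: {best_beta:.2f}")
```

The mechanistic model constrains what epidemic curves are possible, while the statistical layer decides which of those curves the data support; a hierarchical version would extend this by letting several regions share information about their transmission rates.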
What they disagree on is how much mechanistic transparency to sacrifice for predictive accuracy. Proponents of mechanistic models argue that understanding why an epidemic behaves as it does is essential for designing interventions that will work under changed conditions. Proponents of machine learning counter that when data are abundant and the system is complex, a black-box predictor may outperform any mechanism-based model, and that understanding can come after prediction. This disagreement is not likely to be resolved; it reflects a deeper philosophical divide about what modeling is for. The field's vitality depends on keeping both traditions alive, letting each push the other toward greater rigor and relevance.