How can we ensure that increasingly capable artificial intelligence systems remain beneficial to humanity? This question has driven a young, rapidly evolving subfield that is less a settled discipline than a collection of competing research programs, each with its own diagnosis of the core problem and its own preferred solution. The central tension running through AI safety is between building rigorous, abstract theories of aligned intelligence and developing practical, empirical methods that work with today's machine learning systems. This tension has shaped a sequence of frameworks that sometimes build on each other, sometimes react against each other, and sometimes coexist in productive disagreement.
The earliest systematic research program in AI safety, Agent Foundations, emerged from the Machine Intelligence Research Institute (MIRI) in the mid-2000s. Its core commitment was that aligning advanced AI requires a formal, mathematical theory of intelligent agency itself. Researchers in this tradition sought to define what it means for an AI system to reliably pursue human goals, drawing on decision theory, game theory, and formal logic. The distinctive contribution of Agent Foundations was to treat alignment as a theoretical problem first: before building powerful AI, we should understand the abstract principles that guarantee a system's behavior remains under human control. This framework produced influential concepts such as "coherent extrapolated volition," a proposed (though never fully formalized) target for what humans would want an AI to do if we were more informed and reflective. Agent Foundations remains an active minority tradition, arguing that empirical shortcuts will fail for the most capable future systems.
Building directly on the concerns raised by Agent Foundations, Long-Term AI Risk Analysis shifted the focus from formal guarantees to strategic dynamics. Rather than asking how to prove an AI system is safe, this framework asks: what are the large-scale risks posed by advanced AI, and how should humanity prepare? It inherited from Agent Foundations the assumption that future AI could be extremely powerful and potentially dangerous, but it broadened the inquiry to include geopolitical competition, arms races, and the difficulty of coordinating global actors. The relationship between the two frameworks is one of foundation and application: Agent Foundations provided the conceptual tools for thinking about misaligned superintelligence, while Long-Term Risk Analysis applied those tools to real-world scenarios. This framework remains active, particularly in policy-oriented research, and coexists with more engineering-focused approaches by insisting that the most severe risks may come from systems far beyond today's capabilities.
Capability Control represents a narrowing of the safety problem from abstract alignment to practical containment. Instead of trying to make an AI want what we want, this framework asks: how can we restrict what a powerful AI can do, even if we cannot fully align its goals? The core idea is to build external constraints—tripwires, sandboxes, restricted information access, and oversight mechanisms—that prevent a system from causing harm regardless of its internal motivations. This approach contrasts sharply with Agent Foundations: where Agent Foundations sought to solve the alignment problem from first principles, Capability Control treats it as an engineering challenge of building reliable cages. The framework gained traction after Nick Bostrom's book Superintelligence popularized the "control problem," and it remains influential in discussions of how to safely deploy powerful AI systems today, especially in high-stakes domains like autonomous weapons or critical infrastructure.
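The idea of external constraints that hold regardless of a system's motivations can be illustrated with a small sketch. The `GuardedExecutor` class, the action names, and `untrusted_policy` below are all hypothetical and chosen only for illustration; a real containment mechanism would need far stronger isolation than an in-process wrapper. The sketch combines an allowlist (a crude sandbox), an action budget, and a tripwire that halts everything after the first forbidden request.

```python
from dataclasses import dataclass, field
from typing import Callable, FrozenSet, List


@dataclass
class GuardedExecutor:
    """Hypothetical wrapper: allowlist sandbox, action budget, and tripwire."""
    allowed: FrozenSet[str]          # actions the sandbox permits
    budget: int                      # maximum number of permitted actions
    halted: bool = False             # set once the tripwire fires
    audit_log: List[str] = field(default_factory=list)

    def execute(self, action: str, handler: Callable[[str], str]) -> str:
        # Once the tripwire has fired, nothing else runs.
        if self.halted:
            return "blocked: halted"
        # A request outside the allowlist trips the wire and halts the system.
        if action not in self.allowed:
            self.halted = True
            self.audit_log.append(f"tripwire: {action}")
            return "blocked: tripwire"
        # Permitted actions still consume a finite budget.
        if self.budget <= 0:
            self.audit_log.append(f"budget exhausted: {action}")
            return "blocked: budget"
        self.budget -= 1
        self.audit_log.append(f"ok: {action}")
        return handler(action)


def untrusted_policy(action: str) -> str:
    """Stand-in for whatever the contained system would actually do."""
    return f"did {action}"


guard = GuardedExecutor(allowed=frozenset({"read_file", "summarize"}), budget=2)
requests = ["read_file", "open_socket", "summarize"]
results = [guard.execute(a, untrusted_policy) for a in requests]
print(results)
```

Note that the guard never inspects why the policy asked for `open_socket`; it constrains behavior only from the outside, which is the defining feature of this framework.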
Interpretability emerged as a methodological school focused on understanding what neural networks are actually doing internally. Its distinctive contribution was to treat transparency as a safety tool: if we can read a model's internal representations, we can detect deception, bias, or goal misalignment before it causes harm. Early work, such as layer-wise relevance propagation, aimed to explain individual decisions of deep networks by attributing an output to the input features that produced it. Unlike Capability Control, which builds external barriers, Interpretability tries to open the black box. It initially developed alongside other safety frameworks, but its relationship to later ones is one of absorption: it became a tool within Empirical Alignment rather than remaining an independent paradigm.
Empirical Alignment marked a decisive reaction against the theoretical orientation of Agent Foundations. Researchers in this tradition argued that alignment should be studied experimentally with today's machine learning systems, not through abstract proofs about hypothetical future AIs. The landmark paper "Concrete Problems in AI Safety" (2016) listed five specific, testable failure modes—such as reward hacking and distributional shift—that could be studied in current reinforcement learning agents. This framework's core commitment is that alignment research must be grounded in empirical practice: train a model, observe its failures, and iterate on solutions. Empirical Alignment subsumed both Interpretability and Scalable Oversight as methodological tools within its broader experimental program. It became the dominant framework in the field because it aligned with the incentives of academic AI research: it produces publishable results, works with existing systems, and attracts funding from major labs like DeepMind and OpenAI. Today, most active AI safety researchers work within this paradigm, using techniques such as reinforcement learning from human feedback (RLHF) and red-teaming to probe and correct model behavior.
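The "train, observe, iterate" loop behind RLHF can be made concrete with a toy reward model. The sketch below assumes the pairwise (Bradley-Terry style) preference loss commonly used for reward modeling, a linear reward function, and invented two-feature data (a "helpfulness" score and a "length" score); none of these specifics come from the text above, and real systems use neural networks over text rather than hand-built features.

```python
import math
from typing import List, Sequence, Tuple

Pair = Tuple[Sequence[float], Sequence[float]]  # (chosen, rejected) features


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def reward(w: Sequence[float], x: Sequence[float]) -> float:
    """Linear reward model: dot product of weights and features."""
    return sum(wi * xi for wi, xi in zip(w, x))


def preference_loss(w: Sequence[float], chosen, rejected) -> float:
    """Pairwise loss: -log sigmoid(r(chosen) - r(rejected))."""
    return -math.log(sigmoid(reward(w, chosen) - reward(w, rejected)))


def train(pairs: List[Pair], dim: int, lr: float = 0.5, epochs: int = 200):
    """Plain stochastic gradient descent on the pairwise loss."""
    w = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(w, chosen) - reward(w, rejected)
            # Gradient of the loss w.r.t. w is
            # -(1 - sigmoid(margin)) * (chosen - rejected).
            scale = 1.0 - sigmoid(margin)
            w = [wi + lr * scale * (c - r)
                 for wi, c, r in zip(w, chosen, rejected)]
    return w


# Invented human preference data: features are [helpfulness, length].
pairs: List[Pair] = [
    ([1.0, 0.2], [0.0, 0.9]),
    ([0.8, 0.5], [0.3, 0.5]),
    ([0.9, 0.1], [0.1, 0.1]),
]
weights = train(pairs, dim=2)
mean_loss = sum(preference_loss(weights, c, r) for c, r in pairs) / len(pairs)
print(weights, mean_loss)
```

In a full RLHF pipeline the learned reward would then be used to fine-tune a policy, which is where failure modes such as reward hacking can appear: the policy optimizes the learned proxy rather than what the human labelers actually intended.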
Robustness and Assurance addresses a complementary problem to alignment: even a perfectly aligned AI can fail if it encounters inputs it was not trained on. This framework focuses on making AI systems reliable through formal verification, adversarial training, and distributional robustness. Its relationship to Capability Control is instructive: both are engineering approaches, but Capability Control builds external constraints while Robustness and Assurance builds internal reliability. A verified neural network that provably satisfies safety constraints is a different kind of guarantee than a sandboxed agent. Robustness and Assurance remained independent of Empirical Alignment because its methods—formal verification, certification, and worst-case analysis—require a different mathematical toolkit than the empirical loop of training and testing. It persists as a minority tradition, strongest in safety-critical applications like autonomous driving and medical diagnosis.
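To show how a verified guarantee differs from testing, here is a minimal sketch of interval bound propagation, one simple technique for certifying network outputs over a whole region of inputs. The two-layer ReLU network, its weights, the perturbation radius, and the safety threshold of 1.0 are all made up for illustration; production verifiers use much tighter relaxations.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def affine_bounds(lo: Vector, hi: Vector, W: Matrix, b: Vector) -> Tuple[Vector, Vector]:
    """Propagate elementwise input bounds [lo, hi] through y = Wx + b."""
    out_lo, out_hi = [], []
    for row, bias in zip(W, b):
        l, h = bias, bias
        for w, a, c in zip(row, lo, hi):
            # A positive weight maps lower to lower; a negative weight swaps them.
            if w >= 0:
                l += w * a
                h += w * c
            else:
                l += w * c
                h += w * a
        out_lo.append(l)
        out_hi.append(h)
    return out_lo, out_hi


def relu_bounds(lo: Vector, hi: Vector) -> Tuple[Vector, Vector]:
    """ReLU is monotone, so it can be applied to each bound directly."""
    return [max(0.0, v) for v in lo], [max(0.0, v) for v in hi]


def forward(x: Vector, W1: Matrix, b1: Vector, W2: Matrix, b2: Vector) -> Vector:
    """Ordinary forward pass, used only to sanity-check the bounds."""
    h = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    return [sum(w * hi_ for w, hi_ in zip(row, h)) + b for row, b in zip(W2, b2)]


# Hypothetical network and input region: x = (0.5, 0.5), radius 0.1 per coordinate.
W1, b1 = [[1.0, -1.0], [0.5, 0.5]], [0.0, -0.25]
W2, b2 = [[1.0, 1.0]], [0.0]
center, eps = [0.5, 0.5], 0.1
lo = [c - eps for c in center]
hi = [c + eps for c in center]

lo1, hi1 = relu_bounds(*affine_bounds(lo, hi, W1, b1))
out_lo, out_hi = affine_bounds(lo1, hi1, W2, b2)

THRESHOLD = 1.0  # assumed safety property: output must never exceed this
certified = out_hi[0] <= THRESHOLD
print(out_lo, out_hi, certified)
```

If `certified` is true, the property holds for every input in the box, not just for sampled test points. The bounds are sound but loose, which is why worst-case analysis of this kind needs a different toolkit from empirical training and testing.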
Multi-Agent Safety extends the safety problem from single systems to interactions between multiple AI agents. Its distinctive contribution is to recognize that many real-world risks arise from strategic dynamics between agents—competition, deception, arms races—rather than from any single misaligned system. This framework draws on game theory and multi-agent reinforcement learning, and it shares concerns with Long-Term AI Risk Analysis about systemic rather than individual failure modes. However, Multi-Agent Safety is narrower in scope: it focuses on formal models of agent interactions rather than global geopolitical strategy. It coexists with Empirical Alignment by providing a different level of analysis: while Empirical Alignment asks how to train a single helpful model, Multi-Agent Safety asks what happens when many such models interact in economic or military settings.
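A standard game-theoretic toy model shows how risk can arise from interaction rather than from any single agent. The sketch below assumes a two-player game with prisoner's-dilemma payoffs, relabeled as a choice between restraint and racing; the payoff numbers are the conventional textbook values, not data from any real scenario. It enumerates the pure-strategy Nash equilibria by brute force.

```python
from itertools import product
from typing import Dict, List, Tuple

Profile = Tuple[str, str]

ACTIONS = ["restrain", "race"]

# PAYOFFS[(row_action, col_action)] = (row player's payoff, column player's payoff)
PAYOFFS: Dict[Profile, Tuple[int, int]] = {
    ("restrain", "restrain"): (3, 3),
    ("restrain", "race"): (0, 5),
    ("race", "restrain"): (5, 0),
    ("race", "race"): (1, 1),
}


def pure_nash_equilibria(payoffs: Dict[Profile, Tuple[int, int]],
                         actions: List[str]) -> List[Profile]:
    """Return every action profile where neither player gains by deviating alone."""
    equilibria = []
    for a, b in product(actions, repeat=2):
        row_ok = all(payoffs[(a, b)][0] >= payoffs[(alt, b)][0] for alt in actions)
        col_ok = all(payoffs[(a, b)][1] >= payoffs[(a, alt)][1] for alt in actions)
        if row_ok and col_ok:
            equilibria.append((a, b))
    return equilibria


equilibria = pure_nash_equilibria(PAYOFFS, ACTIONS)
print(equilibria)
```

The only equilibrium is mutual racing, even though both players would do better under mutual restraint. Each agent is behaving rationally by its own lights, so the failure belongs to the system rather than to any one of them.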
Scalable Oversight addresses a specific bottleneck in Empirical Alignment: how can humans supervise AI systems that are more capable than they are? The core problem is that as models become more capable, human feedback becomes less reliable: a human cannot easily evaluate the output of a superhuman chess engine or a medical diagnosis system that outperforms doctors. Scalable Oversight proposes methods to amplify human judgment, such as debate (where two AIs argue a question before a human judge) and recursive reward modeling (where models trained on easier tasks assist humans in evaluating more capable models). This framework was quickly absorbed into Empirical Alignment as a sub-problem, but it retains a distinct identity because it tackles a conceptual challenge that pure empirical iteration cannot solve: the fundamental asymmetry between human evaluators and increasingly capable systems.
Today, AI safety is a pluralistic field with several active frameworks. Empirical Alignment is dominant, especially in industry labs, because it produces measurable progress on current systems and attracts institutional support. Agent Foundations and Long-Term AI Risk Analysis persist as minority traditions, arguing that empirical methods will not scale to superhuman AI and that foundational theory remains essential. Robustness and Assurance maintains a separate track focused on formal guarantees, while Multi-Agent Safety and Capability Control address specific risk categories that Empirical Alignment does not fully cover.
The leading frameworks agree on several points: that advanced AI poses genuine risks, that safety should be studied proactively, and that no single approach is sufficient. They disagree sharply on methodology—whether to prioritize theory or experiment—and on timescale: should we focus on risks from systems that exist today, or prepare for transformative AI decades in the future? The deepest disagreement is about whether alignment is fundamentally a technical problem that can be solved with better algorithms, or a strategic problem that requires changes in how AI is developed and governed. This tension is unlikely to resolve soon, and the field's vitality depends on maintaining multiple approaches that challenge each other's assumptions.