System safety emerged from a practical pressure that still defines it: how do you prevent catastrophic failures in complex engineered systems—missiles, nuclear reactors, aircraft, medical devices—where a single mistake can kill people and destroy billions in investment? The answer has never been a fixed technique. Over seven decades, system safety has produced a sequence of analytical frameworks, each responding to the blind spots of its predecessors while often preserving their useful tools. The result is a field that today contains multiple, partly competing approaches that coexist in different industries and regulatory regimes.
The earliest systematic method, Failure Mode and Effects Analysis (FMEA), appeared in the late 1940s. FMEA is an inductive, bottom-up technique: it lists every component in a system, imagines how that component could fail, and traces the local consequences of that failure. Its unit of analysis is the component, and its logic is a simple cause–effect chain. FMEA was effective for hardware-dominated systems where failures were mechanical or electrical, but it struggled with combinations of failures and with systems where multiple components interact in unexpected ways.
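The bottom-up logic described above can be sketched as a small worksheet builder. This is a minimal illustration, not a standard FMEA template: the cooling-loop components, failure modes, and effects are hypothetical, as are the names `WorksheetRow`, `build_worksheet`, and `components_by_effect`.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class WorksheetRow:
    component: str
    failure_mode: str
    local_effect: str


# Hypothetical cooling-loop catalog: component -> [(failure mode, local effect)].
CATALOG = {
    "pump": [("fails to start", "no coolant flow"),
             ("seal leak", "coolant inventory lost")],
    "valve": [("stuck closed", "no coolant flow"),
              ("stuck open", "loop cannot be isolated")],
    "pressure sensor": [("reads low", "spurious low-pressure alarm")],
}


def build_worksheet(catalog):
    """Bottom-up pass: one row per (component, failure mode) pair."""
    return [
        WorksheetRow(component, mode, effect)
        for component, modes in catalog.items()
        for mode, effect in modes
    ]


def components_by_effect(rows):
    """Group rows by effect to see which single failures share a consequence."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row.local_effect].append(row.component)
    return dict(grouped)


rows = build_worksheet(CATALOG)
for row in rows:
    print(f"{row.component:16} | {row.failure_mode:15} | {row.local_effect}")
```

Each row considers one failure in isolation. Nothing in the worksheet represents the pump and the valve failing together, which is the limitation with combinations noted above.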
Fault Tree Analysis (FTA), developed at Bell Labs in 1961 for the Minuteman missile program, approached the same problem from the opposite direction. FTA is deductive and top-down: it starts with a top-level undesired event (an explosion, a loss of containment) and works backward through Boolean logic gates to identify the combinations of basic events that could produce it. Where FMEA asks "what happens if this part fails?", FTA asks "what would have to go wrong for this disaster to occur?" The two methods are complementary rather than competitive. Practitioners routinely use FTA to identify critical failure paths and then use FMEA to ensure that every component on those paths has been examined. Event Tree Analysis (ETA), developed around 1970, extended the logic forward: it starts with an initiating event and branches through the success or failure of safety barriers, mapping the possible outcomes. Together, FMEA, FTA, and ETA gave engineers a toolkit for modeling accident sequences as chains of discrete events.
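The top-down logic of FTA can be sketched by expanding a small tree of AND and OR gates into its minimal cut sets, the smallest combinations of basic events that produce the top event. The tree below is hypothetical, and the tuple-based representation and the function names `cut_sets` and `minimal` are assumptions made for this sketch rather than any tool's format.

```python
from itertools import product

# A basic event is a string; a gate is ("AND" | "OR", [children]).
# Hypothetical top event: loss of cooling.
TREE = ("OR", [
    ("AND", ["pump_A_fails", "pump_B_fails"]),
    "power_lost",
    ("AND", ["pump_A_fails", ("OR", ["valve_stuck", "power_lost"])]),
])


def cut_sets(node):
    """Return every cut set of a node as a set of frozensets of basic events."""
    if isinstance(node, str):
        return {frozenset([node])}
    gate, children = node
    child_sets = [cut_sets(child) for child in children]
    if gate == "OR":
        # Any child's cut set is enough to trigger an OR gate.
        return set().union(*child_sets)
    if gate == "AND":
        # An AND gate needs one cut set from every child at the same time.
        return {frozenset().union(*combo) for combo in product(*child_sets)}
    raise ValueError(f"unknown gate type: {gate}")


def minimal(sets):
    """Drop any cut set that strictly contains another cut set."""
    return {s for s in sets if not any(other < s for other in sets)}


for cs in sorted(minimal(cut_sets(TREE)), key=lambda s: (len(s), sorted(s))):
    print(sorted(cs))
```

In this example the combination of `pump_A_fails` and `power_lost` is discarded because `power_lost` alone already causes the top event. A single-event cut set like that one marks a single point of failure.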
By the mid-1970s, the nuclear power industry faced a regulatory demand that earlier qualitative methods could not satisfy: regulators wanted to know not just what could go wrong, but how likely it was. Probabilistic Risk Assessment (PRA), formalized in the 1975 Reactor Safety Study (WASH-1400), integrated FTA and ETA into a quantitative framework. PRA assigns probabilities to basic events—component failure rates, human error probabilities—and propagates them through fault trees and event trees to compute the overall probability of a core damage accident. This quantification made risk visible to decision-makers and regulators, but it also introduced new assumptions: that failure probabilities are stable, that events are independent, and that the model captures all significant scenarios. PRA became the dominant framework in nuclear engineering and later spread to aerospace, chemical processing, and other high-hazard industries. It did not replace FMEA or FTA; it absorbed them as analytical engines within a larger probabilistic structure.
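The propagation step can be illustrated with a small sketch that quantifies a fault tree and feeds the result into an event tree. All numbers, event names, and function names here are hypothetical. The gate formulas are exact only under the assumptions named above (independent basic events) plus one more made for this sketch: no basic event appears twice in the tree. The event tree is also simplified in that every barrier is branched on regardless of earlier outcomes.

```python
from itertools import product


def node_probability(node, p):
    """Probability of a fault-tree node, assuming independent, non-repeated events."""
    if isinstance(node, str):
        return p[node]
    gate, children = node
    probs = [node_probability(child, p) for child in children]
    if gate == "AND":
        result = 1.0
        for q in probs:
            result *= q
        return result
    if gate == "OR":
        none_occur = 1.0
        for q in probs:
            none_occur *= 1.0 - q
        return 1.0 - none_occur
    raise ValueError(f"unknown gate type: {gate}")


def event_tree(init_freq, barriers):
    """Map each success/failure pattern of the barriers to a frequency per year.

    barriers is a list of (name, failure_probability); True means the barrier works.
    """
    sequences = {}
    for outcome in product((True, False), repeat=len(barriers)):
        freq = init_freq
        for (_, p_fail), works in zip(barriers, outcome):
            freq *= (1.0 - p_fail) if works else p_fail
        sequences[outcome] = freq
    return sequences


# Hypothetical fault tree for failure of emergency cooling on demand.
COOLING_TREE = ("AND", [("OR", ["pump_A_fails", "pump_A_power_lost"]), "pump_B_fails"])
BASIC = {"pump_A_fails": 0.01, "pump_A_power_lost": 0.001, "pump_B_fails": 0.02}

p_cooling = node_probability(COOLING_TREE, BASIC)
sequences = event_tree(0.1, [("emergency_cooling", p_cooling), ("containment", 0.05)])
print(f"P(cooling fails on demand) = {p_cooling:.3e}")
print(f"frequency of both barriers failing = {sequences[(False, False)]:.3e} per year")
```

The sequence frequencies always sum to the initiating-event frequency, which is a useful sanity check on the arithmetic. It says nothing about whether the model captures every significant scenario.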
By the 1990s, a different pressure was building. Regulators in the UK and Australia, particularly in defense and offshore oil, found that a collection of PRA results, FTA diagrams, and test reports did not by itself demonstrate that a system was acceptably safe. What was missing was a structured argument linking evidence to a safety claim. The Safety Case framework, emerging around 1990, addressed this gap. A Safety Case is a documented body of evidence and reasoning that argues a system is safe for a given application in a given environment. It does not replace PRA or FTA; it organizes their outputs into a coherent argument, often using graphical notations such as the Goal Structuring Notation (GSN). The Safety Case framework shifted the focus from producing analyses to justifying them—making the reasoning behind safety claims explicit and auditable. It has since become mandatory in several regulated industries, including UK defense, European rail, and offshore oil and gas.
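The idea of linking claims to evidence can be sketched as a tree of goals, strategies, and solutions (the GSN term for items of evidence), with a check for claims that no evidence supports. This is a rough illustration, not an implementation of the GSN standard: the `Node` class, the `undeveloped` function, and the example argument are all hypothetical, and other GSN elements such as context and assumptions are omitted.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str   # "goal", "strategy", or "solution"
    text: str
    children: list = field(default_factory=list)


def undeveloped(node):
    """Return the text of every leaf that is a claim or strategy rather than evidence."""
    if not node.children:
        return [] if node.kind == "solution" else [node.text]
    found = []
    for child in node.children:
        found.extend(undeveloped(child))
    return found


# Hypothetical safety argument for a cooling system.
ARGUMENT = Node("goal", "Cooling system is acceptably safe in its operating environment", [
    Node("strategy", "Argue over each identified hazard", [
        Node("goal", "Loss of cooling is adequately mitigated", [
            Node("solution", "Fault tree analysis report"),
            Node("solution", "Pump endurance test results"),
        ]),
        Node("goal", "Overpressure is adequately mitigated"),
    ]),
])

for claim in undeveloped(ARGUMENT):
    print("No evidence yet for:", claim)
```

The check only shows that every claim has some evidence attached. Whether that evidence is sufficient remains a matter of judgment for the reviewer or regulator.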
By the early 2000s, a deeper limitation of the entire event-chain tradition had become visible. FMEA, FTA, ETA, and PRA all model accidents as sequences of component failures or human errors. But many modern accidents, among them the 1996 Ariane 5 explosion, the 2003 Northeast blackout, and later the 2010 Deepwater Horizon blowout, did not involve component failures in the traditional sense. They involved software errors, dysfunctional interactions between normally functioning components, and the gradual erosion of safety constraints in complex socio-technical systems. Nancy Leveson's System-Theoretic Accident Model and Processes (STAMP), introduced in 2002, offered a fundamentally different ontology. STAMP treats safety not as a property of components but as a control problem: accidents occur when system-level safety constraints are violated because of inadequate control or enforcement. The unit of analysis is the control loop, not the component. STAMP's associated analysis method, System-Theoretic Process Analysis (STPA), identifies unsafe control actions and the scenarios that could lead to them. Where FTA would ask "which component failures caused the accident?", STPA asks "which control actions were missing or incorrectly provided, and why?" STAMP does not reject event-chain methods as useless; it argues that they are incomplete for software-intensive, tightly coupled systems. In practice, STPA has been adopted in aerospace, defense, and automotive safety, often alongside traditional FTA and FMEA. ISO 21448, the road-vehicle standard on safety of the intended functionality, refers to it as one of the applicable analysis methods.
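The search for unsafe control actions can be sketched as a cross product of control actions, the ways an action can be unsafe, and operating contexts. The four categories below follow the guide phrases commonly used in STPA practice. The controller, action, and contexts are hypothetical, and `candidate_ucas` is a name invented for this sketch.

```python
from itertools import product

# The four ways a control action can be unsafe, as commonly used in STPA.
UCA_TYPES = [
    "not provided",
    "provided",
    "provided too early, too late, or out of order",
    "stopped too soon or applied too long",
]


def candidate_ucas(controller, control_actions, contexts):
    """Cross every control action with every UCA type and context.

    The output is a list of prompts for the analyst, who decides which
    combinations actually lead to a hazard.
    """
    return [
        f"{controller}: '{action}' {uca} when {context}"
        for action, uca, context in product(control_actions, UCA_TYPES, contexts)
    ]


# Hypothetical braking controller.
prompts = candidate_ucas(
    "Brake controller",
    ["apply brake"],
    ["an obstacle is ahead", "the road is clear"],
)
for prompt in prompts:
    print(prompt)
```

A prompt such as "'apply brake' not provided when an obstacle is ahead" can be hazardous even when every component works as designed, for example if the controller's model of the road is wrong. That is the kind of scenario event-chain methods tend to miss.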
The most recent major shift in system safety thinking began around 2006 with Resilience Engineering. This framework emerged from the observation that high-hazard systems such as aviation, nuclear power, and healthcare rarely fail, even though they operate under constant variability and pressure. Resilience Engineering argues that safety is not primarily the absence of failure but the ability to anticipate, monitor, respond to, and learn from disturbances. It studies how organizations and people adapt to real-world conditions that differ from procedures and designs. The focus moves from preventing rare catastrophes to sustaining normal performance under uncertainty.
Safety-II, articulated by Erik Hollnagel in 2014, sharpened this perspective into a distinct epistemological claim. Safety-II argues that the traditional approach (now called Safety-I) treats safety as the absence of negative events and studies only incidents and accidents. Safety-II instead studies everyday performance, meaning the vast majority of operations that go well, and treats safety as the presence of adaptive capacity. Where Safety-I asks "why did it fail?", Safety-II asks "why does it usually succeed?" Safety-II shares the systemic, socio-technical orientation of Resilience Engineering, but it makes a stronger claim: that the mechanisms that produce success are not simply the inverse of those that produce failure, and that learning from normal work is therefore essential. Resilience Engineering and Safety-II are not replacements for PRA or STAMP; they operate at a different level, focusing on organizational and operational dynamics rather than on design-time analysis. They have been influential in healthcare, air traffic management, and process safety, where human adaptation is central to system performance.
Today, system safety contains multiple active frameworks that serve different purposes and coexist in different sectors. FMEA, FTA, and ETA remain the workhorses of hardware safety analysis in automotive, aerospace, and industrial equipment. PRA is the regulatory standard for nuclear power and is widely used in space launch and chemical process safety. Safety Cases are mandatory in several regulated industries, including UK defense, European rail, and offshore oil and gas, and they increasingly incorporate outputs from PRA, FTA, and STPA as evidence. STAMP/STPA is growing rapidly in software-intensive domains, especially autonomous systems, where event-chain models on their own miss software errors and unsafe interactions between components that have not failed. Resilience Engineering and Safety-II are shaping safety management practices in healthcare, aviation operations, and high-reliability organizations, often complementing rather than replacing design-stage analyses.
Practitioners across the leading frameworks today broadly agree on several points: safety is an emergent system property, not a sum of component reliabilities; human error is a symptom of system design, not a root cause; and safety analysis must address software, human behavior, and organizational factors, not just hardware failures. They disagree sharply on what the fundamental unit of analysis should be, whether component failure (FTA), event probability (PRA), control constraint (STAMP), or adaptive capacity (Safety-II), and on whether quantification is essential or misleading. This productive tension means that a safety engineer today must be fluent in multiple frameworks and know which one fits the system, the hazard, and the regulatory context. The field has not converged on a single method, and it probably never will: the systems it studies are too diverse for that.
In practice, the choice of framework depends on the industry and the regulatory environment. Nuclear power relies on PRA because regulators require quantified risk numbers. Aerospace uses a mix of FTA, FMEA, and increasingly STPA for software-intensive subsystems. The automotive functional safety standard, ISO 26262, relies on FTA and FMEA, while ISO 21448, which covers safety of the intended functionality and matters most for driver assistance and automated driving, refers to STPA as one of the applicable analysis methods. Healthcare and aviation operations draw on Resilience Engineering and Safety-II to manage the gap between work-as-imagined and work-as-done. Defense and rail require Safety Cases that integrate evidence from multiple analyses. The frameworks are not in a winner-take-all competition; they are a toolkit, and the skill of the safety engineer lies in selecting and combining them appropriately.