Building a computer system that never fails is impossible. Components wear out, software contains latent bugs, networks drop packets, and human operators make mistakes. The central question of reliability and fault tolerance is not how to eliminate failure entirely, but how to design systems that continue to deliver correct service despite faults, and how to recover quickly when they do fail. Over the past seven decades, researchers and engineers have developed eight major frameworks that approach this problem from different angles, each building on, narrowing, or reacting against the ideas that came before.
The earliest systematic approach to reliability was to duplicate or triplicate hardware components so that a single failure would not bring down the whole system. In the 1950s and 1960s, when computers used vacuum tubes and early transistors, component failure was frequent. Engineers responded with techniques such as triple modular redundancy (TMR), where three identical units perform the same computation and a majority voter masks any single faulty output. This framework treated faults as physical defects in hardware and assumed that failures were independent and random. Hardware redundancy was expensive—it roughly tripled the cost of a system—but it was the only option available when software was simple and ran on a single processor. The approach coexisted with the earliest mainframe designs and remained dominant through the 1960s, especially in aerospace and military applications where failure was unacceptable.
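The voting step in TMR can be sketched in a few lines. This is a software illustration only, not how a hardware voter is built; the names `majority_vote`, `tmr`, `healthy`, and `stuck_at_zero` are hypothetical and chosen for this example.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of units, or raise if none exists."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority: too many units disagree")
    return value


def tmr(units, x):
    """Run the redundant units on the same input and vote on the result."""
    return majority_vote([unit(x) for unit in units])


def healthy(x):
    return x * x


def stuck_at_zero(x):
    # Hypothetical faulty unit whose output is stuck at 0.
    return 0


# A single faulty unit is outvoted by the two healthy ones.
print(tmr([healthy, healthy, stuck_at_zero], 7))  # 49
```

With two faulty units that disagree with each other, no majority exists and the voter raises instead of guessing, which reflects the single-fault assumption behind TMR.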
By the 1960s, system designers realized that reliability alone was too narrow a goal. A system might be reliable in the sense of not producing wrong answers, yet still be unavailable for long periods during repair. The RAS framework (Reliability, Availability, and Serviceability) broadened the objective. Reliability meant continuous correct operation; availability measured the fraction of time the system was usable; serviceability captured how easily a system could be repaired or maintained. This framework did not replace hardware redundancy so much as complement it by adding new metrics and design goals. Mainframe vendors such as IBM built RAS into their product lines, introducing features like error-correcting memory, redundant power supplies, and hot-swappable components. The RAS framework gave engineers a vocabulary for trade-offs: a system that failed somewhat more often could still be more available if each repair was much faster, and a system could be made more serviceable by modularizing components even if that added cost. It remained the dominant industrial framework through the 1970s and influenced later thinking about system dependability.
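The trade-off between reliability and availability can be made concrete with a small calculation. The sketch below assumes the common steady-state formula, availability = MTBF / (MTBF + MTTR), which the text does not state; the function names and the sample figures are illustrative only.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: fraction of time the system is usable.

    Assumes the standard MTBF / (MTBF + MTTR) model (mean time between
    failures, mean time to repair); this formula is an assumption of the sketch.
    """
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_hours_per_year(avail, hours_per_year=8760):
    """Expected unusable hours in a year at the given availability."""
    return (1 - avail) * hours_per_year


# System A fails rarely but takes an hour to repair.
a = availability(mtbf_hours=1000, mttr_hours=1.0)
# System B fails twice as often but is repaired in six minutes.
b = availability(mtbf_hours=500, mttr_hours=0.1)

print(f"A: {a:.5f}  B: {b:.5f}")
print(f"A downtime/yr: {downtime_hours_per_year(a):.2f} h")
print(f"B downtime/yr: {downtime_hours_per_year(b):.2f} h")
```

Under these made-up numbers the less reliable system B is the more available one, because serviceability (a short repair time) outweighs its higher failure rate.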
As software grew larger and more complex, it became clear that hardware redundancy could not protect against design faults in programs. A duplicated processor running the same buggy software would produce the same wrong answer twice. Software fault tolerance emerged in the 1970s to address this gap. The key idea was to introduce diversity into the software itself. N-version programming, developed by Algirdas Avizienis and others, asked multiple independent teams to write separate implementations of the same specification; a voting mechanism then compared their outputs. Recovery blocks, another technique, let a primary module attempt an operation and fall back to an alternate implementation if the primary's result failed an acceptance test. This framework narrowed the focus of fault tolerance from hardware to software, but it coexisted with hardware redundancy rather than replacing it; many critical systems used both. Software fault tolerance was expensive in development cost and complexity, and it assumed that independent teams would make different mistakes. Later experiments, notably Knight and Leveson's 1986 study of N-version programming, showed that independently written versions often fail on the same inputs, so this assumption does not always hold.
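The recovery block pattern described above can be sketched as follows. This is a simplified illustration: the alternates here are pure functions, so there is no state to restore between attempts, whereas a real recovery block would roll back to a checkpoint before trying the next alternate. The names `recovery_block`, `buggy_sqrt`, and `sqrt_acceptance` are hypothetical.

```python
import math


def recovery_block(alternates, acceptance_test, x):
    """Try each alternate in order; return the first result that passes the test."""
    for alternate in alternates:
        try:
            result = alternate(x)
        except Exception:
            continue  # a crash counts as a failed attempt
        if acceptance_test(x, result):
            return result
    raise RuntimeError("all alternates failed the acceptance test")


def buggy_sqrt(x):
    # Hypothetical primary with a design fault: only correct for x == 4.
    return x / 2


def sqrt_acceptance(x, result):
    """Accept a result only if squaring it reproduces the input."""
    return abs(result * result - x) <= 1e-9 * max(1.0, x)


# The primary's answer (4.5) is rejected, so the alternate is used.
print(recovery_block([buggy_sqrt, math.sqrt], sqrt_acceptance, 9.0))  # 3.0
```

The quality of the acceptance test bounds what the scheme can catch: a wrong result that happens to pass the test is returned as if it were correct.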
At roughly the same time, the database community developed a different approach to reliability: transaction processing. Instead of masking faults with redundancy, this framework ensured that a sequence of operations either completed entirely or had no effect at all—the atomicity property. Combined with consistency, isolation, and durability (the ACID properties), transactions gave applications a clean abstraction for handling crashes and concurrency. The transaction framework absorbed the problem of fault tolerance into a broader model of data integrity. It did not compete directly with software fault tolerance; rather, it addressed a different pressure—the need for reliable updates in shared databases. Transaction processing became the backbone of banking, airline reservations, and other online transaction processing systems. Its influence persisted through the 1990s and beyond, though its strict consistency guarantees proved difficult to maintain in large-scale distributed systems.
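Atomicity is easy to see with a small bank-transfer example. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the `accounts` table, the `transfer` and `balances` helpers, and the sample figures are all assumptions made for illustration.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move amount between accounts atomically: both updates commit or neither does."""
    with conn:  # commits if the block succeeds, rolls back if it raises
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
        )
        row = conn.execute(
            "SELECT balance FROM accounts WHERE name = ?", (src,)
        ).fetchone()
        if row is None or row[0] < 0:
            raise ValueError("insufficient funds or unknown account")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
        )


def balances(conn):
    """Return the current balances as a dict keyed by account name."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

transfer(conn, "alice", "bob", 30)
print(balances(conn))  # {'alice': 70, 'bob': 80}

try:
    transfer(conn, "alice", "bob", 500)  # fails midway, after the debit
except ValueError:
    pass
print(balances(conn))  # unchanged: {'alice': 70, 'bob': 80}
```

The failed transfer debits the source account before the check raises, yet the rollback leaves no trace of the partial update, which is the "no effect at all" half of atomicity.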
In 1982, Leslie Lamport, Robert Shostak, and Marshall Pease posed a new problem: how can a group of computers reach agreement when some of them may be faulty and send conflicting or malicious information? They called this the Byzantine Generals Problem. Byzantine fault tolerance (BFT) extended earlier failure models by considering arbitrary faults, including malicious behavior, not just crashes or omission errors. The framework showed that, when messages are not cryptographically signed, agreement is possible only if fewer than one-third of the participants are faulty: tolerating f faulty nodes requires at least 3f + 1 nodes in total, and the original algorithm needs f + 1 rounds of message exchange. BFT was a significant departure from earlier frameworks because it assumed the worst-case behavior of faulty nodes, not just random hardware failures or independent software bugs. For many years, BFT protocols were considered too expensive for practical use because of the number of messages and communication rounds they required. The framework remained largely theoretical until the late 1990s, when researchers began building practical BFT systems such as PBFT (Practical Byzantine Fault Tolerance).
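The one-third bound translates directly into sizing arithmetic: "fewer than one-third faulty" means a system of n nodes tolerates f faults only when n >= 3f + 1. The helper names below are hypothetical, and the sketch covers only the counting argument, not an agreement protocol.

```python
def max_byzantine_faults(n):
    """Largest f such that f is strictly less than one-third of n nodes."""
    return (n - 1) // 3


def min_nodes(f):
    """Smallest cluster size that tolerates f Byzantine nodes."""
    return 3 * f + 1


for n in (3, 4, 7, 10):
    print(n, "nodes tolerate", max_byzantine_faults(n), "Byzantine fault(s)")
```

Three nodes tolerate no Byzantine faults at all under this bound; the smallest useful configuration is four nodes tolerating one.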
State machine replication (SMR) provided a general method for making a service fault-tolerant by replicating its state across multiple servers and ensuring that all replicas execute the same commands in the same order. The framework drew on the agreement protocols studied for Byzantine fault tolerance, but it also worked with simpler crash-failure models. SMR transformed the problem of fault tolerance into a problem of ordering: if every replica starts from the same initial state, behaves deterministically, and sees the same sequence of operations, all replicas will reach the same final state. This approach absorbed the Byzantine fault tolerance framework's insights about agreement while narrowing its scope to a practical architecture for replicated services. Systems such as Google's Chubby lock service and Apache ZooKeeper apply this idea, using the Paxos and Zab protocols respectively to agree on the order of operations, and provide reliable coordination in large-scale distributed systems. The framework coexisted with transaction processing: both ensured consistency, but SMR focused on replicating state across machines rather than on atomic updates to a single database.
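The ordering argument can be illustrated with a toy key-value state machine. The sketch assumes an agreement protocol has already produced a single ordered log; it shows only the deterministic replay step, and the `KVStateMachine` class and command format are hypothetical.

```python
class KVStateMachine:
    """A deterministic key-value store driven entirely by its command log."""

    def __init__(self):
        self.state = {}

    def apply(self, command):
        op, key, *args = command
        if op == "set":
            self.state[key] = args[0]
        elif op == "incr":
            self.state[key] = self.state.get(key, 0) + args[0]
        elif op == "delete":
            self.state.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")


def replay(log):
    """Build a fresh replica and apply the log in order; return its final state."""
    machine = KVStateMachine()
    for command in log:
        machine.apply(command)
    return machine.state


# Assumed to be the order agreed on by the replicas.
log = [("set", "x", 1), ("incr", "x", 5), ("set", "y", 2), ("delete", "y")]

replicas = [replay(log) for _ in range(3)]
print(all(r == replicas[0] for r in replicas))  # True: same log, same state

# A replica that sees the same commands in a different order diverges.
print(replay([("incr", "x", 5), ("set", "x", 1)]) == replay([("set", "x", 1), ("incr", "x", 5)]))  # False
```

The second comparison is why agreement on order, and not merely on the set of commands, is the core requirement.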
By the early 2000s, a different insight was gaining traction: no matter how much redundancy and fault tolerance you build in, failures will still happen, and the most important metric is how quickly the system can recover. Recovery-oriented computing (ROC), developed at Stanford and UC Berkeley, shifted the emphasis from preventing faults to minimizing recovery time. ROC introduced techniques such as microrebooting—restarting only a small component of a system instead of the whole machine—and undoable operations that could be rolled back after a failure. This framework reacted against the assumption that hardware redundancy and software fault tolerance could make systems sufficiently reliable. Instead, it argued that human error was the dominant cause of outages and that systems should be designed to recover quickly from operator mistakes. ROC coexists with earlier frameworks: a modern cloud service still uses hardware redundancy and transaction processing, but it also invests heavily in fast recovery mechanisms. The framework remains active today, especially in the design of internet services where availability is paramount.
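Microrebooting can be sketched as a supervisor that rebuilds only the component that failed and leaves the rest of the system running. This is a toy illustration under strong assumptions: the `Supervisor`, `Cart`, and `Catalog` classes are hypothetical, components are plain in-process objects, and the restarted component simply loses its in-memory state.

```python
class Supervisor:
    """Holds named components and restarts one in isolation when it fails."""

    def __init__(self, factories):
        self.factories = dict(factories)
        self.components = {name: make() for name, make in self.factories.items()}
        self.restarts = {name: 0 for name in self.factories}

    def call(self, name, request):
        try:
            return self.components[name].handle(request)
        except Exception:
            # Microreboot: rebuild only the failed component, then retry once.
            self.components[name] = self.factories[name]()
            self.restarts[name] += 1
            return self.components[name].handle(request)


class Cart:
    def __init__(self):
        self.items = []
        self.corrupted = False

    def handle(self, request):
        if self.corrupted:
            raise RuntimeError("corrupted in-memory state")
        self.items.append(request)
        return len(self.items)


class Catalog:
    def __init__(self):
        self.lookups = 0

    def handle(self, request):
        self.lookups += 1
        return f"item:{request}"


sup = Supervisor({"cart": Cart, "catalog": Catalog})
sup.call("catalog", "book")
sup.call("cart", "book")
sup.components["cart"].corrupted = True  # simulate a fault in one component
print(sup.call("cart", "pen"))           # 1: the cart was rebuilt and starts empty
print(sup.restarts)                      # {'cart': 1, 'catalog': 0}
```

The catalog keeps its state and is never restarted, which is the point of recovering at component granularity; a fuller design would keep important state outside the component so that a restart does not lose it.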
The most recent framework, chaos engineering, takes the recovery-oriented philosophy one step further by deliberately injecting failures into production systems to test their resilience. Pioneered at Netflix in the early 2010s with tools such as Chaos Monkey, chaos engineering treats reliability as an empirical property that must be continuously validated rather than assumed from design. Engineers run experiments (killing a server, introducing network latency, or corrupting data) and observe whether the system degrades gracefully. This practice transformed fault tolerance from a static property designed upfront into a dynamic discipline of hypothesis testing and learning. Chaos engineering does not replace recovery-oriented computing; it complements it by providing a way to verify that recovery mechanisms actually work under realistic conditions. The two frameworks together represent a living tradition in which reliability is understood as an ongoing process rather than a finished state.
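The experimental loop can be illustrated with a small in-process simulation: inject faults into a dependency at a chosen rate, then measure whether the caller's recovery logic keeps the success rate where the hypothesis says it should be. Everything here is a stand-in, including the `FlakyBackend` class, the retry policy, and the failure rates; real chaos experiments target running infrastructure, not a simulated function call.

```python
import random


class FlakyBackend:
    """Hypothetical dependency wrapped with probabilistic fault injection."""

    def __init__(self, failure_rate, rng):
        self.failure_rate = failure_rate
        self.rng = rng

    def fetch(self, key):
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected fault")
        return f"value-for-{key}"


def fetch_with_retry(backend, key, attempts=3):
    """Recovery mechanism under test: retry, then degrade to None."""
    for _ in range(attempts):
        try:
            return backend.fetch(key)
        except ConnectionError:
            continue
    return None


def run_experiment(failure_rate, requests=1000, seed=0, attempts=3):
    """Return the fraction of requests served while faults are being injected."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    backend = FlakyBackend(failure_rate, rng)
    served = sum(
        fetch_with_retry(backend, i, attempts) is not None for i in range(requests)
    )
    return served / requests


# Hypothesis: with 20% injected failures, retries keep the success rate above 95%.
rate = run_experiment(failure_rate=0.2)
print(f"success rate under injected faults: {rate:.3f}")
print("hypothesis holds" if rate > 0.95 else "hypothesis refuted")
```

The value of the exercise lies in stating the hypothesis before the run; a refuted hypothesis points to a recovery mechanism that does not behave as designed.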
Today, the leading frameworks—recovery-oriented computing and chaos engineering—agree on several points. Both accept that failures are inevitable and that the primary goal should be fast recovery rather than perfect prevention. Both emphasize the importance of human operators and the need to design systems that tolerate operator mistakes. Both treat reliability as a property that must be tested and maintained over time, not a one-time achievement. They disagree, however, on the role of experimentation in production. Recovery-oriented computing focuses on building recovery mechanisms and assumes that if they are well-designed, the system will recover. Chaos engineering insists that those mechanisms must be continuously challenged with real failures because assumptions about failure modes are often wrong. The older frameworks—hardware redundancy, RAS, software fault tolerance, transaction processing, Byzantine fault tolerance, and state machine replication—remain in use where their assumptions hold. Hardware redundancy still protects against component failures in safety-critical systems. Transaction processing and state machine replication provide the foundation for databases and distributed coordination. The division of labor is clear: older frameworks handle well-understood failure modes, while the newer frameworks address the complexity and unpredictability of large-scale, rapidly changing systems.