Misplaced Pages

Byzantine fault

Article snapshot taken from Wikipedia under the Creative Commons Attribution-ShareAlike license.

A Byzantine fault is a condition of a system, particularly a distributed computing system, where a fault occurs such that different symptoms are presented to different observers, including imperfect information on whether a system component has failed. The term takes its name from an allegory, the "Byzantine generals problem", developed to describe a situation in which, to avoid catastrophic failure of a system, the system's actors must agree on a strategy, but some of these actors are unreliable in such a way as to cause other (good) actors to disagree on the strategy, and these good actors may be unaware of the disagreement.

79-403: A Byzantine fault is also known as a Byzantine generals problem, a Byzantine agreement problem, or a Byzantine failure. Byzantine fault tolerance (BFT) is the resilience of a fault-tolerant computer system or similar system to such conditions. A Byzantine fault is any fault presenting different symptoms to different observers. A Byzantine failure is the loss of a system service due to

158-409: A blockchain with proof-of-work allowing the system to overcome Byzantine failures and reach a coherent global view of the system's state. Some proof of stake blockchains also use BFT algorithms. Byzantine Fault Tolerance (BFT) is a crucial concept in blockchain technology , ensuring that a network can continue to function even when some nodes (participants) fail or act maliciously. This tolerance

237-444: A fail-fast component is designed to report at the first point of failure, rather than generating reports when downstream components fail. This allows easier diagnosis of the underlying problem, and may prevent improper operation in a broken state. A single fault condition is a situation where one means for protection against a hazard is defective. If a single fault condition results unavoidably in another single fault condition,

316-415: A minimal layout, to ensure wide accessibility and outreach , such as on game consoles with limited web browsing capabilities. A highly fault-tolerant system might continue at the same level of performance even though one or more components have failed. For example, a building with a backup electrical generator will provide the same voltage to wall outlets even if the grid power fails. A system that

395-445: A rout , and would be worse than either a coordinated attack or a coordinated retreat. The problem is complicated by the presence of treacherous generals who may not only cast a vote for a suboptimal strategy; they may do so selectively. For instance, if nine generals are voting, four of whom support attacking while four others are in favor of retreat, the ninth general may send a vote of retreat to those generals in favor of retreat, and

474-416: A Byzantine fault in systems that require consensus among multiple components. The Byzantine allegory considers a number of generals who are attacking a fortress. The generals must decide as a group whether to attack or retreat; some may prefer to attack, while others prefer to retreat. The important thing is that all generals agree on a common decision, for a halfhearted attack by a few generals would become

553-501: A city. In its original version, the story cast the generals as commanders of the Albanian army. The name was changed, eventually settling on "Byzantine", at the suggestion of Jack Goldberg to future-proof any potential offense-giving. This formulation of the problem, together with some additional results, was presented by the same authors in their 1982 paper, "The Byzantine Generals Problem". The objective of Byzantine fault tolerance

632-575: A consensus quickly and securely. These networks often use BFT protocols to enhance performance and security. Some aircraft systems, such as the Boeing 777 Aircraft Information Management System (via its ARINC 659 SAFEbus network), the Boeing 777 flight control system, and the Boeing 787 flight control systems, use Byzantine fault tolerance; because these are real-time systems, their Byzantine fault tolerance solutions must have very low latency. For example, SAFEbus can achieve Byzantine fault tolerance within

711-404: A conspiracy of n faulty computers could not "thwart" the efforts of the correctly-operating ones to reach consensus. Shostak showed that a minimum of 3n + 1 are needed, and devised a two-round 3n + 1 messaging protocol that would work for n = 1. His colleague Marshall Pease generalized the algorithm for any n > 0, proving that 3n + 1 is both necessary and sufficient. These results, together with

790-410: A different type of fracture surface, and other indicators near the fracture surface(s). The way the product is loaded, and the loading history are also important factors which determine the outcome. Of critical importance is design geometry because stress concentrations can magnify the applied load locally to very high levels, and from which cracks usually grow. Over time, as more is understood about

869-662: A failure, the failure cause evolves from a description of symptoms and outcomes (that is, effects) to a systematic and relatively abstract model of how, when, and why the failure comes about (that is, causes). The more complex the product or situation, the more necessary a good understanding of its failure cause is to ensuring its proper operation (or repair). Cascading failures , for example, are particularly complex failure causes. Edge cases and corner cases are situations in which complex, unexpected, and difficult-to-debug problems often occur. Materials can be degraded by their environment by corrosion processes, such as rusting in

948-572: A flat piece of road on which to stop. Alternatively, on shallow gradients, the transmission can be shifted into Park, Reverse or First gear, and the transmission lock / engine compression used to hold it stationary, as there is no need for them to include the sophistication to first bring it to a halt. On motorcycles, a similar level of fail-safety is provided by simpler methods; first, the front and rear brake systems are entirely separate, regardless of their method of activation (that can be cable, rod or hydraulic), allowing one to fail entirely while leaving

1027-404: A hardware level. The figure of merit is called availability and is expressed as a percentage. For example, a five nines system would statistically provide 99.999% availability. Fault-tolerant systems are typically based on the concept of redundancy. Research into the kinds of tolerances needed for critical systems involves a large amount of interdisciplinary work. The more complex the system,
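The availability percentage above translates directly into a yearly downtime budget. The following is a minimal sketch of that arithmetic, assuming a 365-day year; the function name `allowed_downtime_minutes` is illustrative and does not come from the article.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for a given availability."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


for label, pct in [("three nines", 99.9), ("five nines", 99.999)]:
    print(f"{label}: {allowed_downtime_minutes(pct):.2f} minutes per year")
```

Under these assumptions, a five nines system may be unavailable for a little over five minutes per year.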

1106-473: A later proof by Leslie Lamport of the sufficiency of 3n using digital signatures, were published in the seminal paper, Reaching Agreement in the Presence of Faults. The authors were awarded the 2005 Edsger W. Dijkstra Prize for this paper. To make the interactive consistency problem easier to understand, Lamport devised a colorful allegory in which a group of army generals formulate a plan for attacking

1185-412: A number of penalties: increase in weight, size, power consumption, cost, as well as time to design, verify, and test. Therefore, a number of choices have to be examined to determine which components should be fault tolerant: An example of a component that passes all the tests is a car's occupant restraint system. While the primary occupant restraint system is not normally thought of, it is gravity . If

1264-526: A role at the same time. They include corrosion, welding of contacts due to an abnormal electric current, return spring fatigue failure, unintended command failure, dust accumulation and blockage of mechanism, etc. Only rarely can a single cause (hazard) be identified as creating a system failure. The real root causes can, in theory, in most cases be traced back to some kind of human error, e.g. design failure, operational errors, management failures, maintenance-induced failures, specification failures, etc. A scenario

1343-451: A single bit of information if self-checking pairs are used for nodes) to other recipients of that incoming message. All these mechanisms make the assumption that the act of repeating a message blocks the propagation of Byzantine symptoms. For systems that have a high degree of safety or security criticality, these assumptions must be proven to be true to an acceptable level of fault coverage . When providing proof through testing, one difficulty

1422-404: A sufficient number of accurately-operating components to maintain the service. When considering failure propagation only via errors, Byzantine failures are considered the most general and most difficult class of failures among the failure modes . The so-called fail-stop failure mode occupies the simplest end of the spectrum. Whereas the fail-stop failure mode simply means that the only way to fail

1501-468: A system's capability to handle faults without any degradation or downtime. In the event of an error, end-users remain unaware of any issues. Conversely, a system that experiences errors with some interruption in service or graceful degradation of performance is termed 'resilient'. In resilience, the system adapts to the error, maintaining service but acknowledging a certain impact on performance. Typically, fault tolerance describes computer systems , ensuring

1580-439: A vote of attack to the rest. Those who received a retreat vote from the ninth general will retreat, while the rest will attack (which may not go well for the attackers). The problem is complicated further by the generals being physically separated and having to send their votes via messengers who may fail to deliver votes or may forge false votes. Byzantine fault tolerance can be achieved if the number of loyal (non-faulty) generals
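The nine-general scenario in the allegory can be replayed as a small simulation. This is a sketch under stated assumptions, not a protocol: each loyal general simply takes the majority of the nine votes it receives, messengers are reliable, and the names `decide` and `run_round` are hypothetical.

```python
from collections import Counter


def decide(votes):
    """Majority decision over a list of 'attack'/'retreat' votes."""
    tally = Counter(votes)
    return "attack" if tally["attack"] > tally["retreat"] else "retreat"


def run_round():
    # Hypothetical setup mirroring the allegory: generals 0-3 favor attack,
    # generals 4-7 favor retreat, and general 8 is the traitor.
    honest = {g: "attack" for g in range(4)}
    honest.update({g: "retreat" for g in range(4, 8)})

    decisions = {}
    for receiver, own_vote in honest.items():
        # Every loyal general sees the same eight honest votes (its own included).
        received = list(honest.values())
        # The traitor tells each receiver whatever that receiver already prefers.
        received.append(own_vote)
        decisions[receiver] = decide(received)
    return decisions


decisions = run_round()
print(decisions)
```

In this run the four attack-leaning generals each count five attack votes and the four retreat-leaning generals each count five retreat votes, so the loyal generals split, which is the outcome the allegory warns about.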

1659-409: A warning to the operator, and it is still the most common form of level one fault-tolerant design in use today. Voting was another initial method, as discussed above, with multiple redundant backups operating constantly and checking each other's results. For example, if four components reported an answer of 5 and one component reported an answer of 6, the other four would "vote" that the fifth component

1738-470: Is pair-and-spare . Two replicated elements operate in lockstep as a pair, with a voting circuit that detects any mismatch between their operations and outputs a signal indicating that there is an error. Another pair operates exactly the same way. A final circuit selects the output of the pair that does not proclaim that it is in error. Pair-and-spare requires four replicas rather than the three of TMR, but has been used commercially. Failure-oblivious computing
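The pair-and-spare selection logic can be sketched in a few lines. This is an illustrative software model only, assuming each replica's output is a comparable value; real implementations are hardware circuits, and the function names here are hypothetical.

```python
def pair_output(a, b):
    """One lockstep pair: return (value, error_flag); a mismatch raises the flag."""
    return (a, False) if a == b else (None, True)


def pair_and_spare(primary, spare):
    """Select the output of the first pair that does not report an error."""
    for replica_a, replica_b in (primary, spare):
        value, error = pair_output(replica_a, replica_b)
        if not error:
            return value
    # Assumption: with both pairs in error there is nothing safe to output.
    raise RuntimeError("both pairs report an error")


print(pair_and_spare((1, 2), (3, 3)))  # primary pair disagrees, spare is used
```

The sketch uses four replica outputs in total, two per pair, matching the replica count stated above.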

1817-408: Is a node crash, detected by other nodes, Byzantine failures imply no restrictions on what errors can be created, which means that a failed node can generate arbitrary data, including data that makes it appear like a functioning node to a subset of other nodes. Thus, Byzantine failures can confuse failure detection systems, which makes fault tolerance difficult. Despite the allegory, a Byzantine failure

1896-516: Is a difference between fault tolerance and systems that rarely have problems. For instance, the Western Electric crossbar systems had failure rates of two hours per forty years, and therefore were highly fault resistant . But when a fault did occur they still stopped operating completely, and therefore were not fault tolerant . Failure cause Failure causes are defects in design, process, quality, or part application, which are

1975-403: Is a technique that enables computer programs to continue executing despite errors. The technique can be applied in different contexts. It can handle invalid memory reads by returning a manufactured value to the program, which in turn makes use of the manufactured value and ignores the former memory value it tried to access. This is in sharp contrast to typical memory checkers, which inform

2054-409: Is a technique to avoid catastrophic failures in distributed systems. Redundancy is the provision of functional capabilities that would be unnecessary in a fault-free environment. This can consist of backup components that automatically "kick in" if one component fails. For example, large cargo trucks can lose a tire without any major consequences. They have many tires, and no one tire is critical (with

2133-618: Is available to stream an online video, a lower-resolution version might be streamed in place of the high-resolution version. Progressive enhancement is another example, where web pages are available in a basic functional format for older, small-screen, or limited-capability web browsers, but in an enhanced version for browsers capable of handling additional technologies or that have a larger display. In fault-tolerant computer systems, programs that are considered robust are designed to continue operation despite an error, exception, or invalid input, instead of crashing completely. Software brittleness

2212-422: Is being used as a one-time backup for the footbrake, will not cause immediate danger if it is found to be nonfunctional at the moment of application. Therefore, no redundancy is built into it per se (and it typically uses a cheaper, lighter, but less hardwearing cable actuation system), and it can suffice, if this happens on a hill, to use the footbrake to momentarily hold the vehicle still, before driving off to find

2291-407: Is creating a sufficiently wide range of signals with Byzantine symptoms. Such testing will likely require specialized fault injectors . Byzantine errors were observed infrequently and at irregular points during endurance testing for the newly constructed Virginia class submarines , at least through 2005 (when the issues were publicly reported). The Bitcoin network works in parallel to generate

2370-412: Is designed to fail safe , or fail-secure, or fail gracefully , whether it functions at a reduced level or fails completely, does so in a way that protects people, property, or data from injury, damage, intrusion, or disclosure. In computers, a program might fail-safe by executing a graceful exit (as opposed to an uncontrolled crash) to prevent data corruption after an error occurs. A similar distinction

2449-498: Is further classified into hardware, software and information redundancy, depending on the type of redundant resources added to the system. In time redundancy the computation or data transmission is repeated and the result is compared to a stored copy of the previous result. This kind of testing is currently referred to as 'In Service Fault Tolerance Testing', or ISFTT for short. Fault-tolerant design's advantages are obvious, while many of its disadvantages are not: There

2528-409: Is greater than three times the number of disloyal (faulty) generals. There can be a default vote value given to missing messages. For example, missing messages can be given a "null" value . Further, if the agreement is that the null votes are in the majority, a pre-assigned default strategy can be used (e.g., retreat). The typical mapping of this allegory onto computer systems is that the computers are
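The threshold and the default-vote rule can be expressed compactly. The sketch below assumes that missing messages are represented by `None` and that a tie for first place also falls back to the default strategy; the tie rule and the function names are assumptions, not part of the article.

```python
from collections import Counter


def max_tolerable_faults(n: int) -> int:
    """Largest f such that n > 3f, i.e. n >= 3f + 1."""
    return (n - 1) // 3


def tally(votes, default="retreat"):
    """Pick the most common vote; None stands for a missing ("null") message.

    Falls back to the pre-assigned default when null is the most common
    vote, when first place is tied (an assumption), or when no votes arrived.
    """
    if not votes:
        return default
    ranked = Counter(votes).most_common()
    top_value, top_count = ranked[0]
    tied = len(ranked) > 1 and ranked[1][1] == top_count
    if top_value is None or tied:
        return default
    return top_value


print(max_tolerable_faults(4), max_tolerable_faults(9), max_tolerable_faults(10))
print(tally(["attack", "attack", None]), tally([None, None, "attack"]))
```

With four participants one fault is tolerable; nine participants tolerate only two, because nine is not strictly greater than three times three.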

2607-495: Is in error when a two-to-one vote is observed. In this case, the voting circuit can output the correct result, and discard the erroneous version. After this, the internal state of the erroneous replication is assumed to be different from that of the other two, and the voting circuit can switch to a DMR mode. This model can be applied to any larger number of replications. Lockstep fault-tolerant machines are most easily made fully synchronous , with each gate of each replication making

2686-409: Is made between "failing well" and " failing badly ". A system designed to experience graceful degradation , or to fail soft (used in computing, similar to "fail safe" ) operates at a reduced level of performance after some component fails. For example, if grid power fails, a building may operate lighting at reduced levels or elevators at reduced speeds. In computing, if insufficient network bandwidth

2765-423: Is necessary because blockchains are decentralized systems with no central authority, making it essential to achieve consensus among nodes, even if some try to disrupt the process. Safety Mechanisms: Different blockchains use various BFT-based consensus mechanisms like Practical Byzantine Fault Tolerance (PBFT), Tendermint, and Delegated Proof of Stake (DPoS) to handle Byzantine faults. These protocols ensure that

2844-600: Is not necessarily a security problem involving hostile human interference: it can arise purely from physical or software faults. The terms fault and failure are used here according to the standard definitions originally created by a joint committee on "Fundamental Concepts and Terminology" formed by the IEEE Computer Society's Technical Committee on Dependable Computing and Fault-Tolerance and IFIP Working Group 10.4 on Dependable Computing and Fault Tolerance. See also dependability . Byzantine fault tolerance

2923-538: Is only concerned with broadcast consistency, that is, the property that when a component broadcasts a value to all the other components, they all receive exactly this same value, or in the case that the broadcaster is not consistent, the other components agree on a common value themselves. This kind of fault tolerance does not encompass the correctness of the value itself; for example, an adversarial component that deliberately sends an incorrect value, but sends that same value consistently to all components, will not be caught in

3002-399: Is part of the process of fixing design flaws or improving future iterations . The term may be applied to mechanical systems failure. Some types of mechanical failure mechanisms are: excessive deflection, buckling , ductile fracture , brittle fracture , impact , creep, relaxation, thermal shock , wear , corrosion, stress corrosion cracking, and various types of fatigue. Each produces

3081-503: Is still working, as of early 2022. Hyper-dependable computers were pioneered mostly by aircraft manufacturers, nuclear power companies, and the railroad industry in the United States. These entities needed computers with massive amounts of uptime that would fail gracefully enough during a fault to allow continued operation, while relying on constant human monitoring of computer output to detect faults. Again, IBM developed

3160-436: Is the complete identified possible sequence and combination of events, failures (failure modes), conditions, system states, leading to an end (failure) system state. It starts from causes (if known) leading to one particular end effect (the system failure condition). A failure scenario is for a system the same as the failure mechanism is for a component. Both result in a failure mode (state) of the system / component. Rather than

3239-508: Is the opposite of robustness. Resilient networks continue to transmit data despite the failure of some links or nodes. Resilient buildings and infrastructure are likewise expected to prevent complete failure in situations like earthquakes, floods, or collisions. A system with high failure transparency will alert users that a component failure has occurred, even if it continues to operate with full performance, so that failure can be repaired or imminent complete failure anticipated. Likewise,

3318-435: Is to be able to defend against failures of system components with or without symptoms that prevent other components of the system from reaching an agreement among themselves, where such an agreement is needed for the correct operation of the system. The remaining operationally correct components of a Byzantine fault tolerant system will be able to continue providing the system's service as originally intended, assuming there are

3397-589: The mean time between failures should be long enough for the operators to have sufficient time to fix the broken devices ( mean time to repair ) before the backup also fails. It is helpful if the time between failures is as long as possible, but this is not specifically required in a fault-tolerant system. Fault tolerance is notably successful in computer applications. Tandem Computers built their entire business on such machines, which used single-point tolerance to create their NonStop systems with uptimes measured in years. Fail-safe architectures may encompass also

3476-645: The "Practical Byzantine Fault Tolerance" (PBFT) algorithm, which provides high-performance Byzantine state machine replication, processing thousands of requests per second with sub-millisecond increases in latency. After PBFT, several BFT protocols were introduced to improve its robustness and performance. For instance, Q/U, HQ, Zyzzyva, and ABsTRACTs, addressed the performance and cost issues; whereas other protocols, like Aardvark and RBFT, addressed its robustness issues. Furthermore, Adapt tried to make use of existing BFT protocols, through switching between them in an adaptive way, to improve system robustness and performance as

3555-612: The Byzantine fault tolerance scheme. Several early solutions were described by Lamport, Shostak, and Pease in 1982. They began by noting that the Generals' Problem can be reduced to solving a "Commander and Lieutenants" problem where loyal Lieutenants must all act in unison and that their action must correspond to what the Commander ordered in the case that the Commander is loyal: There are many systems that claim BFT without meeting

3634-508: The Computer Science Lab at SRI International . SIFT (for Software Implemented Fault Tolerance) was the brainchild of John Wensley, and was based on the idea of using multiple general-purpose computers that would communicate through pairwise messaging in order to reach a consensus, even if some of the computers were faulty. At the beginning of the project, it was not clear how many computers in total were needed to guarantee that

3713-530: The above minimum requirements (e.g., blockchain). Given that there is mathematical proof that this is impossible, these claims need to include a caveat that their definition of BFT strays from the original. That is, systems such as blockchain don't guarantee agreement, they only make disagreement expensive. Several system architectures were designed c. 1980 that implemented Byzantine fault tolerance. These include: Draper's FTMP, Honeywell's MMFCS, and SRI's SIFT. In 1999, Miguel Castro and Barbara Liskov introduced

3792-421: The compiled program binary directly and does not need to recompile the program. It uses the just-in-time binary instrumentation framework Pin. It attaches to the application process when an error occurs, repairs the execution, tracks the repair effects as the execution continues, contains the repair effects within the application process, and detaches from the process after all repair effects are flushed from

3871-470: The computer software, for example by process replication . Data formats may also be designed to degrade gracefully. HTML for example, is designed to be forward compatible , allowing Web browsers to ignore new and unsupported HTML entities without causing the document to be unusable. Additionally, some sites, including popular platforms such as Twitter (until December 2020), provide an optional lightweight front end that does not rely on JavaScript and has

3950-399: The demands on it are in line with normal traffic flow. The cumulatively unlikely combination of total foot brake failure with the need for harsh braking in an emergency will likely result in a collision, but still one at lower speed than would otherwise have been the case. In comparison with the foot pedal activated service brake, the parking brake itself is a less critical item, and unless it

4029-426: The design of fault-tolerant computer systems for online transaction processing . Hardware fault tolerance sometimes requires that broken parts be taken out and replaced with new parts while the system is still operational (in computing known as hot swapping ). Such a system implemented with a single backup is known as single point tolerant and represents the vast majority of fault-tolerant systems. In such systems

4108-730: The development in the so-called LLNM (Long Life, No Maintenance) computing was done by NASA during the 1960s, in preparation for Project Apollo and other research aspects. NASA's first machine went into a space observatory , and their second attempt, the JSTAR computer, was used in Voyager . This computer had a backup of memory arrays to use memory recovery methods and thus it was called the JPL Self-Testing-And-Repairing computer. It could detect its own errors and fix them or bring up redundant modules as needed. The computer

4187-517: The exception of the front tires, which are used to steer, but generally carry less load, each and in total, than the other four to 16, so are less likely to fail). The idea of incorporating redundancy in order to improve the reliability of a system was pioneered by John von Neumann in the 1950s. Two kinds of redundancy are possible: space redundancy and time redundancy. Space redundancy provides additional components, functions, or data items that are unnecessary for fault-free operation. Space redundancy

4266-408: The first computer of this kind for NASA for guidance of Saturn V rockets, but later on BNSF , Unisys , and General Electric built their own. In the 1970s, much work happened in the field. For instance, F14 CADC had built-in self-test and redundancy. In general, the early efforts at fault-tolerant designs were focused mainly on internal diagnosis, where a fault would indicate something

4345-458: The first fundamental characteristic of fault tolerance in three ways: All implementations of RAID , redundant array of independent disks , except RAID 0, are examples of a fault-tolerant storage device that uses data redundancy . A lockstep fault-tolerant machine uses replicated elements operating in parallel. At any time, all the replications of each element should be in the same state. The same inputs are provided to each replication , and

4424-414: The generals and their digital communication system links are the messengers. Although the problem is formulated in the allegory as a decision-making and security problem, in electronics, it cannot be solved by cryptographic digital signatures alone, because failures such as incorrect voltages can propagate through the encryption process. Thus, a faulty message could be sent such that some recipients detect

4503-469: The loss of either only reducing brake power by 50% and not causing as much dangerous brakeforce imbalance as a straight front-back or left-right split, and should the hydraulic circuit fail completely (a relatively very rare occurrence), there is a failsafe in the form of the cable-actuated parking brake that operates the otherwise relatively weak rear brakes, but can still bring the vehicle to a safe halt in conjunction with transmission/engine braking so long as

4582-451: The main components and they would add considerable weight. However, the similarly critical systems for actuating the brakes under driver control are inherently less robust, generally using a cable (can rust, stretch, jam, snap) or hydraulic fluid (can leak, boil and develop bubbles, absorb water and thus lose effectiveness). Thus in most modern cars the footbrake hydraulic brake circuit is diagonally divided to give two smaller points of failure,

4661-511: The majority of honest nodes can agree on the next block in the chain, securing the network against attacks and preventing double-spending and other types of fraud. Practical examples of such networks include Hyperledger Fabric, Cosmos and Klever. 51% Attack Mitigation: While traditional blockchains like Bitcoin use Proof of Work (PoW), which is susceptible to a 51% attack, BFT-based systems are designed to tolerate fewer than one-third of their nodes being faulty or malicious without compromising

4740-485: The message as faulty (bad signature), others see it as having a good signature, and a third group also sees a good signature but with different message contents than the second group. The problem of obtaining Byzantine consensus was conceived and formalized by Robert Shostak, who dubbed it the interactive consistency problem. This work was done in 1978 in the context of the NASA-sponsored SIFT project in

4819-523: The more carefully all possible interactions have to be considered and prepared for. Considering the importance of high-value systems in transport, public utilities and the military, the field of topics that touch on research is very wide: it can include such obvious subjects as software modeling and reliability, or hardware design , to arcane elements such as stochastic models, graph theory , formal or exclusionary logic, parallel processing , remote data transmission , and more. Spare components address

4898-449: The network's integrity. Decentralized Trust: Byzantine Fault Tolerance underpins the trust model in decentralized networks. Instead of relying on a central authority, the network's security depends on the ability of honest nodes to outnumber and outmaneuver malicious ones. Private and Permissioned Blockchains: BFT is especially important in private or permissioned blockchains, where a limited number of known participants need to reach

4977-445: The order of a microsecond of added latency. The SpaceX Dragon considers Byzantine fault tolerance in its design. Fault-tolerant computer system Fault tolerance is the ability of a system to maintain proper operation despite failures or faults in one or more of its components. This capability is essential for high-availability , mission-critical , or even life-critical systems . Fault tolerance specifically refers to

5056-466: The other unaffected. Second, the rear brake is relatively strong compared to its automotive cousin, being a powerful disc on some sports models, even though the usual intent is for the front system to provide the vast majority of braking force; as the overall vehicle weight is more central, the rear tire is generally larger and has better traction, so that the rider can lean back to put more weight on it, therefore allowing more brake force to be applied before

5135-666: The overall system remains functional despite hardware or software issues. Non-computing examples include structures that retain their integrity despite damage from fatigue , corrosion or impact. The first known fault-tolerant computer was SAPO , built in 1951 in Czechoslovakia by Antonín Svoboda . Its basic design was magnetic drums connected via relays, with a voting method of memory error detection ( triple modular redundancy ). Several other machines were developed along this line, mostly for military use. Eventually, they separated into three distinct categories: Most of

5214-416: The process state. It does not interfere with the normal execution of the program and therefore incurs negligible overhead. For 17 of 18 systematically collected real world null-dereference and divide-by-zero errors, a prototype implementation enables the application to continue to execute to provide acceptable output and service to its users on the error-triggering inputs. The circuit breaker design pattern

5293-470: The program of the error or abort the program. The approach has performance costs: because the technique rewrites code to insert dynamic checks for address validity, execution time will increase by 80% to 500%. Recovery shepherding is a lightweight technique to enable software programs to recover from otherwise fatal errors such as null pointer dereference and divide by zero. Compared to the failure-oblivious computing technique, recovery shepherding works on

5372-399: The rest of the machine, like a lot of low-priced bikes after their first few years of use, is on the point of collapse from neglected maintenance. The basic characteristics of fault tolerance require: In addition, fault-tolerant systems are characterized in terms of both planned service outages and unplanned service outages. These are usually measured at the application level and not just at

5451-423: The same outputs are expected. The outputs of the replications are compared using a voting circuit. A machine with two replications of each element is termed dual modular redundant (DMR). The voting circuit can then only detect a mismatch and recovery relies on other methods. A machine with three replications of each element is termed triple modular redundant (TMR). The voting circuit can determine which replication
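The difference between DMR and TMR voting described above can be illustrated with a small sketch. This is only an illustration under stated assumptions: the function name `vote` and its return shape are hypothetical, replica outputs are modelled as plain comparable values, and a real voting circuit is implemented in hardware rather than software.

```python
from collections import Counter
from typing import Hashable, List, Optional, Sequence, Tuple


def vote(outputs: Sequence[Hashable]) -> Tuple[Optional[Hashable], List[int]]:
    """Compare replica outputs (hypothetical helper, not from the article).

    Returns (agreed_value, suspect_indices). If no strict majority exists,
    agreed_value is None and every replica is listed as suspect, because
    the voter cannot tell which one is wrong.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        # No strict majority: a mismatch is detected but cannot be located.
        return None, list(range(len(outputs)))
    # Strict majority: replicas that disagree with it are the suspects.
    return value, [i for i, out in enumerate(outputs) if out != value]


# DMR: two replicas that disagree. The mismatch is detected, nothing more.
print(vote([1, 2]))
# TMR: three replicas, one disagrees. The odd one out is identified.
print(vote([7, 7, 9]))
```

With two replicas a disagreement leaves no majority, so recovery has to rely on some other method; with three, a single faulty replica is outvoted and can be identified by its index.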

5530-466: The same state transition on the same edge of the clock, and the clocks to the replications being exactly in phase. However, it is possible to build lockstep systems without this requirement. Bringing the replications into synchrony requires making their internal stored states the same. They can be started from a fixed initial state, such as the reset state. Alternatively, the internal state of one replica can be copied to another replica. One variant of DMR

5609-688: The simple description of symptoms that many product users or process participants might use, the term failure scenario / mechanism refers to a rather complete description, including the preconditions under which failure occurs, how the thing was being used, proximate and ultimate/final causes (if known), and any subsidiary or resulting failures. The term is part of the engineering lexicon, especially of engineers working to test and debug products or processes. Carefully observing and describing failure conditions, identifying whether failures are reproducible or transient, and hypothesizing what combination of conditions and sequence of events led to failure

5688-617: The third test is passed. Therefore, adding seat belts to all vehicles is an excellent idea. Other "supplemental restraint systems", such as airbags, are more expensive and so pass that test by a smaller margin. Another excellent and long-term example of this principle being put into practice is the braking system: whilst the actual brake mechanisms are critical, they are not particularly prone to sudden (rather than progressive) failure, and are in any case necessarily duplicated to allow even and balanced application of brake force to all wheels. It would also be prohibitively costly to further double-up

5767-446: The two failures are considered one single fault condition. A source offers the following example: A single-fault condition is a condition when a single means for protection against hazard in equipment is defective or a single external abnormal condition is present, e.g. short circuit between the live parts and the applied part. Providing fault-tolerant design for every component is normally not an option. Associated redundancy brings

5846-498: The underlying cause of a failure or which initiate a process which leads to failure. Where failure depends on the user of the product or process, then human error must be considered. A part failure mode is the way in which a component failed "functionally" on the component level. Often a part has only a few failure modes. For example, a relay may fail to open or close contacts on demand. The failure mechanism that caused this can be of many different kinds, and often multiple factors play

5925-551: The underlying conditions change. Furthermore, BFT protocols were introduced that leverage trusted components to reduce the number of replicas, e.g., A2M-PBFT-EA and MinBFT. Several examples of Byzantine failures that have occurred are given in two equivalent journal papers. These and other examples are described on the NASA DASHlink web pages. Byzantine fault tolerance mechanisms use components that repeat an incoming message (or just its signature, which can be reduced to just

6004-453: The vehicle rolls over or undergoes severe g-forces, then this primary method of occupant restraint may fail. Restraining the occupants during such an accident is absolutely critical to safety, so the first test is passed. Accidents causing occupant ejection were quite common before seat belts, so the second test is passed. The cost of a redundant restraint method like seat belts is quite low, both economically and in terms of weight and space, so

6083-409: The wheel locks. On cheaper, slower utility-class machines, even if the front wheel should use a hydraulic disc for extra brake force and easier packaging, the rear will usually be a primitive, somewhat inefficient, but exceptionally robust rod-actuated drum, thanks to the ease of connecting the footpedal to the wheel in this way and, more importantly, the near impossibility of catastrophic failure even if

6162-416: Was failing and a worker could replace it. SAPO, for instance, had a method by which faulty memory drums would emit a noise before failure. Later efforts showed that to be fully effective, the system had to be self-repairing and diagnosing – isolating a fault and then implementing a redundant backup while alerting a need for repair. This is known as N-model redundancy, where faults cause automatic fail-safes and

6241-410: Was faulty and have it taken out of service. This is called M out of N majority voting. Historically, the trend has been to move away from N-model and toward M out of N, as systems grew more complex and it became harder to ensure that the transition from fault-negative to fault-positive did not disrupt operations. Tandem Computers, in 1976, and Stratus were among the first companies specializing in
