Byzantine failure
In fault-tolerant distributed computing, a Byzantine failure is an arbitrary fault that occurs during the execution of an algorithm by a distributed system. It encompasses the faults commonly referred to as "crash failures" and "send and omission failures." Once a Byzantine failure has occurred, the system may behave in an arbitrary and unpredictable way.
These arbitrary failures may be loosely categorized as follows:
- a failure to take another step in the algorithm, also known as a crash failure;
- a failure to correctly execute a step of the algorithm; and
- arbitrary execution of a step other than the one indicated by the algorithm.
Steps are taken by processes, the abstractions that execute the algorithms. A faulty process is one that at some point exhibits one of the above failures. A process that is not faulty is correct.
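The three categories above can be illustrated with a toy process. This is only a sketch under stated assumptions: the three-step `ALGORITHM`, the `Fault` enum, and `run_process` are hypothetical names invented for the illustration, and the specific corruptions (adding 1 to a result, running the next step instead of the indicated one) are arbitrary stand-ins for what a real faulty process might do.

```python
from enum import Enum


class Fault(Enum):
    NONE = "none"              # correct process
    CRASH = "crash"            # takes no further steps
    BAD_STEP = "bad_step"      # executes the indicated step incorrectly
    WRONG_STEP = "wrong_step"  # executes a step other than the one indicated


# Hypothetical three-step algorithm acting on an integer state.
ALGORITHM = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]


def run_process(start, fault=Fault.NONE, fault_at=1):
    """Run ALGORITHM from `start`, injecting `fault` at step index `fault_at`.

    Returns the list of states produced after each step that was taken.
    """
    state = start
    trace = []
    for i, step in enumerate(ALGORITHM):
        if i == fault_at and fault is Fault.CRASH:
            break  # crash failure: the process stops taking steps
        if i == fault_at and fault is Fault.BAD_STEP:
            state = step(state) + 1  # right step, corrupted result
        elif i == fault_at and fault is Fault.WRONG_STEP:
            other = ALGORITHM[(i + 1) % len(ALGORITHM)]
            state = other(state)  # some other step runs instead
        else:
            state = step(state)  # correct execution of the indicated step
        trace.append(state)
    return trace


print(run_process(1))                    # [2, 4, 1]
print(run_process(1, Fault.CRASH))       # [2]
print(run_process(1, Fault.BAD_STEP))    # [2, 5, 2]
print(run_process(1, Fault.WRONG_STEP))  # [2, -1, -4]
```

In this sketch a process is faulty exactly when its trace differs from the trace of the fault-free run. Note that a crashed process is the easiest case for an observer to reason about, since it never reports a wrong state; the other two faults produce states that look plausible but are incorrect.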
The term Byzantine refers to the Byzantine Generals' Problem, an agreement problem in which generals of the Byzantine Empire's army must decide unanimously whether or not to attack an enemy army. The problem is complicated by the geographic separation of the generals, who must communicate by sending messengers to each other, and by the presence of traitors among the generals. These traitors can act arbitrarily in pursuit of the following aims: to trick some generals into attacking; to force a decision that is not consistent with the generals' desires, e.g. forcing an attack when no general wished to attack; or to confuse some generals so thoroughly that they never make up their minds. If the traitors succeed in any of these aims, any resulting attack is doomed, as only a concerted effort can result in victory.
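To make the setting concrete, the following sketch simulates one commander and three lieutenants with at most one traitor, loosely following the one-round oral-message scheme from the Lamport, Shostak, and Pease paper listed in the references. The names `om1`, `majority`, `ATTACK`, and `RETREAT` are hypothetical, and the model is simplified: a traitorous lieutenant relays the same false value to everyone, and ties default to retreat.

```python
from collections import Counter

ATTACK, RETREAT = "attack", "retreat"


def majority(values):
    """Return the most common value; ties default to RETREAT so that
    every loyal lieutenant breaks ties the same way."""
    counts = Counter(values).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return RETREAT
    return counts[0][0]


def om1(orders, traitor_lieutenant=None, lie=None):
    """One relay round among the lieutenants.

    orders[i] is the order the commander sent to lieutenant i (a
    traitorous commander may send different orders to different
    lieutenants). A traitorous lieutenant relays `lie` instead of the
    order it received. Returns the decision of each loyal lieutenant.
    """
    n = len(orders)
    decisions = {}
    for i in range(n):
        if i == traitor_lieutenant:
            continue  # a traitor's own decision is irrelevant
        seen = [orders[i]]  # value received directly from the commander
        for j in range(n):
            if j != i:
                # value lieutenant j claims to have received
                seen.append(lie if j == traitor_lieutenant else orders[j])
        decisions[i] = majority(seen)
    return decisions


# Loyal commander, lieutenant 2 is a traitor: loyal lieutenants obey.
print(om1([ATTACK] * 3, traitor_lieutenant=2, lie=RETREAT))
# Traitorous commander sends mixed orders: lieutenants still agree.
print(om1([ATTACK, RETREAT, ATTACK]))
```

In both runs every loyal lieutenant decides to attack: in the first because two of the three values each one sees are the commander's true order, in the second because all three lieutenants end up seeing the same set of values and so compute the same majority.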
The Byzantine failure assumption models real-world environments in which computers and networks may behave in unexpected ways due to hardware failures, network congestion and disconnection, as well as malicious attacks. Byzantine failure-tolerant algorithms must cope with such failures and still satisfy the specifications of the problems they are designed to solve. Such algorithms are commonly characterized by their resilience t, the maximum number of faulty processes with which an algorithm can cope.
Many classic agreement problems, such as the Byzantine Generals' Problem, have no solution unless t < n/3, where n is the number of processes in the system.
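The bound t < n/3 is equivalent to n ≥ 3t + 1, which makes it easy to compute in either direction. The helper names `max_faulty` and `min_processes` below are hypothetical and serve only to work through the arithmetic.

```python
def max_faulty(n):
    """Largest integer t satisfying t < n/3, i.e. 3*t < n."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return (n - 1) // 3


def min_processes(t):
    """Smallest n satisfying t < n/3, i.e. n >= 3*t + 1."""
    if t < 0:
        raise ValueError("t must be non-negative")
    return 3 * t + 1


for n in (3, 4, 6, 7, 10):
    print(n, max_faulty(n))
# 3 0
# 4 1
# 6 1
# 7 2
# 10 3
```

For example, three processes cannot cope with even one faulty process under this bound, while four processes can cope with one.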
References
- L. Lamport, R. Shostak, and M. Pease, The Byzantine Generals Problem (http://research.microsoft.com/users/lamport/pubs/byz.pdf), ACM Trans. Programming Languages and Systems, Vol. 4, No. 3, July 1982, pp. 382-401.