A New Distributed Algorithm that Solves Performance and Reliability Problems of Consensus Protocols

Category: Computer Science

tldr #

EPFL researchers have developed a new distributed algorithm called QuePaxa that solves one of the key performance and reliability problems of currently deployed consensus protocols. QuePaxa is an asynchronous consensus protocol that matches the speed and efficiency of the widely deployed leader-based protocols under normal network conditions, and automatically falls back on its asynchronous properties to keep working if the network or the leader fails.


content #

EPFL researchers have developed a new distributed algorithm that, for the first time, solves one of the key performance and reliability problems affecting most of the currently-deployed consensus protocols. The work has been published in Proceedings of the 29th Symposium on Operating Systems Principles.

Consensus is one of the fundamental problems in distributed systems. It allows a group of machines to maintain multiple copies of data and update them consistently, even when a fraction of the machines fail. Take the example of three servers that each store a copy of the data and must keep track of every update so that all three copies remain consistent. If one server fails, the remaining two must keep the data consistent and allow updates to continue normally, as if there had been no failure.
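The three-server example above can be sketched as majority-quorum replication: a write counts as committed once a majority (2 of 3) of replicas acknowledge it, so the group tolerates one failure. This is a generic illustration, not the QuePaxa protocol itself; the class and function names are invented for the sketch.

```python
# Minimal sketch (not QuePaxa itself): majority-quorum replication
# among three replicas. A write is durable once a majority (2 of 3)
# acknowledges it, so the system tolerates one replica failure.

class Replica:
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.log = []

    def ack_write(self, value):
        if not self.alive:
            return False        # a failed replica cannot acknowledge
        self.log.append(value)
        return True

def replicate(replicas, value):
    acks = sum(1 for r in replicas if r.ack_write(value))
    majority = len(replicas) // 2 + 1
    return acks >= majority     # committed only with a majority of acks

replicas = [Replica("A"), Replica("B"), Replica("C")]
assert replicate(replicas, "x=1")      # all three alive: commit succeeds
replicas[2].alive = False              # one server fails
assert replicate(replicas, "x=2")      # 2 of 3 still commit normally
replicas[1].alive = False              # a second failure
assert not replicate(replicas, "x=3")  # no majority: cannot commit
```

With one failure the two survivors still form a majority, which is exactly why three copies suffice to mask a single crash.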

The QuePaxa protocol is the first asynchronous consensus protocol that is just as fast and efficient as the widely deployed leader-based protocols under normal network conditions

Current state-of-the-art consensus protocols rely on one computer node being designated the leader at any given time, continually supervising and handling any updates to the data. If the leader fails, another node wakes up and takes over, but there's a challenge: how long should another node wait before taking over from an unresponsive leader?

"If the leader fails or the network is bad, the problem with the classic consensus protocols is that there's the very tricky question of how you decide how big or small the timeout should be," explained Professor Bryan Ford, Head of the Decentralized and Distributed Systems Laboratory (DEDIS) in EPFL's School of Computer and Communications Sciences (IC).

Consensus protocols can be vulnerable to denial of service attacks

"If you set it too big, then when a leader fails, you might be waiting a long time and the system is just dead. On the other hand, consider if you set the timeout too short—this is where the real disaster can happen."

"Suppose the old leader hasn't failed; suppose the network is just a little slower than you thought it was. The next leader comes and tries to take over, but the way all the existing protocols work, the new leader's actions cancel the old leader's, so the old leader can no longer finish what it was doing and all its work is wasted. These kinds of issues can cause major reliability problems, and these leader-based protocols can fail entirely if there's a deliberate denial-of-service attack," he continued.
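The timeout dilemma Ford describes can be made concrete with a toy model. The function and the numbers below are hypothetical, chosen only to illustrate the three outcomes: a well-sized timeout, a timeout too short (a slow network mistaken for a dead leader), and a timeout too long (slow recovery after a real crash).

```python
# Hypothetical model of leader-failover timeouts: a follower declares
# the leader dead after `timeout` seconds without a heartbeat.

def failover_outcome(timeout, heartbeat_delay, leader_alive):
    if leader_alive and heartbeat_delay <= timeout:
        return "stable"            # leader keeps working undisturbed
    if leader_alive:
        # a slow network was mistaken for a failure: a new leader
        # takes over and cancels the old leader's in-flight work
        return "dueling leaders"
    # the leader really crashed: the system sat idle for the full timeout
    return f"recovered after ~{timeout}s"

# A timeout comfortably above the usual heartbeat delay: no disruption.
assert failover_outcome(timeout=1.0, heartbeat_delay=0.1, leader_alive=True) == "stable"
# Timeout too short: a 0.5s network hiccup triggers a needless takeover.
assert failover_outcome(timeout=0.2, heartbeat_delay=0.5, leader_alive=True) == "dueling leaders"
# Timeout too long: after a real crash the system is dead for 30 seconds.
assert failover_outcome(timeout=30.0, heartbeat_delay=0.1, leader_alive=False) == "recovered after ~30.0s"
```

The model shows why no single timeout value is safe: any fixed choice is wrong for some network condition, which is the weakness QuePaxa's asynchronous design avoids.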

QuePaxa's asynchronous properties allow it to work even if a leader fails

To overcome these challenges, DEDIS researchers have been investigating a rarely used class of consensus algorithms known as asynchronous consensus protocols. Unlike current leader-based protocols, their asynchronous cousins are not vulnerable to leader failures and denial-of-service attacks. But there's a big trade-off—prior asynchronous protocols are much less efficient under normal conditions, and that's one reason they are almost never deployed.
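QuePaxa's actual mechanism is more sophisticated than this, but the classic way asynchronous protocols avoid leader timeouts is randomization, as in Ben-Or-style randomized consensus: nodes adopt a clear majority value when one exists and flip a coin otherwise, so no node ever waits on a timer. The sketch below is a toy illustration of that general idea, not QuePaxa's algorithm.

```python
import random

# Toy Ben-Or-style randomized binary consensus (not QuePaxa itself).
# Each round, every node adopts a value held by a strict majority if
# one exists, otherwise it flips a coin; no timeouts are needed.

def ben_or_round(values, rng):
    n = len(values)
    counts = {v: values.count(v) for v in set(values)}
    # a value held by more than half the nodes, if any
    majority = [v for v, c in counts.items() if c > n // 2]
    return [majority[0] if majority else rng.choice([0, 1]) for _ in range(n)]

def run(values, seed=0):
    rng = random.Random(seed)
    rounds = 0
    while len(set(values)) > 1:   # repeat until all nodes agree
        values = ben_or_round(values, rng)
        rounds += 1
    return values[0], rounds

decision, rounds = run([0, 1, 1, 0, 1])
print(f"decided {decision} after {rounds} round(s)")  # → decided 1 after 1 round(s)
```

Because progress comes from coin flips rather than timers, a slow network can only delay agreement, never derail it, which is the property QuePaxa retains while matching leader-based efficiency.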

Current leader-based consensus protocols rely on the designated leader supervising and handling any updates to data

For the first time, Ford says, their QuePaxa protocol changes this dynamic. "We've come up with a win-win. What is new and unique to QuePaxa is that it's an asynchronous consensus protocol that finally achieves efficiency equivalent to the widely deployed leader-based protocols under normal network conditions. QuePaxa is just as fast and efficient, with equally low latency and network-bandwidth cost, under normal conditions. But as soon as a leader fails or the network is congested, it automatically starts to use its asynchronous properties to keep working."

Consensus is a fundamental problem in distributed systems that ensures multiple copies of data are updated consistently
