Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RATIS-998: shouldWithholdVotes() should be triggered for handling higher term #144

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

GlenGeng-awx
Copy link
Contributor

What changes were proposed in this pull request?

shouldWithholdVotes() should be triggered for handling higher term

What is the link to the Apache JIRA

https://issues.apache.org/jira/projects/RATIS/issues/RATIS-998?filter=allissues

How was this patch tested?

CI

Copy link
Contributor

@lokeshj1703 lokeshj1703 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@GlenGeng Thanks for working on this! I think this works as expected. withhold vote should be false if server term is less than candidate term and would be true if server term >= candidateTerm.

@GlenGeng-awx
Copy link
Contributor Author

GlenGeng-awx commented Jul 15, 2020

@szetszwo Hello Nicholas, could you please have a quick look for this PR ?

I consider that:

  1. withhold vote should be used for higher term from disruptive server,
  2. request vote can work correctly, but shouldWithholdVotes is redundant.

Since the change is critical, could you please double confirm. Thanks!

@@ -785,7 +785,7 @@ public RaftClientReply setConfiguration(SetConfigurationRequest request) throws
}

private boolean shouldWithholdVotes(long candidateTerm) {
if (state.getCurrentTerm() < candidateTerm) {
if (state.getCurrentTerm() >= candidateTerm) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason behind the change? It seems to contradict the raft paper.

Copy link
Contributor Author

@GlenGeng-awx GlenGeng-awx Aug 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does obey the paper.

4.2.3 Disruptive servers
Without additional mechanism, servers not in Cnew can disrupt the cluster. Once the cluster leader has created the Cnew entry, a server that is not in Cnew will no longer receive heartbeats, so it will time out and start new elections. Furthermore, it will not receive the Cnew entry or learn of that entry’s commitment, so it will not know that it has been removed from the cluster. The server will send RequestVote RPCs with new term numbers, and this will cause the current leader to revert to follower state. A new leader from Cnew will eventually be elected, but the disruptive server will time out again and the process will repeat, resulting in poor availability. If multiple servers have been removed from the cluster, the situation could degrade further.

Because of this scenario, we now believe that no solution based on comparing logs alone (such as the Pre-Vote check) will be sufficient to tell if an election will be disruptive.

Raft’s solution uses heartbeats to determine when a valid leader exists. In Raft, a leader is
considered active if it is able to maintain heartbeats to its followers (otherwise, another server will start an election). Thus, servers should not be able to disrupt a leader whose cluster is receiving heartbeats. We modify the RequestVote RPC to achieve this: if a server receives a RequestVote request within the minimum election timeout of hearing from a current leader, it does not update its term or grant its vote. It can either drop the request, reply with a vote denial, or delay the request; the result is essentially the same. This does not affect normal elections, where each server waits at least a minimum election timeout before starting an election. However, it helps avoid disruptions from servers not in Cnew: while a leader is able to get heartbeats to its cluster, it will not be deposed by larger term numbers.

BTW, shouldWithholdVotes is redundant now, the cases handled by shouldWithholdVotes() can also handled by recognizeCandidate(). if currentTerm is larger than candidateTerm, just reject the RequestVote request, no further handling is needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If (state.getCurrentTerm() >= candidateTerm), the server should not vote it. Why returning false?

Honestly, I don't understand what you are trying to fix. Could you add a unit test to show it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the key point here is, voting should only depends on voterFor and currentTerm, whose correctness is proved formally

截屏2020-08-08 下午1 59 32

Voting that depends on role and requestTime/responseTime is not a theoretically proved way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants