Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fdbserver] Gray failure and simulator improvements related to remote processes #11717

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

spraza
Copy link
Collaborator

@spraza spraza commented Oct 16, 2024

This PR adds a group of gray failure features related to remote processes. In order to achieve these features, the simulator is made more customizable, in addition to adding more in-memory state in CC. The reviewer can review commit by commit if it's more convenient, here is a high level overview of these commits:

  1. In simulator, added a new option to specify number of desired remote tlogs.
  2. In simulator, added a new option to specify number of stateless class machines in each DC.
  3. Added a new sim test which reproduces the issue where a degraded remote tlog can cause high remote SS lag. This test uses the simulator improvements mentioned above. This test also also adds state machine logic of test phases to ensure test goes through a valid state transition path. This does depend on topology though, so in some cases, the test may choose to skip the state path check.
  4. Added multiple new gray failure features: remote log router latency degradation check, remote tlog latency degradation check, remote tlog disconnect check. All these health checks can result in gray failure excluding these bad processes and triggering recovery.
  5. Fundamentally, the reason in the past where we couldn't do latency based degradation for remote processes is that CC was not aware of inter vs intra DC latency. This PR adds that capability now, which means that remote process complaints from workers are considered by CC if both the complainer worker and the bad peer are in the same DC. To achieve this, and make it O(1), an addr_locality mapping is introduced. It's populated (insert, erase) similar to id_worker mapping.

100K Joshua: 20241018-002455-praza-40b0433b9b2f972dcc759bf2d19aaa43470f7d

Code-Reviewer Section

The general pull request guidelines can be found here.

Please check each of the following things and check all boxes before accepting a PR.

  • The PR has a description, explaining both the problem and the solution.
  • The description mentions which forms of testing were done and the testing seems reasonable.
  • Every function/class/actor that was touched is reasonably well documented.

For Release-Branches

If this PR is made against a release-branch, please also check the following:

  • This change/bugfix is a cherry-pick from the next younger branch (younger release-branch or main if this is the youngest branch)
  • There is a good reason why this PR needs to go into a release branch and this reason is documented (either in the description above or in a linked GitHub issue)

// remote log routers are experiencing degradation
// (latency) with their peers. Gray failure may trigger
// recovery based on this.
bool CC_ENABLE_REMOTE_LOG_ROUTER_MONITORING; // When enabled, gray failure tries to detect whether
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason for not saying DISCONNECT in the name is to keep backwards compatibility with the previous knob.

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

  • Commit ID: 8a0469e
  • Duration 0:17:57
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr on Linux CentOS 7

  • Commit ID: 8a0469e
  • Duration 0:18:13
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang-ide on Linux CentOS 7

  • Commit ID: 8a0469e
  • Duration 0:21:16
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: 8a0469e
  • Duration 0:44:19
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: 8a0469e
  • Duration 0:48:38
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: 8a0469e
  • Duration 0:56:05
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang on Linux CentOS 7

  • Commit ID: 8a0469e
  • Duration 0:57:52
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@spraza spraza changed the title [fdbserver] Gray failure and simulator improvements [fdbserver] Gray failure and simulator improvements related to remote processes Oct 16, 2024
@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang-ide on Linux CentOS 7

  • Commit ID: 3f6dd3d
  • Duration 0:22:56
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr on Linux CentOS 7

  • Commit ID: 3f6dd3d
  • Duration 0:28:35
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

  • Commit ID: 3f6dd3d
  • Duration 0:28:43
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang on Linux CentOS 7

  • Commit ID: 3f6dd3d
  • Duration 0:48:28
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: 3f6dd3d
  • Duration 0:52:33
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

fix compiler warning
@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang-ide on Linux CentOS 7

  • Commit ID: cefc144
  • Duration 0:21:43
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang on Linux CentOS 7

  • Commit ID: cefc144
  • Duration 0:47:26
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: cefc144
  • Duration 0:52:47
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

  • Commit ID: cefc144
  • Duration 0:58:52
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr on Linux CentOS 7

  • Commit ID: cefc144
  • Duration 1:09:22
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

testState = localState;
}
}
when(wait(clog)) {}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this

state Future<Void> clog = self->clogRemoteTLog(self, db);
state TestState testState = TestState::TEST_INIT;
self->actualStatePath.push_back(testState);
loop choose {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove choose when

Comment on lines 23 to 24
#include "fdbrpc/Locality.h"
#include "flow/NetworkAddress.h"
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move below

// Currently, only CC_ONLY_CONSIDER_INTRA_DC_LATENCY feature relies on addr_locality mapping. In the future, if
// needed, we can populate the mapping unconditionally.
if (SERVER_KNOBS->CC_ONLY_CONSIDER_INTRA_DC_LATENCY) {
const bool addrDcChanged = self->addr_locality.contains(w.address()) &&
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separately, understand why we can not relax this condition.

incorporate Jingyu's feedback
@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang-ide on Linux CentOS 7

  • Commit ID: 40b0433
  • Duration 0:22:48
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: 40b0433
  • Duration 0:44:22
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: 40b0433
  • Duration 0:50:15
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: 40b0433
  • Duration 0:53:28
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

  • Commit ID: 40b0433
  • Duration 0:56:27
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang on Linux CentOS 7

  • Commit ID: 40b0433
  • Duration 1:06:45
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr on Linux CentOS 7

  • Commit ID: 40b0433
  • Duration 1:09:39
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

Copy link
Contributor

@jzhou77 jzhou77 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants