Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[gray-failure] Add simulation test that shows RemoteTLog issue #11672

Closed
wants to merge 1 commit into from

Conversation

spraza
Copy link
Collaborator

@spraza spraza commented Sep 20, 2024

In a scenario where a remote tLog has an outbound network issue, for example to a remote SS, then the gray failure algorithm currently does not detect and recover from this case. I am working on fixing this, but meanwhile sending the simulation test for review. In the simulation test, I have commented out the assert for now, and will enable it as part of the fix PR.

Code-Reviewer Section

The general pull request guidelines can be found here.

Please check each of the following things and check all boxes before accepting a PR.

  • The PR has a description, explaining both the problem and the solution.
  • The description mentions which forms of testing were done and the testing seems reasonable.
  • Every function/class/actor that was touched is reasonably well documented.

For Release-Branches

If this PR is made against a release-branch, please also check the following:

  • This change/bugfix is a cherry-pick from the next younger branch (younger release-branch or main if this is the youngest branch)
  • There is a good reason why this PR needs to go into a release branch and this reason is documented (either in the description above or in a linked GitHub issue)

@spraza spraza requested a review from jzhou77 September 20, 2024 05:54
processesPerMachine = 1
machineCount = 20
generateFearless = true
minimumRegions = 2
Copy link
Collaborator Author

@spraza spraza Sep 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: Although generateFearless simulates HA mode with 4 DCs, it's possible all those DCs are in the same region (it happens in this PR with rocksdb storage engine). So setting minimumRegions to 2 explicitly since the test relies on 2 regions.

Comment on lines +89 to +90
// TODO (praza): Uncomment this assert when gray failure detection is improved to automatically fix this issue
// ASSERT(maxSSLagSec < self->lagThresholdSec);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have confirmed that the test fails when the assert is uncommented.

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: 81c773f
  • Duration 0:07:05
  • Result: ❌ FAILED
  • Error: Error while executing command: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${HOME}/.ssh_key ec2-user@${MAC_EC2_HOST} /opt/homebrew/bin/bash --login -c ./build_pr_macos.sh. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: 81c773f
  • Duration 0:11:40
  • Result: ❌ FAILED
  • Error: Error while executing command: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${HOME}/.ssh_key ec2-user@${MAC_EC2_HOST} /usr/local/bin/bash --login -c ./build_pr_macos.sh. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang-ide on Linux CentOS 7

  • Commit ID: 81c773f
  • Duration 0:21:27
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: 81c773f
  • Duration 0:51:14
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

  • Commit ID: 81c773f
  • Duration 0:54:24
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang on Linux CentOS 7

  • Commit ID: 81c773f
  • Duration 1:06:20
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr on Linux CentOS 7

  • Commit ID: 81c773f
  • Duration 1:15:56
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

Comment on lines +96 to +106
Transaction tr(db);
tr.setOption(FDBTransactionOptions::READ_SYSTEM_KEYS);
tr.setOption(FDBTransactionOptions::PRIORITY_SYSTEM_IMMEDIATE);
tr.setOption(FDBTransactionOptions::LOCK_AWARE);
std::vector<std::pair<StorageServerInterface, ProcessClass>> results =
wait(NativeAPI::getServerListAndProcessClasses(&tr));
for (auto& [ssi, p] : results) {
if (ssi.locality.dcId().present() && ssi.locality.dcId().get() == g_simulator->remoteDcId) {
ret.push_back(ssi.address().ip);
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These txn read should be put into a try...catch block so that when encountering errors, the read can be retried.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, will update the PR with this fix soon.

TraceEvent("UnclogRemoteTLog").detail("SrcIP", remoteTLogIP).detail("DstIP", remoteSSIP);

// We only clog once, so this actor never finishes
wait(Never());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this actor never finishes, then workload() can only exit because of a cancel due to timeout. Is this intentional?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cancel due to timeout (value is testDuration) is intentional. Overall intent is:

  1. Run measurement and clog actors concurrently.
  2. If the measurement actor sees high SS lag, it asserts (will enable in feature PR).
  3. Clog actor's job is to just clog+unclog and then do nothing until end of test (based on timeout/testDuration value).

Any concerns with this approach?

Relatedly, I still want to ensure clog+unclog have already happened before end of test (based on timeout/testDuration). Theoretically, it looks like it could be possible that we are stuck somewhere and the test exits, but we haven't done clog+unclog. In this kind of situation, the test should fail. To fix this:

  1. I can maintain explicit state that the clog+unclog happened, as well as we have enough measurement samples, and assert at test exit time that state value is as expected.
  2. Try this approach: https://github.com/spraza/foundationdb/blob/392bad2bd31ba146e05641680929da2c10480bc5/fdbserver/workloads/ClogTlog.actor.cpp#L197-L201. But I am not sure how we pick 10 as in the workloadEnd calculation here: https://github.com/spraza/foundationdb/blob/392bad2bd31ba146e05641680929da2c10480bc5/fdbserver/workloads/ClogTlog.actor.cpp#L168.

Any thoughts based on general practice of writing reliable simulation tests?

lagMeasurementFrequencySec = getOption(options, "lagMeasurementFrequencySec"_sr, 5);
clogInitDelaySec = getOption(options, "clogInitDelaySec"_sr, 5);
clogDurationSec = getOption(options, "clogDurationSec"_sr, 5);
lagThresholdSec = getOption(options, "lagThresholdSec"_sr, 5);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe try a larger value, say 60, to see if the commented out assertion still triggers?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current values in TOML file trigger the assert (if uncommented). I should update the default values here to reflect TOML file values, so that even if TOML file does not specify values, the defaults are good enough to reproduce the issue. Makes sense? Or did you have any other concern?

@spraza
Copy link
Collaborator Author

spraza commented Oct 16, 2024

Fyi: dropping this PR. The new test (with addressed feedback and other improvements) as well as the gray failure features are here: #11717.

@spraza spraza closed this Oct 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants