[gray-failure] Add simulation test that shows RemoteTLog issue #11672

spraza · 2024-09-20T05:54:33Z

When a remote tLog has an outbound network issue, for example to a remote storage server (SS), the gray failure algorithm currently neither detects nor recovers from it. I am working on a fix; in the meantime, I am sending the simulation test for review. The assert in the simulation test is commented out for now, and I will enable it as part of the fix PR.

spraza · 2024-09-20T05:56:41Z

tests/rare/ClogRemoteTlog.toml

+processesPerMachine = 1
+machineCount = 20
+generateFearless = true
+minimumRegions = 2


FYI: Although generateFearless simulates HA mode with 4 DCs, all of those DCs can end up in the same region (this happens in this PR with the rocksdb storage engine). I am therefore setting minimumRegions to 2 explicitly, since the test relies on 2 regions.

spraza · 2024-09-20T05:58:57Z

fdbserver/workloads/ClogRemoteTlog.actor.cpp

+ // TODO (praza): Uncomment this assert when gray failure detection is improved to automatically fix this issue
+ // ASSERT(maxSSLagSec < self->lagThresholdSec);


I have confirmed that the test fails when the assert is uncommented.

foundationdb-ci · 2024-09-20T06:01:48Z

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

Commit ID: 81c773f
Duration 0:07:05
Result: ❌ FAILED
Error: Error while executing command: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${HOME}/.ssh_key ec2-user@${MAC_EC2_HOST} /opt/homebrew/bin/bash --login -c ./build_pr_macos.sh. Reason: exit status 1
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2024-09-20T06:06:24Z

Result of foundationdb-pr-macos on macOS Ventura 13.x

Commit ID: 81c773f
Duration 0:11:40
Result: ❌ FAILED
Error: Error while executing command: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${HOME}/.ssh_key ec2-user@${MAC_EC2_HOST} /usr/local/bin/bash --login -c ./build_pr_macos.sh. Reason: exit status 1
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2024-09-20T06:16:09Z

Result of foundationdb-pr-clang-ide on Linux CentOS 7

Commit ID: 81c773f
Duration 0:21:27
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2024-09-20T06:45:57Z

Result of foundationdb-pr-clang-arm on Linux CentOS 7

Commit ID: 81c773f
Duration 0:51:14
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2024-09-20T06:49:07Z

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

Commit ID: 81c773f
Duration 0:54:24
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)
Cluster Test Logs zip file of the test logs (available for 30 days)

foundationdb-ci · 2024-09-20T07:01:06Z

Result of foundationdb-pr-clang on Linux CentOS 7

Commit ID: 81c773f
Duration 1:06:20
Result: ❌ FAILED
Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2024-09-20T07:10:41Z

Result of foundationdb-pr on Linux CentOS 7

Commit ID: 81c773f
Duration 1:15:56
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

jzhou77 · 2024-09-20T16:06:46Z

fdbserver/workloads/ClogRemoteTlog.actor.cpp

+ Transaction tr(db);
+ tr.setOption(FDBTransactionOptions::READ_SYSTEM_KEYS);
+ tr.setOption(FDBTransactionOptions::PRIORITY_SYSTEM_IMMEDIATE);
+ tr.setOption(FDBTransactionOptions::LOCK_AWARE);
+ std::vector<std::pair<StorageServerInterface, ProcessClass>> results =
+ wait(NativeAPI::getServerListAndProcessClasses(&tr));
+ for (auto& [ssi, p] : results) {
+ if (ssi.locality.dcId().present() && ssi.locality.dcId().get() == g_simulator->remoteDcId) {
+ ret.push_back(ssi.address().ip);
+ }
+ }


These transaction reads should be wrapped in a try...catch block so that the read can be retried when it encounters an error.

Good point, will update the PR with this fix soon.
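
For reference, the retry pattern suggested above can be modeled in plain C++. This is only a standalone sketch, not the Flow actor code from the PR: `RetryableError`, `readWithRetry`, and `FlakyServerList` are hypothetical names introduced for illustration. In an actual Flow actor the usual shape is a `loop` whose catch block does `wait(tr.onError(e))`, which this sketch approximates with the `onRetry` callback.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

using IpList = std::vector<std::string>;

// Hypothetical stand-in for a retryable transaction error.
struct RetryableError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Calls `readOnce` until it succeeds or `maxAttempts` attempts have failed.
// `onRetry` models the reset/backoff step that runs between attempts.
IpList readWithRetry(const std::function<IpList()>& readOnce,
                     const std::function<void()>& onRetry,
                     int maxAttempts) {
    for (int attempt = 1;; ++attempt) {
        try {
            return readOnce();
        } catch (const RetryableError&) {
            if (attempt >= maxAttempts) {
                throw; // give up and surface the last error
            }
            onRetry();
        }
    }
}

// Hypothetical read that fails a fixed number of times before succeeding.
struct FlakyServerList {
    int failuresLeft = 2;
    int resets = 0;

    IpList read() {
        if (failuresLeft > 0) {
            --failuresLeft;
            throw RetryableError("simulated transient failure");
        }
        return { "10.0.0.1", "10.0.0.2" };
    }
};
```

The attempt cap is a choice made for this sketch so that it always terminates; a simulation workload may instead retry until the test timeout cancels it.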

jzhou77 · 2024-09-20T16:09:47Z

fdbserver/workloads/ClogRemoteTlog.actor.cpp

+ TraceEvent("UnclogRemoteTLog").detail("SrcIP", remoteTLogIP).detail("DstIP", remoteSSIP);
+
+ // We only clog once, so this actor never finishes
+ wait(Never());


If this actor never finishes, then workload() can only exit because of a cancel due to timeout. Is this intentional?

Yes, the cancel due to timeout (the timeout value is testDuration) is intentional. The overall intent is:

Run the measurement and clog actors concurrently.

If the measurement actor sees high SS lag, it asserts (the assert will be enabled in the feature PR).

The clog actor's job is just to clog and unclog once, then do nothing until the end of the test (determined by the timeout/testDuration value).

Any concerns with this approach?

Relatedly, I still want to ensure that clog+unclog has already happened before the end of the test (determined by timeout/testDuration). In theory, we could get stuck somewhere and the test could exit before clog+unclog has run. In that situation, the test should fail. I see two ways to fix this:

Maintain explicit state recording that clog+unclog happened and that we have enough measurement samples, and assert at test exit that the state is as expected.

Try this approach: https://github.com/spraza/foundationdb/blob/392bad2bd31ba146e05641680929da2c10480bc5/fdbserver/workloads/ClogTlog.actor.cpp#L197-L201. However, I am not sure how the value 10 in the workloadEnd calculation was picked: https://github.com/spraza/foundationdb/blob/392bad2bd31ba146e05641680929da2c10480bc5/fdbserver/workloads/ClogTlog.actor.cpp#L168.

Any thoughts based on general practice of writing reliable simulation tests?
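
The explicit-state idea could look roughly like the sketch below. It is a plain C++ model under stated assumptions, not code from the PR: `ClogTestProgress` and its members are hypothetical names, and the minimum sample count is left to the caller.

```cpp
#include <cassert>

// Hypothetical progress record shared by the clog and measurement actors.
struct ClogTestProgress {
    bool clogged = false;
    bool unclogged = false;
    int lagSamples = 0;

    void recordClog() { clogged = true; }
    void recordUnclog() { unclogged = true; }
    void recordLagSample() { ++lagSamples; }

    // True only if the full clog+unclog cycle ran and at least
    // `minLagSamples` lag measurements were taken.
    bool completed(int minLagSamples) const {
        return clogged && unclogged && lagSamples >= minLagSamples;
    }
};
```

One place such a check might live is the workload's `check()` phase, so that a run which timed out before clog+unclog is reported as a failure rather than silently passing. Whether that fits this workload is an assumption.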

jzhou77 · 2024-09-20T16:12:49Z

fdbserver/workloads/ClogRemoteTlog.actor.cpp

+ lagMeasurementFrequencySec = getOption(options, "lagMeasurementFrequencySec"_sr, 5);
+ clogInitDelaySec = getOption(options, "clogInitDelaySec"_sr, 5);
+ clogDurationSec = getOption(options, "clogDurationSec"_sr, 5);
+ lagThresholdSec = getOption(options, "lagThresholdSec"_sr, 5);


Maybe try a larger value, say 60, to see whether the commented-out assertion still triggers?

The current values in the TOML file trigger the assert (if uncommented). I should update the default values here to match the TOML file, so that even if the TOML file does not specify values, the defaults are good enough to reproduce the issue. Does that make sense, or did you have another concern?
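
To illustrate why the defaults matter, here is a small plain C++ model of option lookup with a fallback. `Options` and `optionOr` are hypothetical stand-ins, not the workload's `getOption`; the sketch only shows that a key missing from the test file silently takes the default passed in code.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for parsed test options (key -> integer value).
using Options = std::map<std::string, int>;

// Returns the configured value, or `defaultValue` when the key is absent.
int optionOr(const Options& options, const std::string& key, int defaultValue) {
    auto it = options.find(key);
    return it == options.end() ? defaultValue : it->second;
}
```

If the in-code defaults differ from the values in the test file, a run that omits a key exercises a different scenario from the one that reproduces the issue, which is the reason for keeping the two in sync.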

spraza · 2024-10-16T00:38:51Z

FYI: I am dropping this PR. The new test (with the feedback addressed and other improvements), along with the gray failure features, is in #11717.

[gray-failure] Add simulation test that shows RemoteTLog issue

81c773f

spraza requested a review from jzhou77 September 20, 2024 05:54

spraza closed this Oct 16, 2024
