[gray-failure] Add simulation test that shows RemoteTLog issue #11672
Conversation
processesPerMachine = 1
machineCount = 20
generateFearless = true
minimumRegions = 2
FYI: Although generateFearless simulates HA mode with 4 DCs, it's possible that all those DCs end up in the same region (this happens in this PR with the rocksdb storage engine). So I'm setting minimumRegions to 2 explicitly, since the test relies on having 2 regions.
// TODO (praza): Uncomment this assert when gray failure detection is improved to automatically fix this issue
// ASSERT(maxSSLagSec < self->lagThresholdSec);
I have confirmed that the test fails when the assert is uncommented.
CI results: foundationdb-pr-macos-m1 and foundationdb-pr-macos on macOS Ventura 13.x; foundationdb-pr-clang-ide, foundationdb-pr-clang-arm, foundationdb-pr-cluster-tests, foundationdb-pr-clang, and foundationdb-pr on Linux CentOS 7.
Transaction tr(db);
tr.setOption(FDBTransactionOptions::READ_SYSTEM_KEYS);
tr.setOption(FDBTransactionOptions::PRIORITY_SYSTEM_IMMEDIATE);
tr.setOption(FDBTransactionOptions::LOCK_AWARE);
std::vector<std::pair<StorageServerInterface, ProcessClass>> results =
    wait(NativeAPI::getServerListAndProcessClasses(&tr));
for (auto& [ssi, p] : results) {
    if (ssi.locality.dcId().present() && ssi.locality.dcId().get() == g_simulator->remoteDcId) {
        ret.push_back(ssi.address().ip);
    }
}
This transaction read should be put into a try...catch block so that, when an error is encountered, the read can be retried.
Good point, will update the PR with this fix soon.
TraceEvent("UnclogRemoteTLog").detail("SrcIP", remoteTLogIP).detail("DstIP", remoteSSIP);

// We only clog once, so this actor never finishes
wait(Never());
If this actor never finishes, then workload() can only exit because of a cancel due to timeout. Is this intentional?
The cancel due to timeout (value is testDuration) is intentional. Overall intent is:
- Run measurement and clog actors concurrently.
- If the measurement actor sees high SS lag, it asserts (will enable in feature PR).
- Clog actor's job is to just clog+unclog and then do nothing until end of test (based on timeout/testDuration value).
Any concerns with this approach?
Relatedly, I still want to ensure clog+unclog have already happened before the end of the test (based on timeout/testDuration). Theoretically, it seems possible that we get stuck somewhere and the test exits before clog+unclog has happened; in that situation, the test should fail. To fix this:
- I can maintain explicit state that the clog+unclog happened, as well as we have enough measurement samples, and assert at test exit time that state value is as expected.
- Try this approach: https://github.com/spraza/foundationdb/blob/392bad2bd31ba146e05641680929da2c10480bc5/fdbserver/workloads/ClogTlog.actor.cpp#L197-L201. But I am not sure how we pick 10 in the workloadEnd calculation here: https://github.com/spraza/foundationdb/blob/392bad2bd31ba146e05641680929da2c10480bc5/fdbserver/workloads/ClogTlog.actor.cpp#L168.
Any thoughts based on general practice of writing reliable simulation tests?
lagMeasurementFrequencySec = getOption(options, "lagMeasurementFrequencySec"_sr, 5);
clogInitDelaySec = getOption(options, "clogInitDelaySec"_sr, 5);
clogDurationSec = getOption(options, "clogDurationSec"_sr, 5);
lagThresholdSec = getOption(options, "lagThresholdSec"_sr, 5);
maybe try a larger value, say 60, to see if the commented out assertion still triggers?
The current values in the TOML file trigger the assert (if uncommented). I should update the default values here to reflect the TOML file values, so that even if the TOML file does not specify values, the defaults are good enough to reproduce the issue. Makes sense? Or did you have any other concern?
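For reference, a TOML workload section overriding these knobs might look like the following sketch. The section layout follows the usual FDB simulation test format, but the test/workload names are illustrative (the actual names are not shown here), and the values mirror the code defaults above:

```toml
[[test]]
testTitle = 'RemoteTLogGrayFailure'   # illustrative name

    [[test.workload]]
    testName = 'ClogRemoteTLog'       # illustrative; actual workload name not shown here
    lagMeasurementFrequencySec = 5
    clogInitDelaySec = 5
    clogDurationSec = 5
    lagThresholdSec = 5
```

Keeping these in sync with the getOption defaults means the workload reproduces the issue even when the TOML omits the options.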
FYI: dropping this PR. The new test (with addressed feedback and other improvements) as well as the gray failure features are here: #11717.
When a remote tLog has an outbound network issue, for example to a remote SS, the gray failure algorithm currently does not detect or recover from it. I am working on fixing this, but meanwhile I am sending the simulation test for review. In the simulation test, I have commented out the assert for now, and will enable it as part of the fix PR.
Code-Reviewer Section
The general pull request guidelines can be found here.
Please check each of the following things and check all boxes before accepting a PR.
For Release-Branches
If this PR is made against a release-branch, please also check the following:
- This change is also merged into the release-branch (or main if this is the youngest branch)