DAOS-16702 rebuild: restart rebuild for a massive failure case #15406
base: release/2.6
Ticket title is 'Rebuilding cannot be completed after restarting ranks in cases of massive failures.'
Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15406/1/execution/node/313/log
Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15406/1/execution/node/344/log
Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15406/1/execution/node/282/log
Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15406/1/execution/node/357/log
Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15406/1/execution/node/453/log
Test stage Build on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15406/1/execution/node/495/log
eed3dc3 to 0f3769a
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15406/2/execution/node/1397/log
Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15406/4/display/redirect
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15406/4/display/redirect
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15406/5/display/redirect
0f3769a to 0bfe900
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15406/6/display/redirect
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15406/6/display/redirect
In a special massive-failure case:
1. Some engines go down and trigger a rebuild.
2. One engine that participates in the rebuild goes down again before the rebuild finishes; the number of failures now exceeds the pool RF, so the pool map is not changed.
3. That engine is restarted by the administrator.
In that case the rebuild task should be recovered on that engine; to keep it simple, for now just abort and retry the global rebuild task.
This issue does not occur with the typical recovery approach, which restarts the whole system including the PS leader.
Another backported commit:
947c76d DAOS-16175 container: fix a case for cont_iv_hdl_fetch (#15395)
Skip-nlt: true
Signed-off-by: Xuezhao Liu <[email protected]>
0bfe900 to 9643482
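In a special massive-failure case:
1. Some engines go down and trigger a rebuild.
2. One engine that participates in the rebuild goes down again before the rebuild finishes; the number of failures now exceeds the pool RF, so the pool map is not changed.
3. That engine is restarted by the administrator.
In that case the rebuild task should be recovered on that engine; to keep it simple, for now just abort and retry the global rebuild task.
This issue does not occur with the typical recovery approach, which restarts the whole system including the PS leader.
Another backported commit:
947c76d DAOS-16175 container: fix a case for cont_iv_hdl_fetch (#15395)
Skip-nlt: true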
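The decision described above can be sketched as a small C model. This is only an illustration of the stated condition, not the code in this PR: the type `rebuild_state`, the enum `rebuild_action`, and the function `on_rank_restart` are hypothetical names and do not correspond to actual DAOS symbols.

```c
#include <stdbool.h>

/* Hypothetical, simplified model; none of these names are DAOS APIs. */
enum rebuild_action {
	REBUILD_CONTINUE,        /* no special handling needed */
	REBUILD_ABORT_AND_RETRY  /* abort the global rebuild task and retry it */
};

struct rebuild_state {
	bool         in_progress;       /* a global rebuild task is still running */
	bool         rank_participated; /* the restarted rank took part in it */
	unsigned int nr_failed;         /* failures seen before the rank came back */
	unsigned int pool_rf;           /* pool redundancy factor (RF) */
};

/* Decide what to do when an administrator restarts a rank. */
static enum rebuild_action
on_rank_restart(const struct rebuild_state *st)
{
	/*
	 * Assumption taken from the description: once failures exceed RF,
	 * the pool map is not changed, so no map update will restart the
	 * rebuild work that the rank lost when it went down.
	 */
	bool map_unchanged = st->nr_failed > st->pool_rf;

	if (st->in_progress && st->rank_participated && map_unchanged)
		return REBUILD_ABORT_AND_RETRY;

	return REBUILD_CONTINUE;
}
```

Under these assumptions, a restarted rank that was part of an unfinished rebuild while failures exceeded RF yields `REBUILD_ABORT_AND_RETRY`; every other combination leaves the running task alone.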
Before requesting gatekeeper:
Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
Gatekeeper: