DAOS-16721 dtx: handle potential DTX ID reusing trouble - b26 #15409
Ticket title is 'Aurora: mdtest assertion on punch with 530 servers, 2048 * 104 clients'
src/object/cli_obj.c
Outdated
"Miss 'RESEND' flag (%x) when resend the RPC for task %p: %u\n", | ||
obj_auxi->flags, task, obj_auxi->retry_cnt); | ||
if (task->dt_result == -DER_TX_ID_REUSED && obj_auxi->retry_cnt != 0) { | ||
D_ERROR("Be complained as TX ID reused for unknown reason, " |
Please remove the first part, "Be complained as"; it is not a valid English phrase.
3e98f39
to
99c5406
Compare
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15409/4/display/redirect |
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15409/4/display/redirect |
Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15409/4/display/redirect |
99c5406
to
ef02646
Compare
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15409/6/display/redirect |
The patch contains the following improvements:

1. When VOS-level logic returns -DER_TX_RESTART, the object-level RPC handler should set the 'RESEND' flag and then restart the transaction with a newer epoch. dtx_abort() cannot guarantee that all previously prepared DTX entries (on all related participants) are aborted, especially if the earlier attempt failed because of a network problem, and that may cause the restarted transaction to hit -DER_TX_ID_REUSED unexpectedly.
2. Compare the epochs of DTX entries that share a transaction ID to distinguish a potentially reused TX ID more accurately.
3. Add the DTX entry to the DTX CoS cache if it cannot be committed synchronously, so that the subsequent batched commit logic can handle it.
4. If the server reports a suspected TX ID reuse, return -EIO to the application instead of asserting on the client.
5. Limit the frequency of DTX-related warning messages to avoid flooding the log.
6. Collect more information when generating some error/warning messages.

Allow-unstable-test: true
Signed-off-by: Fan Yong <[email protected]>
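Items 1, 2 and 4 of the commit message can be illustrated with a small sketch. This is not the DAOS implementation: every `sketch_` name, the placeholder error value, and the exact decision rule (same epoch means the same transaction, a different epoch with 'RESEND' set means a leftover from an earlier attempt, a different epoch without 'RESEND' means suspected reuse) are assumptions made only to show how an epoch comparison and the 'RESEND' flag could work together.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical placeholder; the real value of -DER_TX_ID_REUSED differs. */
#define SKETCH_TX_ID_REUSED (-2001)

/* Possible outcomes when an incoming request matches an existing DTX ID. */
enum sketch_dtx_match {
	SKETCH_DTX_SAME,          /* same ID and epoch: the same transaction */
	SKETCH_DTX_STALE,         /* leftover entry from an earlier attempt */
	SKETCH_DTX_SUSPECT_REUSE  /* different epoch, not flagged as a resend */
};

/*
 * Assumed server-side rule: compare the epoch of the existing entry with
 * the epoch of the incoming request that carries the same transaction ID.
 */
enum sketch_dtx_match
sketch_dtx_classify(uint64_t entry_epoch, uint64_t req_epoch, bool resend)
{
	if (entry_epoch == req_epoch)
		return SKETCH_DTX_SAME;

	/* A restarted transaction runs with a newer epoch and sets RESEND,
	 * so an entry with another epoch may be one that was never aborted. */
	if (resend)
		return SKETCH_DTX_STALE;

	return SKETCH_DTX_SUSPECT_REUSE;
}

/*
 * Assumed client-side rule: surface a suspected reuse as an I/O error
 * to the application instead of asserting.
 */
int
sketch_client_result(int rc)
{
	if (rc == SKETCH_TX_ID_REUSED)
		return -EIO;

	return rc;
}
```

With this rule, a restarted transaction that forgot to set 'RESEND' would be classified as suspected reuse, which is the failure mode item 1 is meant to avoid.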
ef02646
to
c4c195d
Compare
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15409/8/display/redirect |
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15409/9/display/redirect |
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15409/9/display/redirect |
Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15409/9/display/redirect |
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15409/10/display/redirect |
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15409/11/display/redirect |
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15409/13/display/redirect |
Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15409/13/display/redirect |
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15409/13/display/redirect |
The https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15409/13 build failed to run the Functional Hardware stages due to provisioning errors. I've manually kicked off https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15409/14/ to just run the Functional Hardware Medium, Medium Verbs Provider, and Large stages. |
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15409/14/display/redirect |
Functional HW testing passed in https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15409/16/ |
Looks like a mostly clean cherry-pick. It's not a familiar area of code for me, but LGTM.