Wait to mark reports as aggregated until just before committing #374
Conversation
```rust
    max_time: out_share.time,
    checksum: out_share.checksum,
    data: Some(out_share.data),
})?;
```
We can't aggregate the output shares yet because we may still need to reject them in case of replay.
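For concreteness, here is a minimal Rust sketch of that ordering. The `OutShare` struct, `stage_out_shares` function, and integer report IDs are hypothetical stand-ins, not Daphne's actual API; the point is only that prepared output shares are staged, and replayed reports rejected, before anything is committed as aggregated.

```rust
use std::collections::HashSet;

#[allow(dead_code)]
struct OutShare {
    report_id: u64,
    time: u64,
    checksum: [u8; 32],
    data: Vec<u8>,
}

/// Stage prepared output shares without committing them. Replayed reports
/// are rejected here; only the survivors may later be marked as aggregated.
fn stage_out_shares(
    prepared: Vec<OutShare>,
    seen_report_ids: &HashSet<u64>,
) -> (Vec<OutShare>, Vec<u64>) {
    let (mut accepted, mut rejected) = (Vec::new(), Vec::new());
    for out_share in prepared {
        if seen_report_ids.contains(&out_share.report_id) {
            // Replay detected: reject instead of aggregating.
            rejected.push(out_share.report_id);
        } else {
            accepted.push(out_share);
        }
    }
    (accepted, rejected)
}

fn main() {
    let seen: HashSet<u64> = [42].into_iter().collect();
    let prepared = vec![
        OutShare { report_id: 1, time: 0, checksum: [0; 32], data: vec![1] },
        OutShare { report_id: 42, time: 0, checksum: [0; 32], data: vec![2] },
    ];
    let (accepted, rejected) = stage_out_shares(prepared, &seen);
    println!("accepted: {}, rejected as replays: {:?}", accepted.len(), rejected);
}
```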
```rust
// replayed. This would happen if, for example, a fresh report was marked as
// aggregated in the failed request, then marked as replayed by a retried
// request. In this case we cannot determine definitively if the report was
// replayed, so we drop it to be safe.
```
We no longer mark the reports as aggregated at this point, so this logic for handling failed Durable Object (DO) requests is no longer necessary. Retry should be safe.
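To spell out the failure mode the removed comment describes, here is a toy Rust illustration of the old flow; the function and the bare `HashSet` of report IDs are hypothetical, not the actual Durable Object storage. Marking a report before committing meant a failed-then-retried request would mistake its own report for a replay.

```rust
use std::collections::HashSet;

// Old flow, condensed into one function for illustration: the first attempt
// marks the report as aggregated, then the runtime resets before anything is
// committed. The retry then sees the report as already aggregated.
fn old_flow_retry_rejects(marked: &mut HashSet<u64>, report_id: u64) -> bool {
    marked.insert(report_id); // first attempt marks the report...
    // ...the request fails here, so nothing is committed.

    // Retry: `insert` returns false because the report is already marked,
    // so it is (wrongly) treated as a replay and rejected.
    !marked.insert(report_id)
}

fn main() {
    let mut marked = HashSet::new();
    assert!(old_flow_retry_rejects(&mut marked, 7));
    println!("old flow: retry spuriously rejected report 7 as a replay");
}
```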
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense to me, and I don't see any problems.
Loads that require a lot of CPU time can trigger the Workers runtime to reset itself, causing a 500 error. This is most likely to happen when processing the `AggregationJobInitReq`, as this involves the expensive VDAF prep-initialization step. Some reports may still get marked as aggregated; if the peer retries the request, those reports will get rejected.

This problem is common enough that we need to make this request idempotent. As a first step, wait to mark the reports as aggregated until just before we have committed to aggregating them. In particular, instead of doing this in `DapReportInitializer::initialize_reports()`, wait until `DapAggregator::put_out_shares()`.
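A rough sketch of where the marking now happens, with a hypothetical `Storage` type standing in for the real aggregator state (`put_out_shares` is the method named above; everything else here is assumed). Marking moves into the commit step, so a request that dies during report initialization leaves no reports marked and can be retried cleanly.

```rust
use std::collections::HashSet;

#[derive(Default)]
struct Storage {
    aggregated: HashSet<u64>,       // report IDs marked as aggregated
    committed: Vec<(u64, Vec<u8>)>, // committed output shares
}

impl Storage {
    /// Commit output shares, marking each report as aggregated just before
    /// it is committed rather than earlier, during report initialization.
    fn put_out_shares(&mut self, out_shares: Vec<(u64, Vec<u8>)>) {
        for (report_id, data) in out_shares {
            // Mark and commit together: either both happen or neither does.
            if self.aggregated.insert(report_id) {
                self.committed.push((report_id, data));
            }
        }
    }
}

fn main() {
    let mut storage = Storage::default();
    // Suppose the first attempt resets before reaching this call: no report
    // was marked, so the retried request starts from a clean slate.
    storage.put_out_shares(vec![(1, vec![0xaa]), (2, vec![0xbb])]);
    assert_eq!(storage.committed.len(), 2);
    println!("committed {} reports", storage.committed.len());
}
```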
Force-pushed from 01b1bf9 to 7c08da7.