
Wait to mark reports as aggregated until just before committing #374

Merged
merged 1 commit into main from cjpatton/defer-mark-aggregated
Jul 21, 2023

Conversation

cjpatton
Contributor

Loads that require a lot of CPU time can trigger the Workers runtime to reset itself, causing a 500 error. This is most likely to happen when processing the AggregationJobInitReq, as this involves the expensive VDAF prep initialization step. Some reports may still get marked as aggregated; if the peer retries the request, those reports will get rejected.

This problem is common enough that we need to make this request idempotent. As a first step, wait to mark the reports as aggregated until just before we have committed to aggregating them. In particular, instead of doing this in `DapReportInitializer::initialize_reports()`, wait until `DapAggregator::put_out_shares()`.
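To illustrate the idea (this is a minimal sketch, not the actual daphne code: the `Aggregator` struct, `u64` report IDs, and in-memory `HashSet` are hypothetical stand-ins for the real report metadata and Durable Object storage), the replay check and the "mark as aggregated" step are fused into a single operation at commit time, so a retried request observes the same state as the original:

```rust
use std::collections::HashSet;

// Hypothetical stand-in for the aggregator's replay-protection state.
struct Aggregator {
    aggregated: HashSet<u64>, // report IDs already committed
}

impl Aggregator {
    fn new() -> Self {
        Self { aggregated: HashSet::new() }
    }

    // After the change, initialization no longer marks reports as
    // aggregated, so a failed-then-retried request starts from the same
    // state and the expensive prep step can be repeated safely.
    fn initialize_reports(&self, report_ids: &[u64]) -> Vec<u64> {
        report_ids.to_vec()
    }

    // Marking happens here, atomically with the commit: replayed reports
    // are rejected, fresh reports are committed and marked in one step.
    fn put_out_shares(&mut self, report_ids: &[u64]) -> (Vec<u64>, Vec<u64>) {
        let mut committed = Vec::new();
        let mut replayed = Vec::new();
        for &id in report_ids {
            if self.aggregated.insert(id) {
                committed.push(id);
            } else {
                replayed.push(id);
            }
        }
        (committed, replayed)
    }
}

fn main() {
    let mut agg = Aggregator::new();
    let reports = agg.initialize_reports(&[1, 2, 3]);
    let (committed, replayed) = agg.put_out_shares(&reports);
    assert_eq!(committed, vec![1, 2, 3]);
    assert!(replayed.is_empty());

    // A retry that overlaps a committed report rejects only the true replay.
    let (committed, replayed) = agg.put_out_shares(&[3, 4]);
    assert_eq!(committed, vec![4]);
    assert_eq!(replayed, vec![3]);
    println!("committed {committed:?}, replayed {replayed:?}");
}
```

The design point is that no report is marked before the commit, so a request that dies mid-flight leaves no partial state behind.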

```rust
    max_time: out_share.time,
    checksum: out_share.checksum,
    data: Some(out_share.data),
})?;
```
Contributor Author

@cjpatton cjpatton Jul 21, 2023


We can't aggregate the output shares yet because we still may need to reject in case of replay.

```rust
// replayed. This would happen if, for example, a fresh report was marked as
// aggregated in the failed request, then marked as replayed by a retried
// request. In this case we cannot determine definitively if the report was
// replayed, so we drop it to be safe.
```
Contributor Author


We no longer mark the reports as aggregated during initialization, so this logic for handling failed Durable Object requests is no longer necessary. Retry should be safe.

@cjpatton cjpatton marked this pull request as ready for review July 21, 2023 19:09
Contributor

@bhalleycf bhalleycf left a comment


This makes sense to me, and I don't see any problems.

@cjpatton cjpatton force-pushed the cjpatton/defer-mark-aggregated branch from 01b1bf9 to 7c08da7 Compare July 21, 2023 19:51
@cjpatton cjpatton merged commit e6d1c2b into main Jul 21, 2023
4 checks passed
@cjpatton cjpatton deleted the cjpatton/defer-mark-aggregated branch July 21, 2023 20:07