Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't retry POSTs #101

Open
lognaturel opened this issue Nov 4, 2024 · 5 comments
Open

Don't retry POSTs #101

lognaturel opened this issue Nov 4, 2024 · 5 comments
Assignees

Comments

@lognaturel
Copy link
Member

Many of the resources in Central have no user-specified unique key. This means that running the same POST multiple times can result in multiple resources that are duplicates from the end user's perspective. For example, App Users only have a display name so there's no guard against creating 100 App Users with the same name.

This is particularly problematic with bulk Entity creation. If a user specifies a large CSV, it may take the server a little bit to process. pyodk may get a timeout. If it retries, it adds the same batch of Entities a second time.

The issue is introduced at

allowed_methods=("GET", "PUT", "POST", "DELETE"),

@tobiasmcnulty has pointed out that using the default instead is likely a good fix:
https://urllib3.readthedocs.io/en/stable/reference/urllib3.util.html#urllib3.util.Retry

@lindsay-stevens
Copy link
Contributor

@lognaturel the client session request methods accept an optional timeout parameter. e.g.

from pyodk import Client
client = Client()
client.post("http://www.example.com", ..., timeout=90)

Requests are retried in case where there's a connection timeout or something. When this was added 2 years ago, it wasn't considered realistic that requests would take more 30 seconds to complete, i.e. a connection timeout was a more likely scenario. Do you want to disable retries for all POSTs or just that endpoint/method (or those that accept bulk data)?

@lindsay-stevens
Copy link
Contributor

One other thing, maybe @tobiasmcnulty knows the details - when this POST retry issue was triggered, was Central responding to any other requests? i.e was it completely saturated with these bulk entity calls? Just wondering if there's a Central issue here too - like in this case maybe it should accept the request and reply HTTP/202 immediately but process it in the background.

@tobiasmcnulty
Copy link
Contributor

The error in our case was a read timeout; i.e., the server merely hadn't finished processing the request yet. Central wasn't otherwise under high load, though it's possible our database server was, which slowed down the request processing. If Central knows its own server-side timeout and wants to send back a 202 right before the timeout is hit that might be a nice touch, but I probably wouldn't bother...the error is only occasional on our side, and like you said we can increase the client side timeout, at least, if it gets bothersome. (I still think pyodk should auto-retry only idempotent HTTP methods.)

Talking this through, another helpful change might be to increase the default pyodk timeout to something larger than Central's server side timeout. Then, barring connectivity issues, the client should always get back a status code that tells it whether the request was successful, was queued, or failed.

@lognaturel
Copy link
Member Author

it wasn't considered realistic that requests would take more 30 seconds to complete

That makes sense. I think it's much more likely now for other endpoints like the OData endpoints. People have a lot of data now!

I still think pyodk should auto-retry only idempotent HTTP methods

This feels right to me too. The risk of issues may be low but I think they'd be hard to track down and may cause subtle cascading problems. There are some POST calls that do end up being idempotent but I think it's simplest to not retry any POST and let clients handle timeouts or other errors.

increase the default pyodk timeout to something larger than Central's server side timeout

It's at 120s https://github.com/getodk/central/blob/2d95a612218e7c73ee8b3e715749401f08fc00ac/files/nginx/odk.conf.template#L60

@lindsay-stevens
Copy link
Contributor

OK we seem pretty set on switching to the default urllib3 method list. About the timeout, it seems sensible to match Central. Happy to be corrected on this but I think timeout applies to the original request and each retry request. Each retry has a incremental sleep time which 2 * 2 ^ retry_count. So for 120s the worst case would be about 8.5 minutes = 508s = 120 + (120 + (2 * 2^1)) + (120 + (2 * 2^2)) + (120 + (2 * 2^3)). For 30s it's 2.5min.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants