Don't retry POSTs #101

lognaturel · 2024-11-04T19:24:50Z

Many of the resources in Central have no user-specified unique key. This means that running the same POST multiple times can result in multiple resources that are duplicates from the end user's perspective. For example, App Users only have a display name so there's no guard against creating 100 App Users with the same name.

This is particularly problematic with bulk Entity creation. If a user specifies a large CSV, it may take the server a little bit to process. pyodk may get a timeout. If it retries, it adds the same batch of Entities a second time.

The issue is introduced at pyodk/pyodk/_utils/session.py, line 54 in b2cffe3:

allowed_methods=("GET", "PUT", "POST", "DELETE"),

@tobiasmcnulty has pointed out that using the default instead is likely a good fix:
https://urllib3.readthedocs.io/en/stable/reference/urllib3.util.html#urllib3.util.Retry
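
For illustration, a minimal sketch of the difference between the current method list and the urllib3 default, assuming urllib3 1.26 or later (where the parameter is named `allowed_methods`). The `total` and `backoff_factor` values are placeholders, not necessarily what pyodk uses.

```python
from urllib3.util import Retry

# Current behaviour as quoted above: POST is explicitly listed as retryable.
# total and backoff_factor are illustrative values only.
current = Retry(
    total=3,
    backoff_factor=2,
    allowed_methods=("GET", "PUT", "POST", "DELETE"),
)

# Proposed behaviour: omit allowed_methods so urllib3 falls back to
# Retry.DEFAULT_ALLOWED_METHODS, which contains only idempotent methods.
proposed = Retry(total=3, backoff_factor=2)

print(sorted(current.allowed_methods))
print(sorted(proposed.allowed_methods))
print("POST" in proposed.allowed_methods)
```

With the default list, a timed-out POST would surface as an error to the caller instead of being silently sent again.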

lindsay-stevens · 2024-11-05T07:52:40Z

@lognaturel the client session request methods accept an optional timeout parameter. e.g.

from pyodk import Client
client = Client()
client.post("http://www.example.com", ..., timeout=90)

Requests are retried in cases where there's a connection timeout or something similar. When this was added 2 years ago, it wasn't considered realistic that requests would take more than 30 seconds to complete, i.e. a connection timeout was a more likely scenario. Do you want to disable retries for all POSTs or just that endpoint/method (or those that accept bulk data)?

lindsay-stevens · 2024-11-05T14:07:30Z

One other thing, maybe @tobiasmcnulty knows the details: when this POST retry issue was triggered, was Central responding to any other requests? i.e., was it completely saturated with these bulk entity calls? Just wondering if there's a Central issue here too. In this case, for example, maybe it should accept the request and reply HTTP/202 immediately but process it in the background.

tobiasmcnulty · 2024-11-05T14:36:31Z

The error in our case was a read timeout; i.e., the server merely hadn't finished processing the request yet. Central wasn't otherwise under high load, though it's possible our database server was, which slowed down the request processing. If Central knows its own server-side timeout and wants to send back a 202 right before the timeout is hit that might be a nice touch, but I probably wouldn't bother...the error is only occasional on our side, and like you said we can increase the client side timeout, at least, if it gets bothersome. (I still think pyodk should auto-retry only idempotent HTTP methods.)

Talking this through, another helpful change might be to increase the default pyodk timeout to something larger than Central's server side timeout. Then, barring connectivity issues, the client should always get back a status code that tells it whether the request was successful, was queued, or failed.

lognaturel · 2024-11-05T16:18:17Z

> it wasn't considered realistic that requests would take more than 30 seconds to complete

That makes sense. I think it's much more likely now for other endpoints like the OData endpoints. People have a lot of data now!

> I still think pyodk should auto-retry only idempotent HTTP methods

This feels right to me too. The risk of issues may be low but I think they'd be hard to track down and may cause subtle cascading problems. There are some POST calls that do end up being idempotent but I think it's simplest to not retry any POST and let clients handle timeouts or other errors.

> increase the default pyodk timeout to something larger than Central's server side timeout

It's at 120s https://github.com/getodk/central/blob/2d95a612218e7c73ee8b3e715749401f08fc00ac/files/nginx/odk.conf.template#L60

lindsay-stevens · 2024-11-05T20:36:29Z

OK, we seem pretty set on switching to the default urllib3 method list. About the timeout, it seems sensible to match Central. Happy to be corrected on this, but I think the timeout applies to the original request and to each retry request. Each retry has an incremental sleep time, which is 2 * 2^retry_count. So for 120s the worst case would be about 8.5 minutes = 508s = 120 + (120 + (2 * 2^1)) + (120 + (2 * 2^2)) + (120 + (2 * 2^3)). For 30s it's about 2.5 minutes (148s).
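
The arithmetic above can be checked with a small sketch. It encodes the formula exactly as stated in the comment (one original request plus 3 retries, each retry preceded by a sleep of 2 * 2^retry_count). Whether urllib3 sleeps on precisely this schedule is an assumption carried over from the comment, and `worst_case_seconds` is a hypothetical helper, not part of pyodk.

```python
def worst_case_seconds(timeout: int, retries: int = 3, backoff_factor: int = 2) -> int:
    """Total wait if the original request and every retry hit the timeout."""
    total = timeout  # the original request
    for retry_count in range(1, retries + 1):
        sleep = backoff_factor * 2**retry_count  # incremental sleep before this retry
        total += sleep + timeout
    return total


for timeout in (120, 30):
    seconds = worst_case_seconds(timeout)
    print(f"timeout={timeout}s -> {seconds}s ({seconds / 60:.1f} min)")
```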

lognaturel assigned lindsay-stevens Nov 4, 2024
