
SQS - Failure to restore unacked messages #1835

Open
YahelOpp opened this issue Dec 19, 2023 · 4 comments

@YahelOpp

Versions:

  • Python 3.10
  • Kombu 5.3.4
  • Celery 5.3.6
  • Boto3 1.33.7

After seeing messages stuck in the NotVisible state in our production queues, I set out to verify that the warm shutdown mechanism works properly. My test setup (a rough sketch of the task follows the list):

  • Simple task: sleeps for a given number of seconds, then retries with a countdown of 10.
  • Max retries: 3
  • Single worker, threads pool, concurrency 5
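
Roughly, the task looks like this (a minimal sketch; the names and broker URL are illustrative, not the exact production code):

import time

from celery import Celery

app = Celery("repro", broker="sqs://")  # credentials/region omitted

@app.task(bind=True, max_retries=3)
def sleep_then_retry(self, seconds):
    # Sleep for the requested time, then retry with a 10 second countdown.
    # During that countdown the retried message sits in SQS as NotVisible.
    time.sleep(seconds)
    raise self.retry(countdown=10)

# Worker started as: celery -A repro worker --pool=threads --concurrency=5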

What I see happening:

  • A task is run
  • It raises the retry -> putting a new message in the queue
  • The message is consumed by the same worker directly (message received)
  • During the countdown period, the message is in NotVisible state
  • During the countdown period, I shut down the worker with SIGTERM
  • The worker successfully starts a Warm shutdown
  • The worker notes that it is restoring 1 unacked message
  • The message stays in NotVisible state

Expected behavior: the message should move to Visible state.

A possible explanation I've found:
By manually changing the code in kombu/transport/SQS.py, in the _put function's message.get('redelivered') branch, and adding a duplicate call to change_message_visibility, the problem seemed to go away. I've tested this against real live SQS and against LocalStack; in both, the problem was consistent and was solved once I added the duplicate call.

To sum up, the problem might be on boto3's side and not in kombu, but I'm unsure about this.
I will be happy to assist and make a PR / provide a simple testing scenario.

Thanks

@YahelOpp
Author

The patch I applied that makes it work as expected is:


# Patch a fix for the SQS channel
from kombu.transport.SQS import Transport

original_put = Transport.Channel._put

def patched_put(self, queue, message, **kwargs):
    original_put(self, queue, message, **kwargs)
    if message.get("redelivered"):
        # Repeat the visibility change that _put already makes for
        # redelivered messages so the restored message becomes Visible.
        q_url = self._new_queue(queue)
        c = self.sqs(queue=self.canonical_queue_name(queue))
        c.change_message_visibility(
            QueueUrl=q_url,
            ReceiptHandle=message['properties']['delivery_tag'],
            VisibilityTimeout=0,
        )

Transport.Channel._put = patched_put

@pbudzon

pbudzon commented Jul 1, 2024

I think this is similar to the issue we're seeing, but in our case it happens during a cold shutdown. Our scenario is:

  • run celery with concurrency > 1
  • run a long(ish) running task
  • send SIGQUIT to celery while task is running
  • see Restoring X unacknowledged message(s) warning in the log
  • the message still shows as in-flight (unavailable, invisible) in the SQS

Relevant settings (a config sketch follows the list):

task_reject_on_worker_lost: True
task_acks_late: True
ReceiveMessageWaitTimeSeconds on the queue: 20 seconds
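
For reference, a minimal sketch of how we set these in the Celery config (illustrative; the 20 second ReceiveMessageWaitTimeSeconds is an attribute configured on the SQS queue itself):

# celeryconfig.py (sketch)
task_acks_late = True
task_reject_on_worker_lost = True

# Work-around 2 below corresponds to forcing short polling, e.g.:
# broker_transport_options = {"polling_interval": 0}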

From my investigation, this seems to be caused by a race condition arising from the long polling and/or multiple workers (concurrency > 1).
In CloudTrail I can see that the ChangeMessageVisibility call happens correctly, but there's also another ReceiveMessage call up to 3 seconds later from the same host, which puts the message back into invisible state.

Work-arounds I've found:

  1. Changing concurrency to 1 prevents this from happening - not really a solution for production environments.
  2. Using short polling: polling_interval set to 0 also solves this. But that can have severe cost implications, so it's also not a viable solution.

I think this could be easily prevented by adjusting the ChangeMessageVisibility call so that it sets VisibilityTimeout to whatever the polling interval is (a rough sketch follows below).
That would hide the message from any existing long polling requests (including those from the other workers that are currently being shut down) long enough for the shutdown to complete. Even if the polling interval is set to the maximum of 20 seconds, that's usually still much better than whatever VisibilityTimeout is set to (which is 30 minutes in our case because of long-running tasks!).
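
A rough sketch of the idea, reusing the monkeypatch approach from the earlier comment rather than editing kombu directly (untested; the 20 second value stands in for the configured polling/wait interval):

from kombu.transport.SQS import Transport

original_put = Transport.Channel._put

def patched_put(self, queue, message, **kwargs):
    original_put(self, queue, message, **kwargs)
    if message.get("redelivered"):
        q_url = self._new_queue(queue)
        c = self.sqs(queue=self.canonical_queue_name(queue))
        # Hide the restored message for the long-poll window instead of
        # making it visible immediately, so ReceiveMessage calls still in
        # flight on the shutting-down workers cannot grab it back.
        c.change_message_visibility(
            QueueUrl=q_url,
            ReceiptHandle=message["properties"]["delivery_tag"],
            VisibilityTimeout=20,  # example: the long-poll wait time
        )

Transport.Channel._put = patched_put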

Obviously, a better solution would be to restore the messages after all workers under the master process finish/stop their ReceiveMessage requests - but I'm not sure if that's possible with boto?

Possibly related to: celery/celery#8583

software -> celery:5.3.6 (emerald-rush) kombu:5.3.7 py:3.9.19
            billiard:4.2.0 sqs:N/A
platform -> system:Linux arch:64bit, ELF
            kernel version:5.10.218-208.862.amzn2.x86_64 imp:CPython
loader   -> celery.loaders.app.AppLoader
settings -> transport:sqs results:disabled

@Nusnus
Member

Nusnus commented Aug 30, 2024

Potentially fixed with celery/celery#9213 (although it’s in Celery, not Kombu) @YahelOpp

@Nusnus
Member

Nusnus commented Sep 8, 2024


Celery v5.5.0b3 released.
