Document how a Kubernetes-style liveness probe may be implemented #309

Draft
wants to merge 2 commits into
base: master

@meqif meqif commented Dec 12, 2022

Due to a bug in rdkafka, consumers sometimes stop processing one or more of their assigned partitions. A consumer-wide liveness check cannot detect this, because an individual consumer may keep making progress on some, but not all, of its assigned partitions.

In a system I own, I used consumer rebalance listeners (via monkey patching) to introduce a liveness probe that can detect a lack of progress in individual partitions.

The core idea is that progress in each partition is tracked through the timestamp of a dedicated file, and a Kubernetes liveness probe checks the age of those files. After a number of failures (defined in the probe), Kubernetes marks the pod as unhealthy and restarts it automatically, which fixes the issue.
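To make the age check concrete, here is a minimal sketch in Ruby. It is only an illustration of the idea, not the code from this PR: the helper name `stale_liveness_files`, the directory layout, and the 300-second threshold are all assumptions.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical helper (not part of any library): returns the regular files
# in `dir` whose last modification is more than `max_age` seconds old.
def stale_liveness_files(dir, max_age:, now: Time.now)
  Dir.glob(File.join(dir, "*")).select do |path|
    File.file?(path) && (now - File.mtime(path)) > max_age
  end
end

# Demo in a throwaway directory: one fresh file and one backdated file.
Dir.mktmpdir do |dir|
  FileUtils.touch(File.join(dir, "events-0"))
  FileUtils.touch(File.join(dir, "events-1"), mtime: Time.now - 600)

  stale = stale_liveness_files(dir, max_age: 300)
  puts "stale partitions: #{stale.map { |f| File.basename(f) }.inspect}"
end
```

A probe built on this would exit non-zero whenever the returned list is non-empty, so that repeated failures lead Kubernetes to restart the pod.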

In more detail:

  • the consumer object refreshes a "partition liveness" file whenever it processes a new message
  • a custom consumer rebalance listener reaps all existing "partition liveness" files on partition (re-)assignment
    • it would be possible to manage these files in a more targeted manner with the on_partitions_revoked handler, but ordering could cause issues; the blunt approach I documented is simpler and works correctly in practice
  • the Kubernetes liveness probe executes a small bash script that checks whether any of the "partition liveness" files is older than some maximum age
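The first two steps above could be sketched roughly as follows. This is an assumption-laden illustration rather than the implementation in this PR: the class `PartitionLivenessTracker`, its method names, and the `topic-partition` file naming are hypothetical, and the wiring into the consumer and the rebalance listener is left out.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical tracker; the names here are illustrative and do not belong to
# any library's API.
class PartitionLivenessTracker
  def initialize(dir)
    @dir = dir
    FileUtils.mkdir_p(@dir)
  end

  # Intended to be called after each processed message: creates the file if
  # needed and refreshes its modification time.
  def mark_progress(topic, partition)
    FileUtils.touch(path_for(topic, partition))
  end

  # Intended to be called from the rebalance listener on (re-)assignment:
  # removes every liveness file, so partitions that moved to another
  # consumer cannot look stalled here.
  def reset
    FileUtils.rm_f(Dir.glob(File.join(@dir, "*")))
  end

  # Names of the liveness files currently present.
  def tracked
    Dir.children(@dir).sort
  end

  private

  def path_for(topic, partition)
    File.join(@dir, "#{topic}-#{partition}")
  end
end

Dir.mktmpdir do |dir|
  tracker = PartitionLivenessTracker.new(dir)
  tracker.mark_progress("events", 0)
  tracker.mark_progress("events", 1)
  puts tracker.tracked.inspect
  tracker.reset # simulate a partition (re-)assignment
  puts tracker.tracked.inspect
end
```

With this shape, a file only reappears after the consumer processes another message from that partition, which is why wiping everything on (re-)assignment avoids the ordering concerns of a more targeted cleanup.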

This required adding some basic support for registering a consumer rebalance listener, which in the system I own is actually implemented through a monkey-patch.

/cc @jbdietrich

@jbdietrich
Contributor

@meqif sorry it's taken us so long to do a proper review of this. I think this is the current state of affairs:

We now have three open PRs that add liveness probe functionality (or at least document it).

#319 seems like the best fleshed-out implementation, so we will standardize on that as a documented solution.

However, the general idea in this PR of exposing an API to respond to rebalances is valuable outside of the context of a K8s liveness probe. Would you be willing to extract that part of the PR for review?
