Prevent clearing topic-partitions that are still assigned #648
Problem
To decrease the impact of rebalances during rolling bounces of k8s pods, we changed the `partition.assignment.strategy` from the default `RangeAssignor` to `CooperativeStickyAssignor`. After this change we encountered NPEs, and the `S3SinkTask` goes into an unrecoverable state. We did not find the same issue with `StickyAssignor`, however.

Example of an NPE (this is with `v10.0.7`):

Solution
`WorkerSinkTask` always sends a list of `topicPartitions` on `close`. We currently clear all the assigned `topicPartitionWriter`s on `close()`. This worked fine with stop-the-world rebalance strategies like `RangeAssignor` or `StickyAssignor`, since the current assignment would be fully closed. But with `CooperativeStickyAssignor`, only a few `topicPartition`s could be reassigned/closed. In such a scenario, clearing out all `topicPartitionWriter`s causes NPEs.

I am not sure if there is some historical context that I might be missing here and the `.clear()` is deliberate; I could not find clues in the commit history.

Test Strategy
Testing done:
I did not specifically write any tests for this case, nor am I aware of any existing tests that cover assignment strategies. Open to ideas on any necessary tests. We have been running with this patch applied for the past few days and don't see the same degradation.
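
As one idea for a test, here is a minimal sketch of the behavior described under Solution. It does not use the connector's code: `SinkTaskSketch`, its `Writer` class, and `closeClearingAll` are hypothetical names, and plain strings stand in for `TopicPartition`. It only illustrates why clearing every writer breaks under a partial (cooperative) revocation while removing just the revoked partitions does not.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Simplified stand-in for a sink task. This is NOT the real S3SinkTask.
class SinkTaskSketch {
    // Hypothetical per-partition writer that just counts records.
    static final class Writer {
        int written = 0;
        void write() { written++; }
    }

    // Partition name (a string standing in for TopicPartition) -> writer.
    private final Map<String, Writer> writers = new HashMap<>();

    // Create writers for newly assigned partitions, keeping existing ones.
    void open(Collection<String> partitions) {
        for (String tp : partitions) {
            writers.putIfAbsent(tp, new Writer());
        }
    }

    // Throws NullPointerException if the writer for tp has been dropped.
    void put(String tp) {
        writers.get(tp).write();
    }

    // Old behavior: drop every writer, regardless of what was revoked.
    void closeClearingAll(Collection<String> revoked) {
        writers.clear();
    }

    // Proposed behavior: drop only the writers for the revoked partitions.
    void close(Collection<String> revoked) {
        for (String tp : revoked) {
            writers.remove(tp);
        }
    }

    // Sorted view of the partitions that still have a writer.
    Set<String> assigned() {
        return new TreeSet<>(writers.keySet());
    }

    public static void main(String[] args) {
        List<String> all = List.of("topic-0", "topic-1", "topic-2");

        // Cooperative rebalance revokes only topic-1, but all writers are cleared.
        SinkTaskSketch old = new SinkTaskSketch();
        old.open(all);
        old.closeClearingAll(List.of("topic-1"));
        try {
            old.put("topic-0");
            System.out.println("old: put succeeded");
        } catch (NullPointerException e) {
            System.out.println("old: NPE on a partition that is still assigned");
        }

        // Same revocation, but only the revoked writer is removed.
        SinkTaskSketch fixed = new SinkTaskSketch();
        fixed.open(all);
        fixed.close(List.of("topic-1"));
        fixed.put("topic-0");
        System.out.println("fixed: still assigned " + fixed.assigned());
    }
}
```

A real test would presumably drive the task's `open`, `put`, and `close` with a partial revocation in the same order; the exact fixtures needed for that are not covered by this sketch.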