[FLINK-36455] Sinks retry synchronously #25547

AHeise · 2024-10-18T13:28:39Z

What is the purpose of the change

So far, sinks retried failed commits asynchronously to increase commit throughput in case of temporary issues. However, the contract of notifyCheckpointCompleted states that checkpoints must be side-effect free, meaning all transactions have to be committed by the time the RPC call returns.

Brief change log

Retries commits synchronously a fixed number of times and then fails in notifyCheckpointCompleted.
Simplifies parts of the committable handling now that all committables of a subtask either succeed or fail.
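
To illustrate the first item, here is a minimal sketch of a bounded synchronous retry loop. It is not the code from this PR: the class `SyncRetrySketch`, the method `commitAll`, and the `maxRetries` parameter are hypothetical names, and a committable is modelled as a plain value whose commit attempt returns `true` on success.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sketch of bounded synchronous retries; not the Flink implementation. */
final class SyncRetrySketch {

    /**
     * Tries to commit every pending item and retries the failed ones up to
     * maxRetries additional times. Returns normally only if every item was
     * committed; otherwise it throws, which is where a real sink would fail
     * the notification instead of retrying in the background.
     */
    static <T> void commitAll(List<T> pending, Predicate<T> tryCommit, int maxRetries) {
        List<T> remaining = new ArrayList<>(pending);
        for (int attempt = 0; attempt <= maxRetries && !remaining.isEmpty(); attempt++) {
            // tryCommit returns true on success, so successful items are removed
            // and only the failed ones stay for the next attempt.
            remaining.removeIf(tryCommit);
        }
        if (!remaining.isEmpty()) {
            throw new IllegalStateException(
                    remaining.size() + " committable(s) still uncommitted after "
                            + maxRetries + " retries");
        }
    }
}
```

Because the loop either finishes with nothing left or throws, the caller never returns with commits still in flight, which is the property the change log relies on.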

Verifying this change

Already covered by many tests
Adjusted and changed tests in
api/connector/sink2
runtime/operators/sink/committables

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): (yes / no)
The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
The serializers: (yes / no / don't know)
The runtime per-record code paths (performance sensitive): (yes / no / don't know)
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / no / don't know)
The S3 file system connector: (yes / no / don't know)

Documentation

Does this pull request introduce a new feature? (yes / no)
If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

Use more of the assertj native patterns to compare results.

Global committers used to trail a full checkpoint behind the committer. That means that data appeared only after >2*checkpoint interval in the sinks that use one (e.g. delta). However, committables in the global committer are already part of the first checkpoint and are idempotent: on recovery, they are resent from the committer to the global committer. Thus, the global committer can be seen as stateless and doesn't need to conduct its own 2PC protocol. This commit lets the global committer collect all committables on input (as before) but try to commit as soon as it has received all of them (deducible from CommittableSummary, which was always the original intent of that message). Thus, in most cases, the global committer now ignores notifyCheckpointCompleted, as the state of the checkpoint can be inferred from the committables received from upstream. There are special cases where a global committer is directly chained to a writer. In this case, the global committer does need to conduct a 2PC protocol in place of the committer. To differentiate these cases, the global committer now has its own transformation.
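
The "commit as soon as everything has arrived" idea can be sketched as follows. This is a simplified illustration, not the operator from this PR: `GlobalCommitSketch`, `onSummary`, and `onCommittable` are hypothetical names, committables are plain strings, and a single summary per checkpoint is assumed (the real operator would have to aggregate summaries from several upstream subtasks).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical sketch: commit a checkpoint once all announced committables arrived. */
final class GlobalCommitSketch {

    /** Committables collected for one checkpoint. */
    private static final class PerCheckpoint {
        int expected = -1; // unknown until the summary arrives
        final List<String> received = new ArrayList<>();
    }

    private final Map<Long, PerCheckpoint> byCheckpoint = new HashMap<>();
    private final Consumer<List<String>> commitFn;

    GlobalCommitSketch(Consumer<List<String>> commitFn) {
        this.commitFn = commitFn;
    }

    /** The summary announces how many committables belong to the checkpoint. */
    void onSummary(long checkpointId, int numberOfCommittables) {
        PerCheckpoint state = byCheckpoint.computeIfAbsent(checkpointId, id -> new PerCheckpoint());
        state.expected = numberOfCommittables;
        commitIfComplete(checkpointId, state);
    }

    /** Collects one committable and commits if it was the last missing one. */
    void onCommittable(long checkpointId, String committable) {
        PerCheckpoint state = byCheckpoint.computeIfAbsent(checkpointId, id -> new PerCheckpoint());
        state.received.add(committable);
        commitIfComplete(checkpointId, state);
    }

    private void commitIfComplete(long checkpointId, PerCheckpoint state) {
        // No checkpoint notification is needed: completeness follows from the counts.
        if (state.expected >= 0 && state.received.size() == state.expected) {
            commitFn.accept(state.received);
            byCheckpoint.remove(checkpointId);
        }
    }
}
```

In this sketch nothing is kept once a checkpoint has been committed, which mirrors the observation that the global committer can be treated as stateless when upstream resends committables on recovery.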

Without unaligned checkpoints (UCs), a committer doesn't need to do anything in #processInput except collect committables. It emits only on notifyCheckpointCompleted (or endInput for batch). We can also harden some contracts:
* notifyCheckpointCompleted can assert that all committables have been received.
* Committables are emitted downstream only if all committables are finished.
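
A minimal sketch of these two contracts, under stated assumptions: `CollectingCommitterSketch` and its methods are hypothetical and only loosely mirror the operator callbacks named above, committables are plain strings, and the commit itself is a stand-in that always succeeds.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a collect-then-emit committer; not Flink's operator. */
final class CollectingCommitterSketch {
    private final Map<Long, Integer> expected = new HashMap<>();
    private final Map<Long, List<String>> collected = new HashMap<>();

    /** Stands in for the downstream output. */
    final List<String> emitted = new ArrayList<>();

    void processSummary(long checkpointId, int numberOfCommittables) {
        expected.put(checkpointId, numberOfCommittables);
    }

    /** Only collects; nothing is committed or emitted here. */
    void processInput(long checkpointId, String committable) {
        collected.computeIfAbsent(checkpointId, id -> new ArrayList<>()).add(committable);
    }

    void notifyCheckpointCompleted(long checkpointId) {
        List<String> received = collected.getOrDefault(checkpointId, new ArrayList<>());
        int want = expected.getOrDefault(checkpointId, 0);
        // Contract 1: all committables of the checkpoint must have arrived.
        if (received.size() != want) {
            throw new IllegalStateException(
                    "expected " + want + " committables but received " + received.size());
        }
        List<String> committed = new ArrayList<>();
        for (String committable : received) {
            committed.add(commit(committable));
        }
        // Contract 2: emit downstream only after every committable is finished.
        emitted.addAll(committed);
        collected.remove(checkpointId);
        expected.remove(checkpointId);
    }

    /** Placeholder for the real commit call (assumption: always succeeds). */
    private String commit(String committable) {
        return "committed:" + committable;
    }
}
```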

So far, sinks retried failed commits asynchronously to increase commit throughput in case of temporary issues. However, the contract of notifyCheckpointCompleted states that checkpoints must be side-effect free, meaning all transactions have to be committed by the time the RPC call returns. This commit retries a fixed number of times and then fails in notifyCheckpointCompleted. Note that synchronous retries significantly simplify the committable handling. This commit starts with a few simplifications; the next commit cleans up more.

Without the async parts of the committable summary, the number of pending committables will always be 0. The number of failed committables will also be 0: an unexpected failure throws an error, and an expected failure is silently ignored. The previous behavior, where these counts could be >0, actually led to infinite loops in the global committer.

flinkbot · 2024-10-18T13:35:23Z

CI report:

c4672fc Azure: FAILURE

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run azure re-run the last Azure build

AHeise and others added 7 commits October 16, 2024 12:33

[FLINK-36379] Refactor sink test assertions

90bed75

Use more of the assertj native patterns to compare results.

fixup! [FLINK-36379] Optimize global committers

1d91ea0

[FLINK-36379] Addressed Fabian's feedback

e71c4a4

flinkbot added the component=API/Core label Oct 18, 2024
