[FLINK-3154][runtime] Upgrade from Kryo v2 + Chill 0.7.6 to Kryo v5 w… #22660
…ith backward compatibility for existing savepoints and checkpoints.
What is the purpose of the change
To upgrade the primary Kryo library used by Flink from v2.x to v5.x, while providing backwards compatibility with existing savepoints and checkpoints. This PR adds a new Kryo v5 dependency that is namespaced so that it can coexist with the legacy dependencies, which are kept for compatibility purposes. Flink also depends on the Twitter chill (Scala) and chill-java libraries for additions and enhancements to Kryo v2.x. These would also be deprecated, since most of their functionality is included in Kryo v5.x. Some future version of Flink could eventually drop Kryo v2 and the Twitter chill dependencies when backwards compatibility with Kryo v2 based state is no longer needed.
Why upgrade Kryo? One reason is support for Java 17 and 21. The existing Kryo 2.x is not compatible with Java 17/21, while Kryo 5.x is. For example, ArraysAsListSerializer from chill-java fails when running on a JDK 17 runtime. A fix for that could be backported relatively easily, but there are more issues. Kryo 2.x doesn't support Java records at all.
A broader reason is that Flink should be using the current, actively maintained version of Kryo rather than the 10+ year old 2.x branch, which stopped receiving updates almost ten years ago. Kryo v2.x was released before Java 8, so it is quite old. Kryo is an actively maintained project, and there have been many improvements over the past ten years: Kryo 5.x has faster runtime performance and more memory-efficient serialization, fixes many bugs, adds functionality, and improves compatibility with newer versions of Java. Kryo 5.x will also get more improvements in the future that will be fully compatible with existing Kryo 5.x serialized data.
Brief change log
This is a large PR with a lot of surface area and risk. I tried to keep the scope of these changes as narrow and simple as possible. I copied all the Kryo v2 code to Kryo v5 equivalents and made necessary adjustments to get everything working. Some highlights as to what was done:
All existing serialization class names and package names are unmodified
Some Flink serialization code references serialization classes by full package name, so the package and class names of all existing serialization classes are unmodified.
Added new version 7 to KeyedBackendSerializationProxy
In production Flink, version 6 is the current version. This PR adds version 7. The only difference between 6 and 7 is the Kryo upgrade. With version 6 serialized data, all Kryo state is Kryo 2.x. With version 7 serialized data, all Kryo state is Kryo 5.x.
Added deserializeWithKeyedBackendVersion to TypeSerializer<T>
By default, this just calls the regular deserialize method. The Kryo 5.x version of the Flink KryoSerializer will check the version number. For versions older than 7, this will call the Kryo 2.x version of the Flink KryoSerializer class.
Kryo 2.x code is not used outside of backwards compatibility scenarios
New serialized state will not be written with Kryo 2.x code. Some unit tests are still using the Kryo 2.x code, but the main code base is only using Kryo 2.x code for reading legacy state.
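The version dispatch described above can be illustrated with a minimal, self-contained sketch. It is not the code in this PR: SketchSerializer is a hypothetical stand-in for TypeSerializer<T>, the two string serializers stand in for the Kryo 2.x and Kryo 5.x variants of KryoSerializer, and the method signature shown is an assumption. Only the method name deserializeWithKeyedBackendVersion and the version boundary 7 come from the description.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class VersionedDeserializationSketch {

    // First keyed backend version whose state is written in the new format (from the PR text).
    static final int FIRST_NEW_FORMAT_VERSION = 7;

    // Hypothetical stand-in for TypeSerializer<T>; only what the sketch needs.
    interface SketchSerializer<T> {
        void serialize(T value, DataOutputStream out) throws IOException;

        T deserialize(DataInputStream in) throws IOException;

        // Default behaviour: ignore the version and use the regular deserialize.
        default T deserializeWithKeyedBackendVersion(DataInputStream in, int version)
                throws IOException {
            return deserialize(in);
        }
    }

    // Stand-in for the legacy (Kryo 2.x) serializer: uses the writeUTF wire format.
    static final class LegacyStringSerializer implements SketchSerializer<String> {
        @Override
        public void serialize(String value, DataOutputStream out) throws IOException {
            out.writeUTF(value);
        }

        @Override
        public String deserialize(DataInputStream in) throws IOException {
            return in.readUTF();
        }
    }

    // Stand-in for the new (Kryo 5.x) serializer: int length prefix plus UTF-8 bytes.
    static final class CurrentStringSerializer implements SketchSerializer<String> {
        private final LegacyStringSerializer legacy = new LegacyStringSerializer();

        @Override
        public void serialize(String value, DataOutputStream out) throws IOException {
            byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
            out.writeInt(bytes.length);
            out.write(bytes);
        }

        @Override
        public String deserialize(DataInputStream in) throws IOException {
            byte[] bytes = new byte[in.readInt()];
            in.readFully(bytes);
            return new String(bytes, StandardCharsets.UTF_8);
        }

        // State written before version 7 is read with the legacy code path.
        @Override
        public String deserializeWithKeyedBackendVersion(DataInputStream in, int version)
                throws IOException {
            if (version < FIRST_NEW_FORMAT_VERSION) {
                return legacy.deserialize(in);
            }
            return deserialize(in);
        }
    }

    static byte[] write(SketchSerializer<String> serializer, String value) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        serializer.serialize(value, new DataOutputStream(buffer));
        return buffer.toByteArray();
    }

    static String read(SketchSerializer<String> serializer, byte[] bytes, int version)
            throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
        return serializer.deserializeWithKeyedBackendVersion(in, version);
    }

    public static void main(String[] args) throws IOException {
        CurrentStringSerializer current = new CurrentStringSerializer();
        byte[] legacyBytes = write(new LegacyStringSerializer(), "state");
        byte[] currentBytes = write(current, "state");
        System.out.println(read(current, legacyBytes, 6));
        System.out.println(read(current, currentBytes, 7));
    }
}
```

In this sketch the new serializer never writes the legacy format. It only delegates to the legacy reader when the version passed in is older than 7, which mirrors the read-only role of the Kryo 2.x code described above.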
Verifying this change
This PR passes the full automated test suite in the CI system, which covers many backwards compatibility scenarios.
Additionally, I wrote a Flink application to do a more thorough test of the Kryo upgrade that was difficult to convert into unit test form.
https://github.com/kurtostfeld/flink-kryo-upgrade-demo
If the Flink project is seriously considering accepting this PR, I plan to write more test scenarios to cover backwards compatibility more thoroughly.
Does this pull request potentially affect one of the following parts:
@Public(Evolving): yes. Kryo v2 APIs are deprecated. Parallel Kryo v5 APIs are created with PublicEvolving.