Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-3154][runtime] Upgrade from Kryo v2 + Chill 0.7.6 to Kryo v5 w… #22660

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

kurtostfeld
Copy link

@kurtostfeld kurtostfeld commented May 25, 2023

…ith backward compatibility for existing savepoints and checkpoints.

What is the purpose of the change

To upgrade the primary Kryo library used by Flink from v2.x to v5.x, while providing backwards compatibility with existing savepoints and checkpoints. This PR adds a new Kryo v5 dependency that is namespaced so that it can coexist with the legacy dependencies that would be kept for compatibility purposes. Flink also depends on the Twitter chill (Scala) and chill-java libraries for additions + enhancements to Kryo v2.x. This would also be deprecated as most functionality is included in Kryo v5.x. Some future version of Flink could eventually drop Kryo v2 and the Twitter chill dependencies when backwards compatibility with Kryo v2 based state is no longer needed.

Why upgrade Kryo? One reason is support for Java 17 and 21. The existing Kryo 2.x is not compatible with Java 17/21, while Kryo 5.x is. When running in a JDK 17 runtime, I notice that ArraysAsListSerializer from chill-java fails under Java 17. Fixes can be back ported for that relatively easily, but there are more issues. Kryo 2.x doesn't support Java records at all.

A more broader reason is Flink should be using the new, actively maintained version of Kryo rather than the 10+ year old 2.x branch that stopped getting any updates almost ten years ago. Kryo v2.x was released before the release of Java 8. So it's quite old. Kryo is a maintained project, there have been lots of improvements over the past ten years. Kryo 5.x has faster runtime performance, more memory efficient serialization, fixed lots of bugs, added functionality, and improved compatibility with newer versions of Java. Kryo 5.x will also get more improvements in the future that will be fully compatible with existing Kryo 5.x serialized data.

Brief change log

This is a large PR with a lot of surface area and risk. I tried to keep the scope of these changes as narrow and simple as possible. I copied all the Kryo v2 code to Kryo v5 equivalents and made necessary adjustments to get everything working. Some highlights as to what was done:

All existing serialization class names and package names are unmodified

Some Flink serialization code references serialization classes by full package name, so the package and class names of all existing serialization classes are unmodified.

Added new version 7 to KeyedBackendSerializationProxy

In production Flink, version 6 is the current version. This PR adds version 7. The only difference between 6 and 7 is the Kryo upgrade. With version 6 serialized data, all Kryo state is Kryo 2.x. With version 7 serialized data, all Kryo state is Kryo 5.x

Added deserializeWithKeyedBackendVersion to TypeSerializer<T>

By default, this just calls the regular deserialize method. The Kryo 5.x version of Flink KryoSerializer will check the version number. For versions older than 7, this will call the Kryo 2.x version of the Flink KryoSerializer class.

Kryo 2.x Code Is not used outside of backwards compatibility scenarios

New serialized state will not be written with Kryo 2.x code. Some unit tests are still using the Kryo 2.x code, but the main code base is only using Kryo 2.x code for reading legacy state.

Verifying this change

This is passing the full test suite of automated tests in the CI system which covers lots of backwards compatibility scenarios.

Additionally, I wrote a Flink application to do a more thorough test of the Kryo upgrade that was difficult to convert into unit test form.

https://github.com/kurtostfeld/flink-kryo-upgrade-demo

If the Flink project is seriously considering accepting this PR, I plan to write more test scenarios for thorough backwards compatibility.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): yes. This adds a new Kryo v5 dependency and keeps legacy dependencies for backwards compatibility purposes.
  • The public API, i.e., is any changed class annotated with @Public(Evolving): yes. Kryo v2 APIs are deprecated. Parallel Kryo v5 APIs are created with PublicEvolving
  • The serializers: yes. This absolutely affects the serializers.
  • The runtime per-record code paths (performance sensitive): yes. This should be faster but I haven't done any benchmark testing.

@flinkbot
Copy link
Collaborator

flinkbot commented May 25, 2023

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

@kurtostfeld kurtostfeld force-pushed the kryo-pr branch 5 times, most recently from 67f2abf to cda7fe1 Compare May 26, 2023 04:26
@zentol
Copy link
Contributor

zentol commented May 26, 2023

I appreciate the work, but really this change has to go through a proper design process.

There are open questions as to whether we really want to expose Kryo in the same way we did before (because the presence of Kryo serializers in the execution config is a problem in general).

Please create a FLIP and publish it on the dev mailing list.

@kurtostfeld
Copy link
Author

kurtostfeld commented May 26, 2023

@zentol ok, I created a Confluence login with username "kurto", however I don't seem to have permissions to create a new FLIP. I don't see a "Create" button as the docs say:

To create your own FLIP, click on "Create" on the header and choose "FLIP-Template" other than "Blank page".

Could I get permissions to do this? Thank you :)

@MartijnVisser
Copy link
Contributor

Could I get permissions to do this? Thank you :)

You should have permissions by now

@nicknezis
Copy link

nicknezis commented Jul 17, 2023

FYI, there is a pull request in the works to upgrade Twitter Chill to use Kryo 5.5.0. I don't think this pull request is blocked by the Chill pull request, but might be relevant, so I'm sharing the link. twitter/chill#747

…ith backward compatibility for existing savepoints and checkpoints.
@nicknezis
Copy link

I added my comment on the Jira ticket, but also posting it here for better visibility.

I wanted to revisit the status of this ticket. There was some discussion in the mailing list about merging this into Flink 2.0. If the master branch now is targeting 2.0, maybe we can finally move forward with this FLIP?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants