Document and guarantee partial RFC 8785 compatibility for serialization #1197

Open
casey opened this issue Oct 11, 2024 · 0 comments
casey commented Oct 11, 2024

I'm using serde_json in an application which requires serialized JSON for the same value to have a consistent hash. This can be accomplished by using a JSON canonicalization scheme, such as RFC 8785.

Issue #309 requested the ability to opt-in to canonical JSON serialization, but was closed as outside the scope of this library.

I think this library is very close to producing canonical JSON, and where it doesn't, there are easy workarounds available.

I think it would be valuable to document this, and, if possible, guarantee it, so that users can rely on this partial compatibility. This would be for serialization only, since adding checks that deserialized JSON is canonical would be a big lift. Serialization-only compatibility is still very valuable: an application that only hashes JSON it produced itself with serde_json, and that follows the workarounds, can rely on that JSON being canonical.

To summarize RFC 8785:

  1. No inter-token whitespace allowed.
  2. Literals: null, true, and false are always serialized as null, true, and false.
  3. Strings: Characters with a dedicated escape (e.g. \n) are serialized using that escape. Other control characters in the range U+0000 through U+001F are serialized as \uhhhh, with lowercase hex digits, and all other characters are serialized as-is.
  4. Numbers: Here is where the spec gets crazy. It defers to ECMA-262 for number serialization. However, all that complexity is for floating point numbers; integers are serialized in the normal way.
  5. Object properties: Object properties must be sorted. Unfortunately, the spec requires that property names be sorted as arrays of UTF-16 code units.
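
To make the string rule (item 3) concrete, here is a minimal sketch of it in plain Rust. The function name escape_canonical is hypothetical and this is not how serde_json implements escaping; it only illustrates the rule as summarized above, using the standard library alone.

```rust
/// Hypothetical helper, not part of serde_json: renders `s` as a JSON string
/// literal following the string rules summarized above.
fn escape_canonical(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            // Characters with a dedicated two-character escape.
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\u{0008}' => out.push_str("\\b"),
            '\t' => out.push_str("\\t"),
            '\n' => out.push_str("\\n"),
            '\u{000C}' => out.push_str("\\f"),
            '\r' => out.push_str("\\r"),
            // Remaining control characters: \u plus four lowercase hex digits.
            c if c < '\u{0020}' => out.push_str(&format!("\\u{:04x}", c as u32)),
            // Everything else, including non-ASCII, is emitted unchanged.
            c => out.push(c),
        }
    }
    out.push('"');
    out
}
```

A helper like this could serve as a reference to compare serialized strings against when testing the edge cases.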

These rules yield a small number of workarounds that a current user of serde_json can follow to produce canonical JSON:

  1. Don't use pretty-printing.
  2. No workaround needed: literals are already serialized as their canonical representation.
  3. No workaround needed: strings are already serialized as their canonical representation.
  4. Don't use floats. Integers are already serialized as their canonical representation.
  5. When serializing a Value, don't enable the preserve_order feature, and don't use property names containing code points outside 0-127, which may sort differently in UTF-8 and UTF-16. When serializing with the derive macro, manually sort struct fields.
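
The sort-order caveat in workaround 5 can be illustrated with a short standard-library sketch. The helper names sort_keys_utf16 and orders_agree are hypothetical; the code compares Rust's default String ordering (UTF-8 bytes) with ordering by UTF-16 code units.

```rust
/// Hypothetical helper: sorts keys by their UTF-16 code units, the ordering
/// RFC 8785 requires for object property names.
fn sort_keys_utf16(keys: &mut [String]) {
    keys.sort_by(|a, b| a.encode_utf16().cmp(b.encode_utf16()));
}

/// Returns true when plain `String` ordering (UTF-8 bytes) and UTF-16 code
/// unit ordering produce the same sequence for this set of keys.
fn orders_agree(keys: &[&str]) -> bool {
    let mut by_bytes: Vec<String> = keys.iter().map(|k| k.to_string()).collect();
    let mut by_utf16 = by_bytes.clone();
    by_bytes.sort();
    sort_keys_utf16(&mut by_utf16);
    by_bytes == by_utf16
}
```

For ASCII keys such as "a", "aa", and "b" the two orderings agree. They diverge for a pair like U+E000 and U+10000: the UTF-8 bytes of U+E000 sort first, but U+10000 is encoded in UTF-16 as a surrogate pair starting with 0xD800, which sorts before 0xE000. Restricting keys to 0-127 is a conservative way to stay clear of such cases.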

I think these workarounds are actually pretty easy to follow, and being able to rely on partial RFC 8785 compatibility would be valuable.

So, I think my proposal would be, as a first step, to document and guarantee those places where this library is RFC 8785 compatible, in particular:

  • No inter-token whitespace is added when not pretty printing.
  • Serialized literals are guaranteed to be canonical.
  • Serialized strings are guaranteed to be canonical.
  • Serialized integers are guaranteed to be canonical.

This would be super valuable, at least to me, since the above guarantees would make it easy for me to produce canonical JSON.

In addition, tests should be added to ensure that these things are actually true and stay true. In particular, adding tests for arbitrary precision integers, which I'm not sure are canonical, and tests for all the string edge cases would be nice.
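
One edge case such tests could pin down, on the assumption that RFC 8785's reliance on IEEE 754 doubles (via ECMA-262) constrains integers as well: values beyond ±(2^53 - 1) are not all exactly representable as doubles, so ECMAScript may print different digits than a plain decimal integer formatter. JavaScript, for example, renders 18446744073709551615 (the largest u64) as 18446744073709552000. A minimal sketch of a range check, with the hypothetical name is_safe_integer:

```rust
/// Largest magnitude below which every integer is exactly representable as an
/// IEEE 754 double (2^53 - 1, JavaScript's Number.MAX_SAFE_INTEGER).
const MAX_SAFE_INTEGER: u128 = (1u128 << 53) - 1;

/// Hypothetical helper: true when `n` lies in the range where plain decimal
/// output and ECMAScript number serialization are expected to agree.
fn is_safe_integer(n: i128) -> bool {
    n.unsigned_abs() <= MAX_SAFE_INTEGER
}
```

Under that assumption, an application could reject or string-encode integers outside this range before serializing, and any guarantee about canonical integers might need to be scoped to it.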
