source-mysql: Decode text correctly in non-UTF8 character sets #1979

This is a new test case which exercises `text` columns with various collations / character sets, storing test strings with interesting characters from various languages. I believe the current set of collations and test strings is enough to demonstrate the known issues with our current handling of these edge cases:

- A `latin1` collation stores strings in the latin-1 character set, and then we cast those raw bytes to a string, which causes all of the non-ASCII characters to be replaced with U+FFFD.
- A `ucs2` collation stores strings in the UCS-2 DBCS, so for similar reasons all replicated values are terribly mangled.
- A `binary` collation is apparently captured as base64 bytes, because if you tell MySQL `TEXT COLLATE binary` it creates a `BLOB` column. This is not an error but seemed worth noting and including in the test here. The base64'd text appears to be a faithful base64 representation of the input as a UTF-8 string.
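To make the `latin1` failure mode concrete, here is a minimal standalone sketch, not the connector's actual code. The helper names `naiveCast` and `decodeLatin1` are hypothetical, and the decode assumes strict ISO-8859-1, whereas MySQL's `latin1` is closer to cp1252 and differs in the 0x80-0x9F range.

```go
package main

import (
	"fmt"
	"strings"
)

// naiveCast mimics treating raw column bytes as UTF-8. Invalid sequences
// end up as U+FFFD once the string is sanitized (for example during JSON
// serialization); ToValidUTF8 reproduces that effect here.
func naiveCast(bs []byte) string {
	return strings.ToValidUTF8(string(bs), "\uFFFD")
}

// decodeLatin1 converts ISO-8859-1 bytes to a UTF-8 string. Every byte
// maps to the Unicode code point with the same numeric value.
// Assumption: strict ISO-8859-1, not MySQL's cp1252-flavored latin1.
func decodeLatin1(bs []byte) string {
	var sb strings.Builder
	for _, b := range bs {
		sb.WriteRune(rune(b))
	}
	return sb.String()
}

func main() {
	raw := []byte{0x63, 0x61, 0x66, 0xE9} // "café" encoded as latin-1
	fmt.Printf("%q\n", naiveCast(raw))    // the 0xE9 byte becomes U+FFFD
	fmt.Printf("%q\n", decodeLatin1(raw)) // the accented character survives
}
```

ASCII-only values look identical under both paths, which is why this class of bug tends to surface only with non-English test strings.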

Modifies the column-discovery and primary-key-discovery queries to apply the same "not in a system schema" filtering that the table discovery query has, and then modifies column discovery so that the raw information is logged for each column.

This commit adds most of the necessary plumbing to keep track of the character set of a text column and apply a charset-aware decode function instead of just casting the bytes to a string. However, it does not actually implement the proper decoding: it still just does `var str = string(bs)` at the appropriate spot, with a TODO noting that this is wrong. The next commit will implement proper decoding.

Currently this doesn't work, and I've stubbed out the logic with a hard-coded default of `utf8mb4`, but now there's a test case which will show the corrected values once I fix that.

This works just fine, which is to be expected because we haven't changed anything and we already knew that latin-1 character set text columns work fine under backfills.

This more or less preserves the old behavior for any charsets we haven't thought to add to the decoders map, but logs an error so I can check for it in a few days and add any others we might be missing.
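A rough sketch of what such a decoders map with a logged fallback could look like. The names `decoders` and `decodeText`, the exact set of charsets, and the log wording are assumptions for illustration, and the `ucs2` entry assumes big-endian code units.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"log"
	"unicode/utf16"
)

// decoder turns raw column bytes into a UTF-8 Go string.
type decoder func([]byte) (string, error)

// decoders is a hypothetical charset-to-decoder table.
var decoders = map[string]decoder{
	"utf8mb4": func(bs []byte) (string, error) { return string(bs), nil },
	"latin1": func(bs []byte) (string, error) {
		// Assumption: strict ISO-8859-1, one byte per code point.
		runes := make([]rune, len(bs))
		for i, b := range bs {
			runes[i] = rune(b)
		}
		return string(runes), nil
	},
	"ucs2": func(bs []byte) (string, error) {
		// Assumption: big-endian 16-bit code units.
		if len(bs)%2 != 0 {
			return "", fmt.Errorf("ucs2 input has odd length %d", len(bs))
		}
		units := make([]uint16, len(bs)/2)
		for i := range units {
			units[i] = binary.BigEndian.Uint16(bs[2*i:])
		}
		return string(utf16.Decode(units)), nil
	},
}

// decodeText picks a decoder by charset name. An empty charset is treated
// as utf8mb4 so older metadata keeps working. Unknown charsets keep the
// old cast behavior but log an error so the gap is visible.
func decodeText(charset string, bs []byte) (string, error) {
	if charset == "" {
		charset = "utf8mb4"
	}
	dec, ok := decoders[charset]
	if !ok {
		log.Printf("error: no decoder for charset %q, passing bytes through unmodified", charset)
		return string(bs), nil
	}
	return dec(bs)
}

func main() {
	s, err := decodeText("ucs2", []byte{0x00, 0x68, 0x00, 0xE9})
	fmt.Printf("%q %v\n", s, err)
}
```

Falling back to the cast rather than returning an error means an unlisted charset degrades to the previous behavior instead of halting a capture.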

And use that collation when processing DDL alterations which don't explicitly specify another collation or charset.

In this case we want to apply the same "charset from collation" mapping function that we use during discovery. Now the hierarchy of column charsets goes:

1. Explicit CHARSET declaration
2. Explicit COLLATE declaration
3. Default for the table (which omits utf8mb4 in some cases)
4. Default to utf8mb4 as the last resort
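The precedence above can be sketched as a small function. Both `resolveCharset` and `charsetFromCollation` are hypothetical names, and deriving the charset from the prefix of the collation name is a simplifying assumption; the real mapping function may well be a lookup table.

```go
package main

import (
	"fmt"
	"strings"
)

// charsetFromCollation guesses the charset from a collation name by taking
// everything before the first underscore, e.g. "latin1_swedish_ci" gives
// "latin1". Assumption: this naming convention holds for the collations of
// interest; a name without an underscore (like "binary") is returned as-is.
func charsetFromCollation(collation string) string {
	name, _, _ := strings.Cut(collation, "_")
	return name
}

// resolveCharset applies the precedence: explicit CHARSET, then explicit
// COLLATE, then the table default, then utf8mb4 as the last resort.
// Empty strings mean "not specified".
func resolveCharset(explicitCharset, explicitCollation, tableDefault string) string {
	switch {
	case explicitCharset != "":
		return explicitCharset
	case explicitCollation != "":
		return charsetFromCollation(explicitCollation)
	case tableDefault != "":
		return tableDefault
	default:
		return "utf8mb4"
	}
}

func main() {
	fmt.Println(resolveCharset("", "ucs2_general_ci", "latin1"))
	fmt.Println(resolveCharset("", "", ""))
}
```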

It was always the plan to include the table charset in the table metadata unconditionally; I just left that part for a followup commit to keep the diffs separate. This removes the `"utf8mb4" -> ""` omission from the replication code so that metadata initialized after this change always states explicitly what charset a table is using. The default of `"" -> "utf8mb4"` still exists in the DDL alteration datatype translation, so the charset will always be explicit for newly added columns. Likewise, the string decoding function defaults `"" -> "utf8mb4"` so that old metadata works, but after this change we always specify charset information explicitly when generating new metadata.

Commits on Sep 24, 2024

Commits on Sep 30, 2024