source-mysql: Decode text correctly in non-UTF8 character sets #1979
Commits on Sep 24, 2024
source-mysql: Add `TestUnicodeText` with various collations
This is a new test case which exercises `text` columns with various collations / character sets, storing some test strings with interesting characters from various languages. I believe the current set of collations and test strings is enough to demonstrate the known issues with our current handling of these edge cases:
- A `latin1` collation stores strings in the latin-1 character set, and then we cast those raw bytes to a string, which causes all of the non-ASCII characters to be replaced with U+FFFD.
- A `ucs2` collation stores strings in the UCS-2 DBCS, so for similar reasons all replicated values are terribly mangled.
- A `binary` collation is captured as base64 bytes, because apparently if you tell MySQL `TEXT COLLATE binary` it creates a `BLOB` column. This is not an error, but it seemed worth noting and including in the test here. The base64'd text appears to be a faithful base64 representation of the input as a UTF-8 string.
Commit 7321207
source-mysql: Make discovery debug logging more useful
Modifies the column-discovery and primary-key-discovery queries to apply the same "not in a system schema" filtering that the table discovery query has, and then modifies column discovery so that the raw information is logged for each column.
Commit 3baa39b
source-mysql: Discover text column character sets
This commit adds most of the necessary plumbing to keep track of the character set of a text column and apply a charset-aware decode function instead of just casting the bytes to a string. However, it does not actually implement the proper decoding: it still does `var str = string(bs)` at the appropriate spot, with a TODO noting that this is wrong. The next commit will actually implement proper decoding.
Commit b1c6075
Commit 4ec1ce5
source-mysql: Add test case of DDL adding non-UTF8 text column
Currently this doesn't work and I've stubbed out the logic with a hard-coded default of `utf8mb4` but now there's a test case which will show the corrected values when I fix that.
Commit c0befef
source-mysql: Add a test case of backfilling a latin1 text key
This works just fine, which is to be expected because we haven't changed anything and we already knew that latin-1 character set text columns work fine under backfills.
Commit 5e91609
Commit f560b57
source-mysql: Assume unknown charsets are UTF-8 compatible
This more or less preserves the old behavior for any charsets we haven't thought to add to the decoders map, but logs an error so I can check for it in a few days and add any others we might be missing.
Commit 1b52701
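The "decoders map" with a UTF-8 fallback could look roughly like the sketch below. This is a guess at the shape, not the actual implementation: the `decoders` variable, the `decodeText` function, and the specific charsets listed are assumptions, and the `latin1` entry uses the simplified ISO-8859-1 mapping rather than a full Windows-1252 table.

```go
package main

import (
	"fmt"
	"log"
)

// decoder converts raw column bytes in some charset to a Go (UTF-8) string.
type decoder func([]byte) string

// decodeUTF8 covers charsets whose bytes are already valid UTF-8.
func decodeUTF8(bs []byte) string { return string(bs) }

// decodeLatin1 assumes ISO-8859-1: each byte is the code point of that value.
func decodeLatin1(bs []byte) string {
	runes := make([]rune, len(bs))
	for i, b := range bs {
		runes[i] = rune(b)
	}
	return string(runes)
}

// decoders is a hypothetical registry keyed by MySQL charset name.
var decoders = map[string]decoder{
	"utf8mb4": decodeUTF8,
	"utf8mb3": decodeUTF8,
	"ascii":   decodeUTF8,
	"latin1":  decodeLatin1,
}

// decodeText looks up a charset-specific decoder. Unknown charsets are
// logged and then treated as UTF-8 compatible, preserving the old behavior.
func decodeText(charset string, bs []byte) string {
	if dec, ok := decoders[charset]; ok {
		return dec(bs)
	}
	log.Printf("error: unknown charset %q, assuming UTF-8 compatible", charset)
	return decodeUTF8(bs)
}

func main() {
	fmt.Println(decodeText("latin1", []byte{0x63, 0x61, 0x66, 0xE9}))
	fmt.Println(decodeText("some_unlisted_charset", []byte("plain ascii")))
}
```

Logging rather than failing keeps captures running for charsets nobody anticipated, at the cost of possibly mangled non-ASCII text until a decoder is added.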
source-mysql: Keep track of the default table collation
And use that collation when processing DDL alterations which don't explicitly specify another collation or charset.
Commit 55ca5a0
source-mysql: Fix DDL with COLLATE but not CHARSET
In this case we want to apply the same "charset from collation" mapping function that we use during discovery. Now the hierarchy of column charsets goes:
1. Explicit CHARSET declaration
2. Explicit COLLATE declaration
3. Default for the table (which omits utf8mb4 in some cases)
4. Default to utf8mb4 as the last resort
Commit 2dfcec0
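The four-level precedence for column charsets can be written as a small function. The sketch below is illustrative: `resolveCharset` and `charsetFromCollation` are hypothetical names, and deriving the charset by taking the collation name up to the first underscore is an assumption based on MySQL's collation naming convention (for example `latin1_swedish_ci`), not necessarily how the real mapping function works.

```go
package main

import (
	"fmt"
	"strings"
)

// charsetFromCollation assumes MySQL collation names start with the charset
// name followed by an underscore. A name without an underscore, such as
// "binary", is returned unchanged.
func charsetFromCollation(collation string) string {
	name, _, _ := strings.Cut(collation, "_")
	return name
}

// resolveCharset applies the precedence: explicit CHARSET, then explicit
// COLLATE, then the table default, then utf8mb4 as the last resort.
// Empty strings mean "not specified".
func resolveCharset(explicitCharset, explicitCollation, tableDefault string) string {
	switch {
	case explicitCharset != "":
		return explicitCharset
	case explicitCollation != "":
		return charsetFromCollation(explicitCollation)
	case tableDefault != "":
		return tableDefault
	default:
		return "utf8mb4"
	}
}

func main() {
	// ALTER TABLE ... ADD COLUMN v TEXT COLLATE latin1_swedish_ci
	fmt.Println(resolveCharset("", "latin1_swedish_ci", "ucs2"))
	// ALTER TABLE ... ADD COLUMN v TEXT, on a table whose default is ucs2
	fmt.Println(resolveCharset("", "", "ucs2"))
	// Nothing known at all
	fmt.Println(resolveCharset("", "", ""))
}
```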
Commits on Sep 30, 2024
-
source-mysql: Always include charset in new metadata
It was always the plan to include the table charset in the table metadata unconditionally; I just left that part for a follow-up commit to keep the diffs separate. This removes the `"utf8mb4" -> ""` omission from the replication code, so that metadata initialized after this change always makes explicit what charset a table is using. The default of `"" -> "utf8mb4"` still exists in the DDL alteration datatype translation, so the charset will always be explicit for newly added columns. Likewise, the string decoding function defaults `"" -> "utf8mb4"` so that old metadata keeps working. After this change, however, we always specify charset information explicitly when generating new metadata.
Commit 391d7c7