source-mysql: Decode text correctly in non-UTF8 character sets #1979

Merged 11 commits on Sep 30, 2024

Commits on Sep 24, 2024

  1. source-mysql: Add TestUnicodeText with various collations

    This is a new test case which exercises `text` columns with various
    different collations / character sets storing some test strings with
    interesting characters from various languages.
    
    I believe the current set of collations and test strings is enough
    to demonstrate the known issues with our current handling of these
    edge cases:
    
      - A `latin1` collation stores strings in the latin-1 character
        set and then we cast those raw bytes to a string which causes
        all of the non-ASCII characters to be replaced with U+FFFD.
      - A `ucs2` collation stores strings in the UCS-2 DBCS and so for
        similar reasons all replicated values are terribly mangled.
      - A `binary` collation is captured as base64 bytes, because
        if you tell MySQL `TEXT COLLATE binary` it apparently creates
        a `BLOB` column. This is not an error but seemed worth noting
        and including in the test here. The base64'd text appears to
        be a faithful base64 representation of the input as a UTF-8
        string.
    willdonnelly committed Sep 24, 2024
    7321207
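The `latin1` and `ucs2` failure modes described in this commit can be reproduced with the standard library alone. The following is a minimal sketch, not the connector's code: `decodeLatin1` and `decodeUCS2` are hypothetical helper names, the latin-1 mapping assumed here is plain ISO-8859-1 (MySQL's `latin1` is closer to Windows-1252, so bytes 0x80-0x9F would need a lookup table), and UCS-2 is assumed to be big-endian.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// decodeLatin1 (hypothetical helper) treats each byte as the Unicode code
// point of the same value, which is the ISO-8859-1 mapping. Assumption:
// bytes 0x80-0x9F are not given any Windows-1252 special handling.
func decodeLatin1(bs []byte) string {
	runes := make([]rune, len(bs))
	for i, b := range bs {
		runes[i] = rune(b)
	}
	return string(runes)
}

// decodeUCS2 (hypothetical helper) interprets the input as big-endian
// 16-bit code units. A trailing odd byte, if any, is ignored.
func decodeUCS2(bs []byte) string {
	units := make([]uint16, 0, len(bs)/2)
	for i := 0; i+1 < len(bs); i += 2 {
		units = append(units, uint16(bs[i])<<8|uint16(bs[i+1]))
	}
	return string(utf16.Decode(units))
}

func main() {
	latin1 := []byte{0x63, 0x61, 0x66, 0xE9} // "café" encoded as latin-1

	// Casting the raw bytes yields invalid UTF-8; converting to runes
	// shows the final byte turning into U+FFFD.
	fmt.Printf("%q\n", []rune(string(latin1)))

	// Decoding with the column's character set recovers the text.
	fmt.Println(decodeLatin1(latin1))

	ucs2 := []byte{0x00, 0x68, 0x00, 0x69} // "hi" encoded as UCS-2
	fmt.Println(decodeUCS2(ucs2))
}
```

The point of the sketch is only that the byte-to-string cast is correct for UTF-8 compatible columns and wrong for everything else, which is what the new test case is meant to expose.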
  2. source-mysql: Make discovery debug logging more useful

    Modifies the column-discovery and primary-key-discovery queries
    to apply the same "not in a system schema" filtering that the
    table discovery query has, and then modifies column discovery
    so that the raw information is logged for each column.
    willdonnelly committed Sep 24, 2024
    3baa39b
  3. source-mysql: Discover text column character sets

    This commit adds most of the necessary plumbing to keep track of
    the character set of a text column and apply a charset-aware
    decode function instead of just casting the bytes to a string.
    
    However, it does not actually implement the proper decoding. It
    still just does `var str = string(bs)` at the appropriate spot,
    with a TODO noting that's wrong. The next commit will come in and
    actually implement proper decoding.
    willdonnelly committed Sep 24, 2024
    b1c6075
  4. 4ec1ce5
  5. source-mysql: Add test case of DDL adding non-UTF8 text column

    Currently this doesn't work, and I've stubbed out the logic with
    a hard-coded default of `utf8mb4`, but now there's a test case
    which will show the corrected values when I fix that.
    willdonnelly committed Sep 24, 2024
    c0befef
  6. source-mysql: Add a test case of backfilling a latin1 text key

    This works just fine, which is to be expected because we haven't
    changed anything and we already knew that latin-1 character set
    text columns work fine under backfills.
    willdonnelly committed Sep 24, 2024
    5e91609
  7. f560b57
  8. source-mysql: Assume unknown charsets are UTF-8 compatible

    This more or less preserves the old behavior for any charsets we
    haven't thought to add to the decoders map, but logs an error so
    I can check for it in a few days and add any others we might be
    missing.
    willdonnelly committed Sep 24, 2024
    1b52701
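The fallback described in this commit might look roughly like the sketch below. Only the existence of a "decoders map" comes from the commit message; its shape, the `decodeFunc` type, the `decodeText` function, the set of registered charsets, and the log wording are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"log"
)

// decodeFunc is a hypothetical signature for a charset-aware decoder.
type decodeFunc func([]byte) string

// passthrough is correct for charsets whose bytes are already valid UTF-8.
func passthrough(bs []byte) string { return string(bs) }

// decodeLatin1 maps each byte to the code point of the same value
// (ISO-8859-1; an assumption, see MySQL's cp1252-like latin1).
func decodeLatin1(bs []byte) string {
	runes := make([]rune, len(bs))
	for i, b := range bs {
		runes[i] = rune(b)
	}
	return string(runes)
}

// decoders is an illustrative stand-in for the map the commit mentions.
var decoders = map[string]decodeFunc{
	"utf8mb4": passthrough,
	"utf8mb3": passthrough,
	"ascii":   passthrough,
	"latin1":  decodeLatin1,
}

// decodeText looks up a decoder by charset name. For an unknown charset it
// logs an error and falls back to the old cast-to-string behavior, on the
// assumption that the charset is UTF-8 compatible.
func decodeText(charset string, bs []byte) string {
	if dec, ok := decoders[charset]; ok {
		return dec(bs)
	}
	log.Printf("error: unknown charset %q, assuming UTF-8 compatible", charset)
	return string(bs)
}

func main() {
	fmt.Println(decodeText("latin1", []byte{0xE9}))      // decoded via the map
	fmt.Println(decodeText("not_a_charset", []byte("abc"))) // logged, then cast
}
```

Logging rather than failing keeps existing captures running while still leaving a trail that shows which charsets are missing from the map.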
  9. source-mysql: Keep track of the default table collation

    And use that collation when processing DDL alterations which
    don't explicitly specify another collation or charset.
    willdonnelly committed Sep 24, 2024
    55ca5a0
  10. source-mysql: Fix DDL with COLLATE but not CHARSET

    In this case we want to apply the same "charset from collation"
    mapping function that we use during discovery. Now the hierarchy
    of column charsets goes:
    
    1. Explicit CHARSET declaration
    2. Explicit COLLATE declaration
    3. Default for the table (which omits utf8mb4 in some cases)
    4. Default to utf8mb4 as the last resort
    willdonnelly committed Sep 24, 2024
    2dfcec0
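The four-level hierarchy listed in this commit can be sketched as a single resolution function. This is an illustration under stated assumptions: `resolveColumnCharset` and `charsetFromCollation` are hypothetical names, and the collation mapping shown here simply takes the prefix before the first underscore (for example `latin1_swedish_ci` gives `latin1`), which may differ from the mapping function the connector actually uses during discovery.

```go
package main

import (
	"fmt"
	"strings"
)

// charsetFromCollation (hypothetical) derives a charset name from a
// collation name by taking everything before the first underscore.
// A collation without an underscore, such as "binary", is returned as-is.
func charsetFromCollation(collation string) string {
	if i := strings.Index(collation, "_"); i > 0 {
		return collation[:i]
	}
	return collation
}

// resolveColumnCharset (hypothetical) applies the precedence order from
// the commit message. Empty strings mean "not specified".
func resolveColumnCharset(explicitCharset, explicitCollation, tableDefault string) string {
	switch {
	case explicitCharset != "":
		return explicitCharset // 1. explicit CHARSET declaration
	case explicitCollation != "":
		return charsetFromCollation(explicitCollation) // 2. explicit COLLATE
	case tableDefault != "":
		return tableDefault // 3. default for the table
	default:
		return "utf8mb4" // 4. last resort
	}
}

func main() {
	fmt.Println(resolveColumnCharset("latin1", "ucs2_general_ci", "utf8mb4"))
	fmt.Println(resolveColumnCharset("", "ucs2_general_ci", "latin1"))
	fmt.Println(resolveColumnCharset("", "", "latin1"))
	fmt.Println(resolveColumnCharset("", "", ""))
}
```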

Commits on Sep 30, 2024

  1. source-mysql: Always include charset in new metadata

    It was always the plan to include the table charset in the table
    metadata unconditionally; I just left that part for a followup
    commit to keep the diffs separate.
    
    This removes the `"utf8mb4" -> ""` omission from the replication
    code so that metadata initialized after this change always makes
    explicit what charset a table is using. The default of
    `"" -> "utf8mb4"` still exists in the DDL alteration datatype
    translation, so the charset will always be explicit for newly
    added columns. Likewise, the string decoding function defaults
    `"" -> "utf8mb4"` so that old metadata works. After this change
    we always specify charset information explicitly when generating
    new metadata.
    willdonnelly committed Sep 30, 2024
    391d7c7
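The interaction between old and new metadata described in this commit can be illustrated as follows. The `tableMetadata` struct, its `charset` JSON field, and the helper names are invented for this sketch and are not taken from the connector. The sketch shows only the two behaviors the message describes: new metadata always serializes the charset, and metadata written before the change, where the charset is absent, is read as `utf8mb4`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// tableMetadata is a hypothetical stand-in for per-table metadata.
// The field has no omitempty tag, so the charset is always serialized.
type tableMetadata struct {
	Charset string `json:"charset"`
}

// parseMetadata decodes stored metadata; a missing charset decodes as "".
func parseMetadata(raw string) (tableMetadata, error) {
	var m tableMetadata
	err := json.Unmarshal([]byte(raw), &m)
	return m, err
}

// encodeMetadata serializes metadata for storage.
func encodeMetadata(m tableMetadata) (string, error) {
	bs, err := json.Marshal(m)
	return string(bs), err
}

// effectiveCharset applies the "" -> "utf8mb4" default so that metadata
// written before the change keeps working.
func effectiveCharset(m tableMetadata) string {
	if m.Charset == "" {
		return "utf8mb4"
	}
	return m.Charset
}

func main() {
	old, _ := parseMetadata(`{}`) // metadata from before the change
	fmt.Println(effectiveCharset(old))

	fresh, _ := encodeMetadata(tableMetadata{Charset: "utf8mb4"})
	fmt.Println(fresh) // charset is written out even when it is utf8mb4
}
```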