Update collation list up to MySQL 8.0.26 #1410

Merged
merged 3 commits into from
Mar 15, 2022

Conversation

testn
Contributor

@testn testn commented Oct 12, 2021

I noticed that some collations are missing so I pulled up the latest collation list from MySQL by

mysql> select concat('exports.',UPPER(COLLATION_NAME),' = ',id,';')
    ->        FROM INFORMATION_SCHEMA.COLLATIONS
    ->        ORDER BY ID;
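
For readers without a MySQL shell at hand, the same formatting step can be sketched in JavaScript. This is only an illustration: `formatCollationExports` and the `sample` rows are hypothetical names, and the rows stand in for what INFORMATION_SCHEMA.COLLATIONS would return.

```javascript
// Hypothetical helper: turns rows shaped like INFORMATION_SCHEMA.COLLATIONS
// results into the `exports.NAME = id;` lines that the SQL above generates.
function formatCollationExports(rows) {
  return rows
    .slice() // copy so the caller's array is not reordered
    .sort((a, b) => a.ID - b.ID) // mirrors ORDER BY ID
    .map((row) => `exports.${row.COLLATION_NAME.toUpperCase()} = ${row.ID};`)
    .join('\n');
}

// Sample rows for illustration; a real run would use the full server list.
const sample = [
  { COLLATION_NAME: 'utf8mb4_general_ci', ID: 45 },
  { COLLATION_NAME: 'latin1_swedish_ci', ID: 8 },
  { COLLATION_NAME: 'utf8_general_ci', ID: 33 },
];

console.log(formatCollationExports(sample));
```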

@testn
Contributor Author

testn commented Oct 12, 2021

utf8_general50_ci and ucs2_general50_ci seem to be removed.

https://bugs.launchpad.net/percona-server/+bug/1163324

@sidorares
Owner

While you're at it, can you have a look at making utf8mb3 an alias for cesu-8? That should fix #1398, #1240, and #1333.

@sidorares
Owner

utf8_general50_ci and ucs2_general50_ci seem to be removed

Would it hurt to leave them? I'm not sure whether removing them might affect users connecting to server version 5.0.

@testn
Contributor Author

testn commented Oct 12, 2021

I think utf8mb3 should be aliased to utf8 for now. Aliasing it to cesu-8 will hurt performance significantly.

@sidorares
Owner

mysql utf8 strings are encoded as cesu-8, "real" utf8 is named utf8mb4 in mysql. utf8mb3 is same as utf8 ( that is, cesu-8 )

Plus, add 3 more mappings for utf8mb4: 307, 308, and 309.

@testn
Contributor Author

testn commented Oct 12, 2021

mysql utf8 strings are encoded as cesu-8, "real" utf8 is named utf8mb4 in mysql. utf8mb3 is same as utf8 ( that is, cesu-8 )

I think that is the future. Currently, it is still utf8mb3 (https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8.html)

Also, this seems to clarify the differences between cesu8 and utf8mb3: https://www.wikiwand.com/en/Talk:UTF-8

@@ -104,7 +104,7 @@ module.exports = [
'eucjpms',
'eucjpms',
'cp1250',
'utf8',
Contributor Author

This is utf16_unicode_ci. I don't think it should be encoded as utf-8 at all.
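
The point of this review comment can be sketched as a lookup from collation ID to the encoding used for decoding. This is a hypothetical miniature, not the project's actual table: `encodingByCollationId` and `encodingForCollation` are made-up names and the encoding labels are assumptions. Only the collation IDs are MySQL's.

```javascript
// Hypothetical, heavily abbreviated lookup from MySQL collation ID to the
// name of the encoding a client would use to decode strings.
const encodingByCollationId = new Map([
  [8, 'cp1252'], // latin1_swedish_ci
  [33, 'utf8'], // utf8_general_ci
  [45, 'utf8'], // utf8mb4_general_ci
  [101, 'utf16'], // utf16_unicode_ci: a UTF-16 collation, so not 'utf8'
]);

function encodingForCollation(id) {
  const encoding = encodingByCollationId.get(id);
  if (encoding === undefined) {
    throw new Error(`Unknown collation id: ${id}`);
  }
  return encoding;
}

console.log(encodingForCollation(101)); // utf16
```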

@sidorares
Owner

Let's tackle utf8 / utf8mb3 / cesu-8 separately. Thanks for the link, interesting read!

@testn
Contributor Author

testn commented Oct 12, 2021

I just tried to store 😂 into utf8_unicode_ci and it said

    code: 'ER_TRUNCATED_WRONG_VALUE_FOR_FIELD',
    errno: 1366,
    sqlState: 'HY000',
    sqlMessage: "Incorrect string value: '\\xF0\\x9F\\x98\\x82' for column 'field' at row 1",
    sql: "INSERT INTO `test-charset-encoding` (field) values('😂')"

@sidorares
Owner

I just tried to store 😂 into utf8_unicode_ci and it said

Is this good or bad? :) (i.e., is this a positive test for 5117e4f?)

@testn
Contributor Author

testn commented Oct 12, 2021

I just tried to store 😂 into utf8_unicode_ci and it said

Is this good or bad? :) (i.e., is this a positive test for 5117e4f?)

I think it's good. It shows that utf8mb3 cannot store non-BMP characters.
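
A client could check for such characters before sending a value to a utf8mb3 column. A minimal sketch, assuming a hypothetical helper named `hasNonBmp` (not part of any library):

```javascript
// Returns true if the string contains any code point above U+FFFF,
// i.e. a character outside the Basic Multilingual Plane (BMP).
function hasNonBmp(str) {
  // for...of iterates by code point, so surrogate pairs arrive as one item.
  for (const ch of str) {
    if (ch.codePointAt(0) > 0xffff) {
      return true;
    }
  }
  return false;
}

console.log(hasNonBmp('😂')); // true
console.log(hasNonBmp('é')); // false
```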

@sidorares
Owner

I believe I was able to insert non-BMP 💩 using the utf8 MySQL encoding (see #374 (comment)), so I'm not sure if that confirms "mysql utf8 = cesu-8", utf8mb3 = same as cesu-8 but only BMP is supported, utf8mb4 = "normal modern utf8".

@testn
Contributor Author

testn commented Oct 12, 2021

If you do it that way, you are converting 💩 into 6 invalid bytes (2 UTF-8 characters of 3 bytes each) in MySQL. MySQL may not be able to understand that character, though it may be able to pass it in and out as if it were two characters.

However, I believe that if you try to insert 💩 using the .query() method, you won't be able to do that, right?
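
To make the "6 bytes" concrete, here is a minimal sketch of CESU-8 encoding next to Node's standard UTF-8 encoding. `cesu8Encode` is a hypothetical helper written for illustration, not the encoder this project uses.

```javascript
// CESU-8 encodes each UTF-16 code unit on its own, so a surrogate pair
// becomes two 3-byte sequences instead of one 4-byte UTF-8 sequence.
function cesu8Encode(str) {
  const bytes = [];
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i); // UTF-16 code unit, surrogates included
    if (unit < 0x80) {
      bytes.push(unit);
    } else if (unit < 0x800) {
      bytes.push(0xc0 | (unit >> 6), 0x80 | (unit & 0x3f));
    } else {
      bytes.push(
        0xe0 | (unit >> 12),
        0x80 | ((unit >> 6) & 0x3f),
        0x80 | (unit & 0x3f)
      );
    }
  }
  return Buffer.from(bytes);
}

console.log(Buffer.from('😂', 'utf8').toString('hex')); // f09f9882
console.log(cesu8Encode('😂').toString('hex')); // eda0bdedb882
```

For strings that contain only BMP characters, the two encodings produce identical bytes; they differ only for characters that need a surrogate pair.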

@sidorares
Owner

unit test:

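const assert = require('assert');
// `common` is the repository's test helper; its require is omitted here.
const pileOfPoo = '💩';

const connection = common.createConnection({ charset: 'UTF8_GENERAL_CI' });
connection.query('select "💩"', (err, rows, fields) => {
  assert.ifError(err);
  // Both the column name and the value should round-trip unchanged.
  assert.equal(fields[0].name, pileOfPoo);
  assert.equal(rows[0][fields[0].name], pileOfPoo);
  connection.end();
});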

@testn
Contributor Author

testn commented Oct 12, 2021

Can you try to insert 💩 into a column with UTF8_GENERAL_CI?

@testn
Contributor Author

testn commented Oct 12, 2021

@sidorares re: column name, I think we should always use utf8 to decode the column name, regardless of the characterSet returned in Packet.FieldDefinition. We don't need to use cesu8 to decode it, as the column name should contain only BMP characters. So whether you use cesu8 or utf8, there won't be any difference.

@sidorares
Owner

Can a column name use other encodings (win1251 or koi8 for Cyrillic, as an example)? If yes, I'm not sure it is worth adding an exception or custom logic for the perf benefits.

@testn
Contributor Author

testn commented Jan 10, 2022

Let me check.

@ahmedbodi

any update on this?

@sidorares
Owner

@ahmedbodi I think the discussion got sidetracked a bit into a more complex issue. The updated collation list LGTM. Merging now.

@sidorares sidorares merged commit 810e6a4 into sidorares:master Mar 15, 2022
3 participants