Update collation list up to MySQL 8.0.26 #1410

testn · 2021-10-12T03:36:09Z

I noticed that some collations are missing, so I pulled the latest collation list from MySQL with:

mysql> select concat('exports.',UPPER(COLLATION_NAME),' = ',id,';')
    ->        FROM INFORMATION_SCHEMA.COLLATIONS
    ->        ORDER BY ID;
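
The same formatting can be reproduced on the client side. This is a minimal sketch, assuming rows shaped like the `INFORMATION_SCHEMA.COLLATIONS` result; `formatCollationExports` and `sampleRows` are hypothetical names used only for illustration, not part of mysql2:

```javascript
// Hypothetical helper: turn collation rows into `exports.NAME = id;` lines,
// ordered by ID like the SQL query above.
function formatCollationExports(rows) {
  return rows
    .slice() // copy so the caller's array is not reordered
    .sort((a, b) => a.ID - b.ID)
    .map((row) => `exports.${row.COLLATION_NAME.toUpperCase()} = ${row.ID};`);
}

// Illustrative sample only; a real run would pass the rows returned by the query.
const sampleRows = [
  { COLLATION_NAME: 'utf8mb4_general_ci', ID: 45 },
  { COLLATION_NAME: 'utf8_general_ci', ID: 33 },
];

console.log(formatCollationExports(sampleRows).join('\n'));
```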

testn · 2021-10-12T03:38:05Z

utf8_general50_ci and ucs2_general50_ci seem to be removed.

https://bugs.launchpad.net/percona-server/+bug/1163324

sidorares · 2021-10-12T03:41:22Z

While you're at it, can you have a look at making utf8mb3 an alias for cesu-8? That should fix #1398, #1240, and #1333.

sidorares · 2021-10-12T03:43:27Z

> utf8_general50_ci and ucs2_general50_ci seem to be removed

Would it hurt to leave them? I'm not sure whether removing them might affect users connecting to server version 5.0.

testn · 2021-10-12T03:51:09Z

I think utf8mb3 should be aliased to utf8 for now. Aliasing it to cesu-8 would hurt performance significantly.

sidorares · 2021-10-12T04:12:53Z

mysql utf8 strings are encoded as cesu-8, "real" utf8 is named utf8mb4 in mysql. utf8mb3 is same as utf8 ( that is, cesu-8 )

Plus add 3 more mappings for utf8mb4 307,308,309

testn · 2021-10-12T04:23:09Z

> mysql utf8 strings are encoded as cesu-8, "real" utf8 is named utf8mb4 in mysql. utf8mb3 is same as utf8 ( that is, cesu-8 )

I think that is the future. Currently, it is still utf8mb3 (https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8.html)

Also this seems to clarify the differences between cesu8 and utf8mb3 https://www.wikiwand.com/en/Talk:UTF-8

testn · 2021-10-12T04:28:04Z

lib/constants/charset_encodings.js

@@ -104,7 +104,7 @@ module.exports = [
  'eucjpms',
  'eucjpms',
  'cp1250',
-  'utf8',


This is utf16_unicode_ci. I don't think it should be encoded as utf-8 at all.

sidorares · 2021-10-12T05:04:55Z

Let's tackle utf8 / utf8mb3 / cesu-8 separately. Thanks for the link, interesting read!

testn · 2021-10-12T05:32:56Z

I just tried to store 😂 into utf8_unicode_ci and it said

    code: 'ER_TRUNCATED_WRONG_VALUE_FOR_FIELD',
    errno: 1366,
    sqlState: 'HY000',
    sqlMessage: "Incorrect string value: '\\xF0\\x9F\\x98\\x82' for column 'field' at row 1",
    sql: "INSERT INTO `test-charset-encoding` (field) values('😂')"

sidorares · 2021-10-12T09:38:41Z

> I just tried to store 😂 into utf8_unicode_ci and it said

Is this good or bad? :) (i.e., is this a positive test for 5117e4f?)

testn · 2021-10-12T09:57:10Z

> > I just tried to store 😂 into utf8_unicode_ci and it said
>
> Is this good or bad? :) (i.e., is this a positive test for 5117e4f?)

I think it's good. It shows that utf8mb3 cannot store non-BMP characters.
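
To see why the server rejects the value, it helps to look at the bytes. A minimal Node.js sketch (the helper name `describeChar` is made up for illustration) shows that 😂 lies outside the BMP and needs four bytes in UTF-8, matching the `\xF0\x9F\x98\x82` in the error above, whereas utf8mb3 stores at most three bytes per character:

```javascript
// Hypothetical helper, not part of mysql2: report the code point and the
// UTF-8 bytes of the first character in a string.
function describeChar(ch) {
  const codePoint = ch.codePointAt(0);
  const utf8 = Buffer.from(ch, 'utf8');
  return {
    codePoint: 'U+' + codePoint.toString(16).toUpperCase(),
    isBmp: codePoint <= 0xffff, // BMP = U+0000..U+FFFF
    utf8Hex: utf8.toString('hex'),
    utf8Length: utf8.length,
  };
}

console.log(describeChar('😂')); // non-BMP, four UTF-8 bytes
console.log(describeChar('é')); // BMP, two UTF-8 bytes
```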

sidorares · 2021-10-12T10:04:54Z

I believe I was able to insert the non-BMP character 💩 using the MySQL utf8 encoding (see #374 (comment)), so I'm not sure whether that confirms "mysql utf8 = cesu-8", utf8mb3 = same as cesu-8 but only BMP is supported, utf8mb4 = "normal modern utf8".

testn · 2021-10-12T10:11:46Z

If you do it that way, you are converting 💩 into 6 invalid bytes (2 UTF-8 characters) in MySQL. MySQL may not be able to understand that character. It may be able to pass it in and out as if it were two characters, though.

However, I believe that if you try to insert 💩 using the .query() method, you won't be able to do that, right?
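
The "6 bytes" can be made concrete. CESU-8 encodes each UTF-16 code unit separately, so a surrogate pair becomes two 3-byte sequences. This is only a sketch of the encoding itself, written for illustration; `encodeCesu8` is a hypothetical helper and is not how mysql2 does its conversion:

```javascript
// Hypothetical CESU-8 encoder: every UTF-16 code unit (including each half
// of a surrogate pair) is written out using the 1- to 3-byte UTF-8 patterns.
function encodeCesu8(str) {
  const bytes = [];
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit < 0x80) {
      bytes.push(unit);
    } else if (unit < 0x800) {
      bytes.push(0xc0 | (unit >> 6), 0x80 | (unit & 0x3f));
    } else {
      bytes.push(
        0xe0 | (unit >> 12),
        0x80 | ((unit >> 6) & 0x3f),
        0x80 | (unit & 0x3f)
      );
    }
  }
  return Buffer.from(bytes);
}

const pooCesu8 = encodeCesu8('💩'); // 6 bytes: two encoded surrogates
const pooUtf8 = Buffer.from('💩', 'utf8'); // 4 bytes: standard UTF-8
console.log(pooCesu8.toString('hex'), pooUtf8.toString('hex'));
```

For strings that contain only BMP characters the two encodings produce identical bytes, which is why the difference only shows up with characters such as 💩.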

sidorares · 2021-10-12T10:23:25Z

unit test:

node-mysql2/test/integration/connection/encoding/test-non-bmp-chars.js

Lines 9 to 15 in 694e100

    
    const connection = common.createConnection({ charset: 'UTF8_GENERAL_CI' });
    connection.query('select "💩"', (err, rows, fields) => {
      assert.ifError(err);
      assert.equal(fields[0].name, pileOfPoo);
      assert.equal(rows[0][fields[0].name], pileOfPoo);
      connection.end();
    });

testn · 2021-10-12T10:49:57Z

Can you try to insert 💩 into a column with UTF8_GENERAL_CI?

testn · 2021-10-12T11:04:24Z

@sidorares re: column name, I think we should always use utf8 to decode the column name, regardless of the characterSet returned in Packet.FieldDefinition. We don't need to use cesu8 to decode it, as the column name should contain only BMP characters. So whether you use cesu8 or utf8, there won't be any difference.
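
A small sketch of that argument, assuming the name bytes really are BMP-only (the variable names and the sample column name are illustrative, not taken from mysql2): Node's built-in utf8 decoder handles BMP-only bytes correctly whichever of the two encodings produced them, and only goes wrong on CESU-8 encoded surrogates.

```javascript
// BMP-only name: UTF-8 and CESU-8 produce the same bytes, so plain utf8
// decoding round-trips it.
const bmpName = 'имя_колонки';
const bmpBytes = Buffer.from(bmpName, 'utf8');
const decodedBmp = bmpBytes.toString('utf8');

// CESU-8 bytes for 💩 (two encoded surrogates). A strict utf8 decoder does
// not accept encoded surrogates, so this does not come back as 💩.
const cesu8PooBytes = Buffer.from([0xed, 0xa0, 0xbd, 0xed, 0xb2, 0xa9]);
const decodedPoo = cesu8PooBytes.toString('utf8');

console.log(decodedBmp === bmpName); // BMP-only name survives
console.log(decodedPoo === '💩'); // non-BMP CESU-8 does not
```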

sidorares · 2021-10-12T11:16:22Z

Can column names use other encodings (win1251 or koi8 for Cyrillic, for example)? If yes, I'm not sure it's worth adding an exception / custom logic for the perf benefit.

testn · 2022-01-10T02:44:46Z

Let me check.

ahmedbodi · 2022-03-15T21:56:02Z

any update on this?

sidorares · 2022-03-15T23:43:16Z

@ahmedbodi I think the discussion got sidetracked a bit into a more complex issue; the updated collation list LGTM. Merging now.

Update collation list up to MySQL 8.0.26

0849b69

Fix one incorrect mapping where utf-8 should be utf-16

5117e4f

Plus add 3 more mappings for utf8mb4 307,308,309

Add back UTF8_GENERAL50_CI

782ee67

sidorares merged commit 810e6a4 into sidorares:master Mar 15, 2022
