Unicode escapes in the MLton backend #544

jiribenes · 2024-08-17T09:45:32Z

Moved from #542

Motivation

When using characters like "✓" and "✕", MLton reports:

Error: /home/runner/work/effekt/effekt/out/tests/effekt.mltests/lib_test.sml 391.69-391.76.
  String constant with character too large for type: #"\u2715".
    type: string
Error: /home/runner/work/effekt/effekt/out/tests/effekt.mltests/lib_test.sml 561.67-561.74.
  String constant with character too large for type: #"\u2713".
    type: string

Investigation

MLton's lexer source indicates support for \[0-9]{3}, \u[0-9A-F]{4} and \U[0-9A-F]{8} escapes: https://github.com/MLton/mlton/blob/680bfcc6d6d8df3e51220fd88d297830316b89b4/mlton/front-end/ml.lex#L446-L457. However, there are no real docs for it. The only thing I found suggests that multi-byte escapes should be escaped to single bytes (locked behind a flag), see http://www.mlton.org/SuccessorML#ExtendedTextConsts

The error itself is defined here: https://github.com/MLton/mlton/blob/680bfcc6d6d8df3e51220fd88d297830316b89b4/mlton/elaborate/elaborate-core.fun#L451-L464, and I think this indicates that we should:

1. either use a different string type on MLton which supports these large escapes (WideString is UTF-32 [?] as far as I understand, see http://mlton.org/Unicode),
2. or escape them byte-by-byte with the \[0-9]{3}-style syntax.

Solution

I think keeping strings UTF-8(-ish) is worth it, so I'd prefer solution 2 (byte-by-byte escaping), even though it's slightly more work on our part.

Here's the code that needs to change:

effekt/effekt/shared/src/main/scala/effekt/generator/ml/Transformer.scala, lines 637 to 638 in 08fc8fd:

    case (acc, c) if (c.isControl || c < ' ' || c > '~') =>
      acc ++= f"\\u${c.toInt}%04x"

I think that something like c.toString.getBytes("UTF-8") could be useful here to get a sequence of bytes, each of which then gets mapped to the \[0-9]{3} format.
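A minimal sketch of the byte-by-byte idea, not the actual code in Transformer.scala: the name escapeForMLton is hypothetical, and the handling of quotes and backslashes is an assumption about what the real escape function has to cover. It relies on SML's \ddd escape being a three-digit decimal character code.

```scala
import java.nio.charset.StandardCharsets

// Hypothetical helper (name and shape are assumptions, not Effekt's real API):
// escapes a string for an SML string literal with 8-bit characters.
def escapeForMLton(s: String): String = {
  val out = new StringBuilder
  // Encode the whole string at once so that surrogate pairs
  // (code points above U+FFFF) become proper 4-byte UTF-8 sequences.
  s.getBytes(StandardCharsets.UTF_8).foreach { b =>
    val code = b & 0xff // JVM bytes are signed, so mask to 0..255
    if (code == '"'.toInt) out ++= "\\\""        // assumed: quotes need escaping
    else if (code == '\\'.toInt) out ++= "\\\\"  // assumed: backslashes too
    else if (code < 0x20 || code > 0x7e) out ++= f"\\$code%03d" // \ddd, decimal
    else out += code.toChar                      // printable ASCII stays as-is
  }
  out.toString
}
```

With this sketch, "✓" (U+2713, UTF-8 bytes E2 9C 93) comes out as \226\156\147 and "✕" (U+2715) as \226\156\149. Encoding the whole string rather than calling c.toString.getBytes("UTF-8") per Char is a deliberate deviation from the suggestion above: a lone surrogate Char is not encodable on its own and would be replaced with '?'.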

Testing

It would be very valuable to have a few more tests for this behaviour and to check that such characters work on all of the different backends.


Fixes effekt-lang#544

Update the `escape` function in `Transformer.scala` to convert characters outside the ASCII range to byte-by-byte format.

* Modify the `escape` function to use `c.toString.getBytes("UTF-8")` to get a sequence of bytes for characters outside the ASCII range.
* Map each byte to the `\\[0-9]{3}` format required for byte-by-byte escaping.
* Add a new test file `MLTests.scala` to ensure that characters like '✓' and '✕' are correctly escaped to the `\\[0-9]{3}` format.
* Verify that the new implementation works on all different backends.

---

For more details, open the [Copilot Workspace session](https://copilot-workspace.githubnext.com/effekt-lang/effekt/issues/544?shareId=XXXX-XXXX-XXXX-XXXX).

jiribenes · 2024-10-01T16:02:33Z

MLton has been deprecated as of #616

jiribenes added the bug label (Something isn't working) Aug 17, 2024

jiribenes mentioned this issue Aug 17, 2024

Prettify the stdlib testing framework #542

Merged

1 task

jiribenes added the good first issue label (Good for newcomers) Aug 17, 2024

unnir mentioned this issue Aug 31, 2024

Fix Unicode escapes in the MLton backend #564

Closed

jiribenes closed this as completed Oct 1, 2024

jiribenes reopened this Oct 1, 2024

jiribenes closed this as not planned Oct 1, 2024
