Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unicode escapes in the MLton backend #544

Closed
jiribenes opened this issue Aug 17, 2024 · 1 comment
Closed

Unicode escapes in the MLton backend #544

jiribenes opened this issue Aug 17, 2024 · 1 comment
Labels
bug Something isn't working good first issue Good for newcomers

Comments

@jiribenes
Copy link
Contributor

jiribenes commented Aug 17, 2024

Moved from #542

Motivation

When using characters like "✓" and "✕", MLton reports:

Error: /home/runner/work/effekt/effekt/out/tests/effekt.mltests/lib_test.sml 391.69-391.76.
  String constant with character too large for type: #"\u2715".
    type: string
Error: /home/runner/work/effekt/effekt/out/tests/effekt.mltests/lib_test.sml 561.67-561.74.
  String constant with character too large for type: #"\u2713".
    type: string

Investigation

Here's the source for MLton's lexer which indicates support for \[0-9]{3}, \u[0-9A-F]{4} and \U[0-9A-F]{8} escapes: https://github.com/MLton/mlton/blob/680bfcc6d6d8df3e51220fd88d297830316b89b4/mlton/front-end/ml.lex#L446-L457 but there are no real docs for it, the only thing I found suggests that multi-byte escapes should be escaped to single-bytes (locked under a flag), see http://www.mlton.org/SuccessorML#ExtendedTextConsts

The error itself is defined here: https://github.com/MLton/mlton/blob/680bfcc6d6d8df3e51220fd88d297830316b89b4/mlton/elaborate/elaborate-core.fun#L451-L464, and I think this indicates that we should:

  1. either use a different string type on MLton which supports these large escapes (WideString is UTF-32 [?] as far as I understand http://mlton.org/Unicode)
  2. or escape them byte-by-byte with the \[0-9]{3}-style syntax.

Solution

I think keeping strings UTF-8(-ish) is worth it, so I'd prefer the solution 2, even though it's slightly more work on our part.

Here's the code that needs to change:

case (acc, c) if (c.isControl || c < ' ' || c > '~') =>
acc ++= f"\\u${c.toInt}%04x"

I think that something like c.toString.getBytes("UTF-8") could be useful here to get a sequence of bytes which then get mapped to the \[0-9]{3} format each.

Testing

It would be very valuable to have a few more test for this behaviour and check that such characters work on all of the different backends.

@jiribenes jiribenes added the bug Something isn't working label Aug 17, 2024
@jiribenes jiribenes added the good first issue Good for newcomers label Aug 17, 2024
unnir added a commit to unnir/effekt that referenced this issue Aug 31, 2024
Fixes effekt-lang#544

Update the `escape` function in `Transformer.scala` to convert characters outside the ASCII range to byte-by-byte format.

* Modify the `escape` function to use `c.toString.getBytes("UTF-8")` to get a sequence of bytes for characters outside the ASCII range.
* Map each byte to the `\\[0-9]{3}` format required for byte-by-byte escaping.
* Add a new test file `MLTests.scala` to ensure that characters like '✓' and '✕' are correctly escaped to the `\\[0-9]{3}` format.
* Verify that the new implementation works on all different backends.

---

For more details, open the [Copilot Workspace session](https://copilot-workspace.githubnext.com/effekt-lang/effekt/issues/544?shareId=XXXX-XXXX-XXXX-XXXX).
@jiribenes
Copy link
Contributor Author

MLton has been deprecated as of #616

@jiribenes jiribenes reopened this Oct 1, 2024
@jiribenes jiribenes closed this as not planned Won't fix, can't repro, duplicate, stale Oct 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working good first issue Good for newcomers
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant