Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unicode characters incorrectly escaped in encode #334

Open
samhh opened this issue Aug 16, 2020 · 3 comments
Open

Unicode characters incorrectly escaped in encode #334

samhh opened this issue Aug 16, 2020 · 3 comments
Assignees
Labels
bug Something isn't working pretty-printer Everything related to `Toml -> Text`

Comments

@samhh
Copy link

samhh commented Aug 16, 2020

At least, that's what I think is happening. In a REPL, the following will fail:

Toml.decode (Toml.text "k") $ Toml.encode (Toml.text "k") "ü"
-- Left [ParseError (TomlParseError {unTomlParseError = "1:8:\n  |\n1 | k = \"\\252\"\n  |        ^\nInvalid escape sequence: \\2\n"})]

Looking at the encoding, here's what we're given:

Toml.encode (Toml.text "k") "ü"
-- "k = \"\\252\"\n"

And passing that into decode will fail per the above:

Toml.decode (Toml.text "k") "k = \"\\252\"\n"
-- Left [ParseError (TomlParseError {unTomlParseError = "1:8:\n  |\n1 | k = \"\\252\"\n  |        ^\nInvalid escape sequence: \\2\n"})]

Looking at the TOML spec, it looks like Unicode characters should be encoded with a \u prefix. Modifying the string to contain an extra u0 allows encoding to succeed, and I think that's what we want given that's roughly the decimal output in an online Unicode converter:

Toml.decode (Toml.text "k") "k = \"\\u0252\"\n"
-- Right "\594"

But I'm pretty ignorant about character encoding and am honestly not sure if that's the right output. 😄

@chshersh chshersh added bug Something isn't working pretty-printer Everything related to `Toml -> Text` labels Aug 18, 2020
@chshersh
Copy link
Contributor

Hi @samhh, thanks for submitting the issue! This indeed looks like an unexpected behaviour 😞
tomland uses Text internally, and during encoding Text is printed using the show function. The show does the escaping of all characters. You can reproduce this behaviour even easier in GHCi:

λ: show "ü"
"\"\\252\""

It looks like some smarter handling of Unicode characters is required to preserve TOML semantics.

Relevant code to change is here:

valText (Text s) = showText s

A more interesting question is why such errors weren't caught by our property tests? 🤔 🤔 🤔

@dariodsa
Copy link
Collaborator

dariodsa commented Dec 8, 2020

I implemented the requested feature, now I only have to implement tests and figure it out why it wasn't caught by our test cases. Code needs some cleaning but it is working. :-)

@dariodsa
Copy link
Collaborator

dariodsa commented Dec 8, 2020

Our tests was ok, but they were testing something different. Text was generated with unicode characters but they were written like \u010d, but they weren't generate in its real form, č. Thank you @samhh for catching that error.
I will need to rewrite some tests but I will explain it in more details in PR.

@dariodsa dariodsa mentioned this issue Dec 9, 2020
dariodsa added a commit that referenced this issue Dec 19, 2020
chshersh pushed a commit that referenced this issue Dec 27, 2020
* [#334] parse and unparse tests

* removed parsing and unparing tests

* [#334] showUnicodeText

* [#334] escaping unicode character as well as regular characters

* [#334] resolved issue with escaping regular unescaped chars

* added tests, but they are not in use

* examples.hs revert to original content

* [#334] changes requested by chshersh
yongrenjie added a commit to yongrenjie/abbotsbury that referenced this issue May 9, 2021
The `tomland` library currently has some issues with Unicode characters,
which makes it not usable right now. In the future we could try to
migrate again but for now we should probably stick to YAML.

See kowainik/tomland#334
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working pretty-printer Everything related to `Toml -> Text`
Projects
None yet
Development

No branches or pull requests

3 participants