CJK characters modified by translation (even in input tab) #83

Open
seelebrn opened this issue Jul 2, 2023 · 2 comments
Labels
bug Something isn't working

Comments


seelebrn commented Jul 2, 2023

Hello!
As per the title, when using a model (here opus+bt-2021-04-30, multilingual) to translate a sentence (e.g. "【测试】哎呀?我的台本哪里去了?我现在应该说啥?") into English, I noticed that some characters (【 and 】) are modified even in the input tab.

【测试】哎呀?我的台本哪里去了?我现在应该说啥?

... becomes:

[测试]哎呀?我的台本哪里去了?我现在应该说啥?

So I suppose the translation cannot be exact. And indeed, the output is:

[Test] Oh? Where is my script? What should I say now?

Is it possible to fix this so that, for example, 【 and 】 are not transformed before generation?

Thanks!

TommiNieminen added the bug label Jul 3, 2023
TommiNieminen (Collaborator) commented

Hi,

It seems the OPUS MT model preprocessing converts the lenticular brackets (【 and 】) to standard square brackets, so the model only ever sees standard square brackets. This is probably an error: at least according to Wikipedia, lenticular brackets denote headings etc., i.e. they are not equivalent to standard brackets.
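For illustration, the effect described above can be reproduced with a small Python sketch. This is not the actual OPUS MT preprocessing script, which is not shown in this thread: the mapping table and the function names `normalize_brackets` and `changed_characters` are hypothetical and cover only the two characters reported in this issue.

```python
# Hypothetical stand-in for the normalization step. The real preprocessing
# script and its full character table are not shown in this issue.
BRACKET_MAP = str.maketrans({"【": "[", "】": "]"})


def normalize_brackets(text: str) -> str:
    """Replace lenticular brackets with ASCII square brackets."""
    return text.translate(BRACKET_MAP)


def changed_characters(before: str, after: str) -> list[tuple[str, str]]:
    """Return (original, replacement) pairs for every position that changed.

    Assumes a one-to-one character mapping, so both strings have equal length.
    """
    return [(a, b) for a, b in zip(before, after) if a != b]


source = "【测试】哎呀?我的台本哪里去了?我现在应该说啥?"
normalized = normalize_brackets(source)

print(normalized)                              # what the model would receive
print(changed_characters(source, normalized))  # which characters were rewritten
```

Because the substitution happens before the text reaches the model, the model has no way to tell the two bracket styles apart, and a post-hoc comparison like `changed_characters` only helps to spot which input characters were affected.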

This can be fixed only by retraining the models, so I'll make a note of this and hopefully we can modify the preprocessing script for the next training run.

-Tommi

seelebrn (Author) commented Jul 5, 2023

Thanks for answering! I hope this will indeed be possible.
