Add Kanji support #152

Open
OuOu2021 opened this issue Feb 8, 2023 · 3 comments
Labels
enhancement New feature or request

Comments

@OuOu2021

OuOu2021 commented Feb 8, 2023

Before we start, I would like to clarify some concepts. Kanji are Japanese characters derived from Chinese characters. I will use "Chinese characters" as a collective name for Simplified Chinese characters, Traditional Chinese characters and Kanji.

It seems that Lingua identifies all Chinese characters as Chinese with a confidence value of 100 percent, which is not right. Some Kanji words are written exactly the same in Chinese (like 豆腐 (tofu) and 科学 (science)), while some Kanji are used in neither Simplified Chinese nor Traditional Chinese. For example, "economy" is written as "经济" in Simplified Chinese, "經濟" in Traditional Chinese and "経済" in Kanji, yet Lingua 1.4 classifies all three as Chinese with 100% confidence.

This is not a big problem, because a slightly longer text, such as a tweet in Japanese, is likely to contain kana, which helps Lingua distinguish it. It is still incorrect, though, to classify Kanji that are unambiguously used only in Japanese as 100% Chinese, so I have to point it out.
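To illustrate why kana make longer Japanese text easy to separate, here is a minimal sketch of a kana check. It is not Lingua's code; the function name `contains_kana` is hypothetical, and the ranges are simply the Unicode Hiragana and Katakana blocks.

```rust
/// Returns true if `text` contains at least one hiragana or katakana character.
/// Hypothetical helper for illustration only, not part of Lingua.
fn contains_kana(text: &str) -> bool {
    text.chars().any(|c| {
        matches!(
            c,
            '\u{3040}'..='\u{309F}'   // Unicode Hiragana block
            | '\u{30A0}'..='\u{30FF}' // Unicode Katakana block
        )
    })
}
```

With this check, `contains_kana("勉強中です")` is true, while `contains_kana("経済")` is false, so a Kanji-only word gives a script-based check nothing to work with.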

Also see greyblake/whatlang-rs/issues/122

@OuOu2021
Author

OuOu2021 commented Feb 8, 2023

経済: (Chinese, 1.0)
和製漢字: (Chinese, 1.0)
雫: (Chinese, 1.0)
労働: (Chinese, 1.0)
峠: (Chinese, 1.0)
勉強中: (Chinese, 1.0)
自動販売機: (Chinese, 1.0)

They are all 100% Japanese words.

@pemistahl
Owner

Hi @OuOu2021, thank you for reaching out to me. You can probably imagine how difficult it is to solve this problem. The language models I use for Chinese and Japanese are obviously insufficient for words such as your examples. Perhaps it helps to determine which characters are really unique to Chinese or Japanese and to extend the language models with this information. I will try to improve the library in this regard but it may take significant time as the todo list is pretty long already.
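The idea of marking characters as unique to one language could look roughly like the sketch below. This is only an illustration under stated assumptions: the two character lists are tiny hand-picked samples taken from the examples in this thread, not a vetted data set, and the names `HanHint`, `JAPANESE_ONLY`, `CHINESE_ONLY` and `han_hint` are hypothetical.

```rust
#[derive(Debug, PartialEq)]
enum HanHint {
    Japanese,
    Chinese,
    Ambiguous,
}

// Assumption: sample characters from this thread that are used in Japanese
// but not in standard Chinese writing. A real list would need careful curation.
const JAPANESE_ONLY: &[char] = &['経', '済', '労', '働', '峠', '雫'];

// Assumption: sample Simplified and Traditional forms from this thread that
// are not used in modern Japanese orthography.
const CHINESE_ONLY: &[char] = &['经', '济', '經', '濟'];

/// Gives a hint based only on characters assumed to be unique to one language.
/// Text with no such characters, or with both kinds, stays ambiguous.
fn han_hint(text: &str) -> HanHint {
    let has_japanese = text.chars().any(|c| JAPANESE_ONLY.contains(&c));
    let has_chinese = text.chars().any(|c| CHINESE_ONLY.contains(&c));
    match (has_japanese, has_chinese) {
        (true, false) => HanHint::Japanese,
        (false, true) => HanHint::Chinese,
        _ => HanHint::Ambiguous,
    }
}
```

Under these assumptions, "経済" yields a Japanese hint and "经济" a Chinese hint, while a shared word such as "豆腐" stays ambiguous and would still have to be decided by the statistical models.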

@pemistahl added the "enhancement" (New feature or request) label on Feb 15, 2023
@RoDmitry

RoDmitry commented Sep 4, 2024

It looks like the Chinese model was trained on Traditional Chinese and does not understand Simplified Chinese well enough. The Chinese model also seems to be very slow, so there is a hack that treats any Han character it finds as "Chinese", unless the text also contains Japanese characters. But if you disable the crate feature = "chinese", then any Han character will be considered Japanese.
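The behavior described in this comment can be sketched as follows. This is a reading of the comment, not Lingua's actual source: the names `Guess`, `is_kana`, `is_han` and `guess_by_script` are hypothetical, "Japanese characters" is assumed to mean kana, and the `chinese_enabled` flag stands in for the crate feature.

```rust
#[derive(Debug, PartialEq)]
enum Guess {
    Chinese,
    Japanese,
}

/// Hiragana and Katakana blocks (assumed meaning of "Japanese characters").
fn is_kana(c: char) -> bool {
    matches!(c, '\u{3040}'..='\u{30FF}')
}

/// CJK Unified Ideographs plus Extension A; other extensions are omitted here.
fn is_han(c: char) -> bool {
    matches!(c, '\u{3400}'..='\u{4DBF}' | '\u{4E00}'..='\u{9FFF}')
}

/// Script-only shortcut: kana wins, otherwise Han falls back to Chinese,
/// or to Japanese when Chinese support is disabled.
fn guess_by_script(text: &str, chinese_enabled: bool) -> Option<Guess> {
    if text.chars().any(is_kana) {
        return Some(Guess::Japanese);
    }
    if text.chars().any(is_han) {
        return Some(if chinese_enabled {
            Guess::Chinese
        } else {
            Guess::Japanese
        });
    }
    None // no Han or kana: leave the decision to other logic
}
```

A shortcut like this never inspects which Han characters are present, which would explain why "経済" and "峠" come out as Chinese whenever Chinese support is enabled.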
