Add Kanji support #152

OuOu2021 · 2023-02-08T13:28:18Z

Before we start, I would like to clarify some terms. Kanji are the Japanese characters derived from Chinese characters. I will use "Chinese characters" as a collective name for Simplified Chinese characters, Traditional Chinese characters and Kanji.

It seems that Lingua identifies all Chinese characters as Chinese with a confidence value of 100 percent, which is not right. Some Kanji words are written exactly the same in Chinese (like 豆腐 (tofu) and 科学 (science)), while some Kanji are used in neither Simplified Chinese nor Traditional Chinese. For example, "economy" is written as "经济" in Simplified Chinese, "經濟" in Traditional Chinese and "経済" in Kanji, but Lingua 1.4 determines all three to be Chinese with 100% confidence.

This is not a big problem, since a slightly longer text such as a tweet in Japanese is likely to contain kana, which helps Lingua to distinguish it. But it is still incorrect to classify Kanji that are unambiguously used only in Japanese as 100% Chinese, so I have to point it out.

Also see greyblake/whatlang-rs/issues/122

OuOu2021 · 2023-02-08T13:32:06Z

経済: (Chinese, 1.0)
和製漢字: (Chinese, 1.0)
雫: (Chinese, 1.0)
労働: (Chinese, 1.0)
峠: (Chinese, 1.0)
勉強中: (Chinese, 1.0)
自動販売機: (Chinese, 1.0)

All of these are 100% Japanese words.

pemistahl · 2023-02-15T08:26:50Z

Hi @OuOu2021, thank you for reaching out to me. You can probably imagine how difficult it is to solve this problem. The language models I use for Chinese and Japanese are obviously insufficient for words such as your examples. Perhaps it would help to determine which characters are really unique to Chinese or Japanese and to extend the language models with this information. I will try to improve the library in this regard, but it may take significant time as the todo list is pretty long already.
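The idea of using characters that are unique to one language could be sketched roughly as follows. This is only an illustration and not Lingua's implementation: the names `Hint`, `JAPANESE_ONLY`, `CHINESE_ONLY` and `script_hint` are hypothetical, and the two character lists are hand-picked from the examples in this thread rather than taken from a vetted source.

```rust
/// Coarse hint derived from language-specific characters (hypothetical type).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Hint {
    Japanese,
    Chinese,
    Ambiguous,
}

// ASSUMPTION: tiny illustrative lists taken from the examples in this thread.
// A real list would have to be derived from a proper character database.
const JAPANESE_ONLY: &[char] = &['経', '済', '労', '働', '峠', '雫', '売'];
const CHINESE_ONLY: &[char] = &['经', '济', '經', '濟'];

/// Counts characters from each list and returns the side with more hits.
/// Text without any listed character, or with a tie, stays ambiguous.
fn script_hint(text: &str) -> Hint {
    let ja = text.chars().filter(|c| JAPANESE_ONLY.contains(c)).count();
    let zh = text.chars().filter(|c| CHINESE_ONLY.contains(c)).count();
    if ja > zh {
        Hint::Japanese
    } else if zh > ja {
        Hint::Chinese
    } else {
        Hint::Ambiguous
    }
}
```

With these lists, `script_hint("経済")` yields `Hint::Japanese`, `script_hint("经济")` yields `Hint::Chinese`, and a word shared by both languages such as `豆腐` stays `Hint::Ambiguous`, so it would still need the statistical models.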

RoDmitry · 2024-09-04T17:31:47Z

It looks like the Chinese model was trained on Traditional Chinese and doesn't understand Simplified Chinese well enough. It also looks like the Chinese model is very slow, so there is a hack that classifies any Han character it finds as "Chinese" unless the text also contains Japanese characters. But if you disable the crate feature = "chinese", then any Han character will be considered Japanese.
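The behaviour described in this comment can be sketched as a small rule. This is a reconstruction from the description only, not the crate's actual code: the function `shortcut` and its `chinese_enabled` parameter are hypothetical (the real switch is a Cargo feature, not a runtime flag), "Japanese characters" is assumed to mean kana, and only the main CJK Unified Ideographs block is treated as Han.

```rust
/// True for characters in the main CJK Unified Ideographs block.
fn is_han(c: char) -> bool {
    matches!(c, '\u{4E00}'..='\u{9FFF}')
}

/// True for characters in the Hiragana or Katakana blocks.
fn is_kana(c: char) -> bool {
    matches!(c, '\u{3040}'..='\u{30FF}')
}

/// Script-based shortcut as described above (hypothetical reconstruction).
/// Returns `None` when the text contains neither kana nor Han characters.
fn shortcut(text: &str, chinese_enabled: bool) -> Option<&'static str> {
    let has_kana = text.chars().any(is_kana);
    let has_han = text.chars().any(is_han);
    if has_kana {
        // Kana only occur in Japanese, so they settle the question.
        Some("Japanese")
    } else if has_han {
        // Han-only text: Chinese wins whenever Chinese is available.
        Some(if chinese_enabled { "Chinese" } else { "Japanese" })
    } else {
        None
    }
}
```

Under this rule a Kanji-only word such as `経済` comes out as "Chinese" whenever Chinese is enabled, which matches the results reported at the top of this issue.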

pemistahl added the enhancement New feature or request label Feb 15, 2023

michaelbennieUFL mentioned this issue Sep 26, 2024

Enhance Kanji Recognition in Japanese Language Detection #381

Open
