Enhance Kanji Recognition in Japanese Language Detection #381

Open
michaelbennieUFL wants to merge 9 commits into main

Conversation

@michaelbennieUFL (Author) commented on Sep 26, 2024:

Related Issues

Solution

This pull request improves language detection between Japanese and Chinese, specifically for texts containing Kanji characters that are common in Japanese. It introduces a new script called Japanese_Han (abbreviated JHAN), covering Kanji characters commonly used in Japan, and adjusts the detection logic to better distinguish Japanese from Chinese when only Kanji characters are present. While there is a slight decrease in accuracy for single- and dual-character inputs in the high-accuracy model, sentence-level accuracy remains at 100%, and Japanese sentences written entirely in Kanji are now detected correctly.

Context

In previous versions, the language detector struggled to differentiate between Japanese and Chinese texts that contained only Kanji characters. This issue arose because both languages share many Han characters (Kanji in Japanese), leading to incorrect classification of Japanese Kanji-only texts as Chinese with high confidence. For example, words like "経済" (economy), "労働" (labor), and "勉強中" (studying) are uniquely Japanese but were being detected as Chinese.

Technical Approach

1. Introducing Japanese_Han Script

  • Created a new script Japanese_Han (JHAN) that contains Kanji characters commonly used in Japanese.
  • The JHAN script is added to the src/script.rs file with ranges of Kanji characters specific to Japanese usage.
  • This script excludes Han characters that are unique to Chinese, focusing on those prevalent in Japanese texts (a rough sketch follows after this list).
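
For illustration only, a minimal standalone sketch of the idea behind Japanese_Han, assuming a simple membership check over a hand-picked character set. The names JAPANESE_ONLY_HAN and is_japanese_only_han are hypothetical; the real JHAN table in src/script.rs is range-based and far larger.

    // Hypothetical sketch: a tiny, hand-picked subset of Kanji that are in
    // practice Japanese-only, e.g. kokuji such as 峠, 畑, 塀, 雫 and
    // shinjitai forms such as 労, 経, 済. Not the actual JHAN table.
    const JAPANESE_ONLY_HAN: &[char] = &['峠', '畑', '塀', '雫', '労', '経', '済'];

    fn is_japanese_only_han(ch: char) -> bool {
        JAPANESE_ONLY_HAN.contains(&ch)
    }

    fn main() {
        assert!(is_japanese_only_han('峠'));  // kokuji, effectively Japanese-only
        assert!(!is_japanese_only_han('沒')); // Han character typical of Chinese, not in the subset
        println!("sketch ok");
    }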

2. Adjusting Detection Logic

  • Updated the JAPANESE_CHARACTER_SET in src/constant.rs to use the new Japanese_Han script:

    pub(crate) static JAPANESE_CHARACTER_SET: Lazy<CharSet> =
        Lazy::new(|| CharSet::from_char_classes(&["Hiragana", "Katakana", "Japanese_Han"]));
  • Modified the detect_language_with_rules method in src/detector.rs:

    • When a character is not matched by any of the one-language alphabets, it checks against JAPANESE_CHARACTER_SET and Alphabet::Han.
    • If the character matches JAPANESE_CHARACTER_SET, it increments the count for Japanese.
    • If the character matches Alphabet::Han, it increments the count for Chinese.
    • Adjusted the logic to handle uncertainty when both Chinese and Japanese counts are present, deciding based on the counts when low-accuracy mode is enabled (see the counting sketch after this list).
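
As a rough sketch of the per-character counting described above, with stand-in helpers and ranges rather than the actual code in detect_language_with_rules:

    use std::collections::HashMap;

    // Stand-in for JAPANESE_CHARACTER_SET (Hiragana, Katakana, Japanese_Han).
    fn matches_japanese_set(ch: char) -> bool {
        ('\u{3040}'..='\u{30FF}').contains(&ch) || ['峠', '労', '経', '済'].contains(&ch)
    }

    // Stand-in for Alphabet::Han (CJK Unified Ideographs).
    fn matches_han(ch: char) -> bool {
        ('\u{4E00}'..='\u{9FFF}').contains(&ch)
    }

    // Count per-character evidence for one word; returns true if the word is
    // ambiguous, i.e. it contains characters matching both sets.
    fn count_word(word: &str, counts: &mut HashMap<&'static str, u32>) -> bool {
        let mut ambiguous = false;
        for ch in word.chars() {
            let jp = matches_japanese_set(ch);
            let zh = matches_han(ch);
            if jp {
                *counts.entry("Japanese").or_insert(0) += 1;
            }
            if zh {
                *counts.entry("Chinese").or_insert(0) += 1;
            }
            ambiguous |= jp && zh;
        }
        ambiguous
    }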

3. Handling Uncertainty

  • Introduced a variable cjk_lang_uncertainty to track cases where it's challenging to distinguish between Chinese and Japanese.
  • If both languages are detected with high uncertainty, the detector compares the counts and returns the language with the higher count when is_low_accuracy_mode_enabled is true (see the sketch below).
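
And a sketch of the final decision, reusing the counting helper from the previous sketch. The names cjk_lang_uncertainty and cjk_lang_uncertainty_max_ratio mirror the PR's variable names, but the threshold value and surrounding structure here are illustrative only:

    // Decide between Chinese and Japanese once all words have been counted.
    // `cjk_lang_uncertainty` is the number of ambiguous words seen.
    fn resolve_cjk(
        counts: &HashMap<&'static str, u32>,
        cjk_lang_uncertainty: usize,
        word_count: usize,
        is_low_accuracy_mode_enabled: bool,
    ) -> Option<&'static str> {
        let cjk_lang_uncertainty_max_ratio = 0.6; // illustrative threshold
        let ratio = cjk_lang_uncertainty as f32 / word_count as f32;
        let japanese = counts.get("Japanese").copied().unwrap_or(0);
        let chinese = counts.get("Chinese").copied().unwrap_or(0);
        if ratio >= cjk_lang_uncertainty_max_ratio && is_low_accuracy_mode_enabled {
            // Too ambiguous for the rule alone: fall back to the raw counts.
            return Some(if japanese >= chinese { "Japanese" } else { "Chinese" });
        }
        None // defer to the statistical n-gram models
    }

    fn main() {
        let mut counts = HashMap::new();
        let mut uncertain_words = 0;
        let words = ["経済"]; // Kanji-only Japanese input from the test cases below
        for w in words {
            if count_word(w, &mut counts) {
                uncertain_words += 1;
            }
        }
        // Prints Some("Japanese") with this toy data.
        println!("{:?}", resolve_cjk(&counts, uncertain_words, words.len(), true));
    }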

Limits

  • Single Character Accuracy: The accuracy for single-character inputs has decreased slightly due to overlapping Kanji characters between Japanese and Chinese.
  • Dual Character Accuracy: Similar decrease observed in dual-character inputs for the same reasons.
  • Ambiguity with Shared Kanji: Some Kanji characters are used in both languages, making it inherently challenging to distinguish based solely on character analysis.
  • Complexity of Kanji Usage: The Japanese language occasionally uses Kanji characters that are uncommon or used differently in Chinese, and vice versa.

4. Test Cases and Accuracy Reports

  • Updated the accuracy reports to reflect the changes:
    • Slight decrease in accuracy for single-character and dual-character inputs in the high-accuracy model.
    • Sentence-level accuracy remains at 100%.
    • The detector can now identify Kanji-only Japanese sentences as Japanese.
  • Tested the updated detector with various Kanji-only Japanese texts:
Test Case 1: "勉強中" 
--------------------
  Chinese: 0.50
  Japanese: 0.50

Test Case 2: "沒"  (This is meant to be Chinese)
--------------------
  Chinese: 1.00

Test Case 3: "我是你的"  (This is meant to be Chinese)
--------------------
  Chinese: 1.00

Test Case 4: "労働"
--------------------
  Japanese: 1.00

Test Case 5: "御免"
--------------------
  Japanese: 0.55
  Chinese: 0.45

Test Case 6: "漢字"  (This can be both)
--------------------
  Chinese: 0.79
  Japanese: 0.21

Test Case 7: "桜"
--------------------
  Japanese: 1.00

Test Case 8: "峠" (It doesn't catch this one)
--------------------
  Chinese: 1.00

Test Case 9: "畑"
--------------------
  Japanese: 1.00
  Chinese: 0.00

Test Case 10: "塀"
--------------------
  Japanese: 1.00
  Chinese: 0.00

Test Case 11: "経済"
--------------------
  Japanese: 1.00
  Chinese: 0.00

Test Case 12: "和製漢字"  (This can be both)
--------------------
  Chinese: 0.78
  Japanese: 0.22

Test Case 13: "雫"
--------------------
  Japanese: 0.95
  Chinese: 0.05

Test Case 14: "労働"
--------------------
  Japanese: 1.00

Test Case 15: "豆腐"  (This can be both)
--------------------
  Chinese: 0.73
  Japanese: 0.27

Test Case 16: "自動販売機"
--------------------
  Japanese: 0.88
  Chinese: 0.12

Test Case 17: "関西国際空港"
--------------------
  Chinese: 0.59
  Japanese: 0.41

Test Case 18: "関西国际空港" (This is meant to be Chinese)
--------------------
  Chinese: 1.00

Test Case 19: "大阪" (This can be both)
--------------------
  Japanese: 0.80
  Chinese: 0.20

Test Case 20: "東京"  (This can be both)
--------------------
  Japanese: 0.53
  Chinese: 0.47

Test Case 21: "今日は"
--------------------
  Japanese: 1.00
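
For reference, output like the above can be reproduced with the public lingua-rs API roughly as follows; the exact confidence values depend on the bundled language models and on this branch's changes:

    use lingua::{Language::{Chinese, Japanese}, LanguageDetectorBuilder};

    fn main() {
        let detector = LanguageDetectorBuilder::from_languages(&[Chinese, Japanese]).build();
        for text in ["勉強中", "労働", "経済", "関西国際空港"] {
            println!("Test Case: \"{}\"", text);
            for (language, confidence) in detector.compute_language_confidence_values(text) {
                println!("  {:?}: {:.2}", language, confidence);
            }
        }
    }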

Updated the character set initialization to include "Japanese_Han" in constant.rs. Also added the corresponding JHAN constant in script.rs to support the new character set.
Updated the character set initialization to include "Japanese_Han" in constant.rs. Also added the corresponding JHAN constant in script.rs to support the new character set.
Increased CJK language uncertainty max ratio from 0.4 to 0.6. Added increment call for Chinese when both Chinese and Japanese are present. Removed premature return for Japanese and added a new decrement_counter method for future use.
Eliminated unnecessary `FromStr` import from `language.rs` and `Itertools` import from `model.rs`. These imports were not being used, thus removing them improves code cleanliness and reduces potential confusion.
The CJK_lang_uncertainty variables are renamed to cjk_lang_uncertainty for consistency in naming conventions. Additionally, adjusted cjk_lang_uncertainty_max_ratio for improved accuracy and removed the unused decrement_counter function from the code.
Revision updates show decreased accuracy figures for Chinese language detection across both high and low accuracy reports. The aggregated accuracy values were also modified accordingly to reflect the updated results.
Improved high and low accuracy statistics in Chinese accuracy reports. Corrected mappings for 'い' to 'あ', among other script character updates in `script.rs`.
Refactor code to correctly increment language counters and update logic for handling uncertainty between Chinese and Japanese languages. Adjust accuracy reports to reflect updated detection accuracy metrics.
@michaelbennieUFL (Author):

It also fixes these issues in the Python release:

pemistahl/lingua-py#231

pemistahl/lingua-py#202

@pemistahl (Owner):

Hi Michael, thank you very much for this PR. Finally there is someone who knows Chinese and Japanese well enough to help me distinguish them better. Awesome. :) As I'm planning to make a new release of my library in October, I will evaluate your changes soon and most likely merge them, eventually. Great work!

@pemistahl (Owner) left a comment:

Please add unit tests for the test cases you have listed in your PR.

&& total_language_counts.contains_key(&Some(Language::Chinese))
&& total_language_counts.contains_key(&Some(Language::Japanese))
&& (cjk_lang_uncertainty as f32 / words.len() as f32) >= cjk_lang_uncertainty_max_ratio
&& self.is_low_accuracy_mode_enabled
@pemistahl (Owner):

Why is this rule applied in low accuracy mode only? The rule engine should operate independently of the selected accuracy mode.

@michaelbennieUFL (Author):

It's because, in low-accuracy mode, lingua-rs doesn't run the n-gram model after detect_language_with_rules (I'm not sure that's entirely accurate). Regardless, by adding this case, many more Chinese words get recognized as Chinese in low-accuracy mode; otherwise they would be misidentified as unknown. If you want, I can move this logic to compute_language_confidence_values_for_languages.

@michaelbennieUFL (Author):

Where do you want me to add the unit tests?

@pemistahl (Owner):

Just add a new unit test method in file detector.rs. I think it's best to use a parameterized test method. Just take a look at the other test methods and do it analogously.
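
For example, such a test could look roughly like the sketch below, assuming the existing tests in detector.rs use rstest-style cases; the method name, detector setup, and chosen cases are only a suggestion and should follow the existing fixtures:

    // Assumes the test module already has `use rstest::*;` and access to
    // Language and LanguageDetectorBuilder, as the existing tests do.
    #[rstest]
    #[case("労働", Language::Japanese)]
    #[case("経済", Language::Japanese)]
    #[case("我是你的", Language::Chinese)]
    #[case("関西国际空港", Language::Chinese)]
    fn assert_kanji_only_text_is_detected_correctly(
        #[case] text: &str,
        #[case] expected_language: Language,
    ) {
        let detector =
            LanguageDetectorBuilder::from_languages(&[Language::Chinese, Language::Japanese])
                .build();
        assert_eq!(detector.detect_language_of(text), Some(expected_language));
    }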

src/detector.rs (outdated)
@@ -896,12 +905,28 @@ impl LanguageDetector {
     if total_language_counts.len() == 2
         && cfg!(feature = "chinese")
         && cfg!(feature = "japanese")
-        && total_language_counts.contains_key(&Some(Language::from_str("Chinese").unwrap()))
-        && total_language_counts.contains_key(&Some(Language::from_str("Japanese").unwrap()))
+        && total_language_counts.contains_key(&Some(Language::Chinese))
@pemistahl (Owner):

Please replace Language::Chinese with Language::from_str("Chinese") as it was before. The same goes for Japanese. Otherwise, the code won't compile if Chinese and / or Japanese are not among the selected language dependencies.
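
For clarity, this means restoring the feature-safe lookups from the original lines of the diff above:

    && total_language_counts.contains_key(&Some(Language::from_str("Chinese").unwrap()))
    && total_language_counts.contains_key(&Some(Language::from_str("Japanese").unwrap()))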

Replace direct enum references with `Language::from_str`. This change standardizes how language enums are handled, improving code readability and consistency, especially when additional languages are added.