
Add test and incorporate char.tsv to improve syllable error rate #6

Open
wants to merge 3 commits into base: main
Conversation

AlienKevin

This PR features 2 additions:

  1. Adds a simple test using human-annotated jyutping sentences from words.hk. The test module test/test.py writes the correct and incorrect sentences to text files for inspection, and also reports a syllable error rate averaged over the entire corpus.
  2. Incorporates char.tsv into preprocess.py so that the most frequent default jyutpings for characters can override uncommon pronunciations in the jyut6ping3.dict.yaml file.
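The override in addition 2 can be sketched as a simple merge: entries from char.tsv (character → most frequent default jyutping) replace rarer single-character readings collected from jyut6ping3.dict.yaml. This is only an illustration of the idea; the function name `apply_default_readings` and the data layout are hypothetical, not the PR's actual code.

```python
def apply_default_readings(dict_entries, default_readings):
    """For single characters that have a default reading in char.tsv,
    replace the dictionary's reading list with just that default;
    multi-character words keep their original readings."""
    merged = {}
    for word, readings in dict_entries.items():
        if len(word) == 1 and word in default_readings:
            merged[word] = [default_readings[word]]
        else:
            merged[word] = readings
    return merged

dict_entries = {"好": ["hou2", "hou3"], "唔好": ["m4 hou2"]}
default_readings = {"好": "hou2"}
print(apply_default_readings(dict_entries, default_readings))
# {'好': ['hou2'], '唔好': ['m4 hou2']}
```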

Benefits

After addition 2, the syllable error rate decreased by almost 20% relative, from 7.33% to 5.88%.
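The syllable error rate behind these figures is word error rate computed over space-separated jyutping syllables: edit distance between the reference and hypothesis syllable sequences, divided by the reference length. A minimal self-contained sketch (the PR itself uses jiwer; the function below is an illustrative stand-in):

```python
def syllable_error_rate(reference, hypothesis):
    """Levenshtein distance over jyutping syllables, divided by the
    number of reference syllables (analogous to WER over words)."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost  # substitution
                            ))
        prev = curr
    return prev[-1] / len(ref)

print(syllable_error_rate("ngo5 oi3 nei5", "ngo5 ngoi3 nei5"))
# 0.333... (1 substitution / 3 reference syllables)
```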

Future work

Proper word segmentation, rather than longest-prefix matching, may be needed for more accurate handling of polyphonic characters.
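The longest-prefix-match strategy mentioned above greedily consumes the longest lexicon entry at each position, which is why polyphones can go wrong: the greedy choice fixes a reading without considering the rest of the sentence. A sketch of the idea, under the assumption of a simple word-to-jyutping lexicon (the function name and fallback behavior are illustrative, not the PR's actual code):

```python
def longest_prefix_annotate(sentence, lexicon):
    """Greedy longest-prefix match: at each position, consume the longest
    lexicon entry; fall back to emitting the raw character if nothing
    matches."""
    out, i = [], 0
    max_len = max(map(len, lexicon))
    while i < len(sentence):
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            word = sentence[i:i + length]
            if word in lexicon:
                out.append(lexicon[word])
                i += length
                break
        else:
            out.append(sentence[i])  # unknown character passes through
            i += 1
    return " ".join(out)

lexicon = {"唔好": "m4 hou2", "好": "hou2", "唔": "m4"}
print(longest_prefix_annotate("唔好", lexicon))  # "m4 hou2"
```

A real segmenter would instead score competing segmentations of the whole sentence, so a polyphone's reading can depend on the word it actually belongs to rather than on the longest match starting at its position.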

@@ -4,6 +4,7 @@
t2s = OpenCC('t2s').convert

os.system('wget -nc https://raw.githubusercontent.com/rime/rime-cantonese/5b6d334/jyut6ping3.dict.yaml')
Member


How much will the results be affected if we eliminate the downstream file and grab all files from upstream? Presuming we are going with this, let’s make upstream a submodule instead.

Member


I think a submodule would be hard to maintain.

Member


How so? Isn't a submodule just an SHA, as if we manually updated the link in this file regularly?

Member


I see. Then it may not be a problem.

@laubonghaudoi
Member

This PR looks quite big. I'd like to know where those txt files come from, and how accurate the transcriptions are.

@graphemecluster
Member

@laubonghaudoi The accuracy is exactly what's written under Benefits above.

Member

@graphemecluster graphemecluster left a comment


Where do those two *_base files come from? Are they just the un-normalized results?
Also, I'd prefer not to include generated files; as for the two results files, posting them here or sending them to us is enough.

Comment on lines +86 to +91
reference = remove_ng_onset(normalize_nei_to_ni(reference))
hypothesis = remove_ng_onset(normalize_nei_to_ni(hypothesis))
if reference == hypothesis or \
        diff_by_tone_only(reference, hypothesis) or \
        diff_by_a(reference, hypothesis):
    continue
Member

@graphemecluster graphemecluster Aug 20, 2023


I suggest letting jiwer perform these normalizations by passing them as the third and fourth parameters to jiwer.wer. Currently, sentences that differ only by tone or by -a appear in neither the correct nor the wrong sentences file.

Author


I filtered those sentences out because the differences are often just stylistic choices in the romanization, or sometimes personal preferences for 變調 (tone change). However, it's generally hard to tell whether a given difference is a true error or merely stylistic, so I think it's important to still count those differences when calculating the WER.

I view the output sentences as an overview that helps humans find and fix common error patterns. Since we already know that some stylistic differences do occur and are often false alarms, I think we can safely filter them out of the sentences files to help us focus on more pressing and easier-to-fix issues.

@graphemecluster
Member

If you have time, could you use ToJyutping.get_jyutping to tally which characters are mispronounced most often?

@laubonghaudoi
Member

No, what I don't understand is why several large txt files are being added to this package. Since they aren't data the program depends on, just benchmark material, why include them in the package? When I asked about accuracy, I meant: I see the txt files include both wrong and wrong_base, which would mean the data is human-annotated rather than this program's output? And since it's human-annotated, shouldn't there be an accuracy figure for the annotations?

@AlienKevin
Author

> When I asked about accuracy, I meant: I see the txt files include both wrong and wrong_base, which would mean the data is human-annotated rather than this program's output? And since it's human-annotated, shouldn't there be an accuracy figure for the annotations?

Sorry for the confusion. The base files are the output of the version before this PR while the non-base files are the output after this PR. Those files are meant to ease human inspection and shouldn't be packaged for a release. I'm not too familiar with how Python packages are released in general, so feel free to delete those files or put them into gitignore, etc.

@laubonghaudoi
Member

Then add them to .gitignore and delete them as well.

@AlienKevin
Author

AlienKevin commented Sep 8, 2023

> Then add them to .gitignore and delete them as well.

Ok, done.

4 participants