Improve ENGLISH_COMPATIBLE #165

Open
wasertech opened this issue Oct 19, 2022 · 0 comments

wasertech commented Oct 19, 2022

Use:

uni2ascii -q wiki_fr_lower_accents.txt > wiki_fr_lower.txt

Instead of:

iconv -f UTF-8 -t ASCII//TRANSLIT//IGNORE < wiki_fr_lower_accents.txt > wiki_fr_lower.txt

https://billposer.org/Software/uni2ascii.html
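For comparison, here is a minimal Python sketch of the same idea: fold accented text to ASCII without aborting on input it cannot handle. This is not what uni2ascii or iconv does internally. It is an illustration built on the standard `unicodedata` module, and the function names `to_ascii` and `fold_bytes` are made up for this example.

```python
import unicodedata


def to_ascii(text: str) -> str:
    """Fold accented characters to their ASCII base letters.

    NFKD splits a character such as 'é' into 'e' plus a combining accent;
    encoding with errors="ignore" then drops everything that is not ASCII.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", errors="ignore").decode("ascii")


def fold_bytes(raw: bytes) -> str:
    """Decode possibly malformed UTF-8, then fold to ASCII.

    Invalid byte sequences become U+FFFD during decoding and are dropped
    by to_ascii, so the conversion never aborts halfway through a file.
    """
    return to_ascii(raw.decode("utf-8", errors="replace"))
```

Note that this approach silently drops characters with no decomposition (for example `œ` or `ß`) instead of expanding them, so it is lossier than a real transliteration table.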

Why?

I dumped the English Wikipedia to build a custom scorer.

Here is the result with iconv:

+ build_lm.sh
+ '[' 1 = 1 ']'
+ OLD_LANG=C.UTF-8
+ export LANG=en_US.UTF-8
+ LANG=en_US.UTF-8
+ pushd /mnt/extracted/
/mnt/extracted ~
+ /home/trainer/en_custom/prepare_lm.sh
+ '[' '!' -f en/wiki_en_lower.txt ']'
+ curl -sSL 'https://gitlab.com/waser-technologies/data/lm/en/wiki-dump/-/raw/main/wiki.en.txt?inline=false'
+ tr '[:upper:]' '[:lower:]'
+ '[' 1 = 1 ']'
+ mv en/wiki_en_lower.txt en/wiki_en_lower_accents.txt
+ head -n 5 en/wiki_en_lower_accents.txt
beliefs on how to abolish the state also differ.
contemporary anarchists such as ward claim that state education serves to perpetuate socioeconomic inequality.
marxists state that this contradiction was responsible for their inability to act.
both positive feedback loops have long been recognized as important for global warming.
cloud albedo has substantial influence over atmospheric temperatures.
+ iconv -f utf-8 -t ascii//TRANSLIT//IGNORE
iconv: illegal input sequence at position 26095

     {!} : Aborted
         : Container exited with code 1.

Even though the iconv route works with our old French wiki dump, I'm fairly sure that a fresh dump made today would also hit an illegal input sequence.
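To check whether a dump contains bytes that are not valid UTF-8, something like the sketch below could help. It assumes the failure comes from malformed UTF-8 in the input, which the log alone does not prove, and the offset it reports is not guaranteed to match the position printed by iconv. The helper name `first_invalid_utf8_offset` is hypothetical.

```python
from typing import Optional


def first_invalid_utf8_offset(raw: bytes) -> Optional[int]:
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the index of the first byte that could not be decoded.
        return err.start
    return None


# In-memory demonstration: 0xFF can never appear in valid UTF-8.
sample = b"state education \xff serves"
offset = first_invalid_utf8_offset(sample)
```

For a real file, read it in binary mode (`open(path, "rb").read()`) and pass the bytes to the helper; printing a slice around the returned offset shows what the offending sequence looks like.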
