I dumped the English Wikipedia to build a custom scorer.
Here is the result with iconv:
+ build_lm.sh
+ '[' 1 = 1 ']'
+ OLD_LANG=C.UTF-8
+ export LANG=en_US.UTF-8
+ LANG=en_US.UTF-8
+ pushd /mnt/extracted/
/mnt/extracted ~
+ /home/trainer/en_custom/prepare_lm.sh
+ '[' '!' -f en/wiki_en_lower.txt ']'
+ curl -sSL 'https://gitlab.com/waser-technologies/data/lm/en/wiki-dump/-/raw/main/wiki.en.txt?inline=false'
+ tr '[:upper:]' '[:lower:]'
+ '[' 1 = 1 ']'
+ mv en/wiki_en_lower.txt en/wiki_en_lower_accents.txt
+ head -n 5 en/wiki_en_lower_accents.txt
beliefs on how to abolish the state also differ.
contemporary anarchists such as ward claim that state education serves to perpetuate socioeconomic inequality.
marxists state that this contradiction was responsible for their inability to act.
both positive feedback loops have long been recognized as important for global warming.
cloud albedo has substantial influence over atmospheric temperatures.
+ iconv -f utf-8 -t ascii//TRANSLIT//IGNORE
iconv: illegal input sequence at position 26095
{!} : Aborted
: Container exited with code 1.
Even if the iconv route works with our old French wiki dump, I'm fairly sure that a fresh dump made now would also hit an illegal input sequence.
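To see which character trips iconv, it can help to dump the bytes around the reported position. The following is only a sketch: bytes_at is a hypothetical helper, and it assumes the position iconv prints is a 0-based byte offset into the input it was given, which may not hold for every iconv build.

```shell
# Hypothetical helper (not part of prepare_lm.sh):
# print COUNT bytes of FILE as hex, starting at byte OFFSET.
# Assumption: OFFSET is 0-based, like the position in the iconv error.
bytes_at() {
    # $1 = file, $2 = 0-based byte offset, $3 = number of bytes (default 16)
    # tail -c +N starts at the N-th byte (1-based), hence the +1.
    tail -c +"$(( $2 + 1 ))" "$1" | head -c "${3:-16}" | od -An -tx1
}

# Example call, using the file and position from the log above:
#   bytes_at en/wiki_en_lower_accents.txt 26095 16
```

Bytes of 0x80 and above in the output belong to a non-ASCII character, which is the likely candidate for the sequence iconv refuses to convert.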
Use:
uni2ascii -q wiki_fr_lower_accents.txt > wiki_fr_lower.txt
Instead of line 19 of commonvoice-fr/DeepSpeech/fr/prepare_lm.sh (commit 5699e59).
https://billposer.org/Software/uni2ascii.html
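Whichever converter is used, the output can be checked before it is fed to the language model build. This is a minimal sketch, not part of the existing scripts: is_ascii is a hypothetical helper that treats a file as clean when it contains only printable ASCII and whitespace.

```shell
# Hypothetical helper (not part of prepare_lm.sh):
# succeed only if FILE contains nothing but printable ASCII and whitespace.
is_ascii() {
    # In the C locale, [:print:] and [:space:] cover only ASCII characters,
    # so any byte >= 0x80 (or a stray control byte) matches the negated class.
    ! LC_ALL=C grep -q '[^[:print:][:space:]]' "$1"
}

# Example call on the output file named in the command above:
#   is_ascii wiki_fr_lower.txt || echo "non-ASCII bytes remain" >&2
```

Running this check right after the conversion step makes a silent failure visible, instead of letting leftover non-ASCII bytes reach the scorer.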
Why?
Because iconv aborts with an illegal input sequence on the English Wikipedia dump, as the log above shows.