alexandrainst · saattrupdan · Sep 27, 2024 · Sep 24, 2024
diff --git a/MODEL_315M_README.md b/MODEL_315M_README.md
@@ -3,6 +3,8 @@
 This is a Danish state-of-the-art speech recognition model, trained by [the Alexandra
 Institute](https://alexandra.dk/).
 
+Try it out in [our interactive demo](https://huggingface.co/spaces/alexandrainst/roest-demo)!
+
 
 ## Quick Start
 Start by installing the required libraries:

diff --git a/MODEL_315M_XLSR_README.md b/MODEL_315M_XLSR_README.md
@@ -0,0 +1,160 @@
+# Røst-315m
+
+This is a Danish state-of-the-art speech recognition model, trained by [the Alexandra
+Institute](https://alexandra.dk/).
+
+Try it out in [our interactive demo](https://huggingface.co/spaces/alexandrainst/roest-demo)!
+
+
+## Quick Start
+Start by installing the required libraries:
+
+```shell
+$ pip install transformers kenlm pyctcdecode
+```
+
+Next, run the model with the `transformers` Python package as follows:
+
+```python
+>>> from transformers import pipeline
+>>> audio = get_audio()  # 16kHz raw audio array
+>>> transcriber = pipeline(model="alexandrainst/roest-315m")
+>>> transcriber(audio)
+{'text': 'your transcription'}
+```
+
+
+## Evaluation Results
+
+We have evaluated both our and existing models on the CoRal test set as well as the
+Danish Common Voice 17 test set. To ensure as robust an evaluation as possible, we have
+bootstrapped the results 1000 times and report here the mean scores along with a 95%
+confidence interval (lower is better; best scores in **bold**, second-best in
+*italics*):
+
+| Model | Number of parameters | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) WER | [Danish Common Voice 17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0/viewer/da/test) CER | [Danish Common Voice 17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0/viewer/da/test) WER |
+|:---|---:|---:|---:|---:|---:|
+| Røst-315m (this model) | 315M | **6.6%** | **17.0%** | 6.6% ± 0.6% | 16.7% ± 0.8% |
+| [chcaa/xls-r-300m-danish-nst-cv9](https://hf.co/chcaa/xls-r-300m-danish-nst-cv9) | 315M | 14.4% ± 0.3% | 36.5% ± 0.6% | **4.1% ± 0.5%** | **12.0% ± 0.8%** |
+| [mhenrichsen/hviske](https://hf.co/mhenrichsen/hviske) | 1540M | 14.2% ± 0.5% | 33.2% ± 0.7% | *5.2% ± 0.4%* | *14.2% ± 0.8%* |
+| [openai/whisper-large-v3](https://hf.co/openai/whisper-large-v3) | 1540M | *11.4% ± 0.3%* | *28.3% ± 0.6%* | *5.5% ± 0.4%* | *14.8% ± 0.8%* |
+| [openai/whisper-large-v2](https://hf.co/openai/whisper-large-v2) | 1540M | 13.9% ± 0.9% | 32.6% ± 1.2% | 7.2% ± 0.5% | 18.5% ± 0.9% |
+| [openai/whisper-large](https://hf.co/openai/whisper-large) | 1540M | 14.5% ± 0.3% | 35.4% ± 0.6% | 9.2% ± 0.5% | 22.9% ± 1.0% |
+| [openai/whisper-medium](https://hf.co/openai/whisper-medium) | 764M | 17.2% ± 1.3% | 40.5% ± 2.1% | 9.4% ± 0.5% | 24.0% ± 1.0% |
+| [openai/whisper-small](https://hf.co/openai/whisper-small) | 242M | 23.4% ± 1.2% | 55.2% ± 2.3% | 15.9% ± 1.0% | 38.9% ± 1.2% |
+| [openai/whisper-base](https://hf.co/openai/whisper-base) | 73M | 43.5% ± 3.1% | 89.3% ± 4.6% | 33.4% ± 4.7% | 71.4% ± 7.0% |
+| [openai/whisper-tiny](https://hf.co/openai/whisper-tiny) | 38M | 52.0% ± 2.5% | 103.7% ± 3.5% | 42.2% ± 3.9% | 83.6% ± 2.7% |
+
+
+### Detailed Evaluation Across Demographics on the CoRal Test Set
+
+![CER comparison plot](https://filedn.com/lRBwPhPxgV74tO0rDoe8SpH/coral/roest-xlsr-comparison-cer-plot.png)
+![WER comparison plot](https://filedn.com/lRBwPhPxgV74tO0rDoe8SpH/coral/roest-xlsr-comparison-wer-plot.png)
+
+
+## Training Data
+
+This model is the result of three different stages of training:
+
+  1. "Pretraining" on 436,000 hours of unlabelled multilingual publicly available data,
+     13,628 hours of which are Danish. Pretraining here means that the model learnt to
+     "fill in" gaps of raw audio; no transcriptions were used (or available) during
+     this process. The pretraining data is distributed as follows:
+     - 372,000 hours from [VoxPopuli](https://aclanthology.org/2021.acl-long.80/), being
+       speeches from the European Parliament in 23 European languages.
+       This includes 13,600 hours of Danish speech.
+     - 51,000 hours from [Multilingual
+       LibriSpeech](https://doi.org/10.21437/Interspeech.2020-2826), being audiobooks in
+       8 European languages. This does not include any Danish speech.
+     - 7,000 hours from [Common Voice 6](https://doi.org/10.48550/arXiv.1912.06670),
+       being read-aloud speech in 60 diverse languages. This does not include any Danish
+       speech.
+     - 6,600 hours from [VoxLingua107](https://doi.org/10.1109/SLT48900.2021.9383459),
+       being audio from YouTube videos in 107 languages. This includes 28 hours of
+       Danish speech.
+     - 1,000 hours from [BABEL](https://eprints.whiterose.ac.uk/152840/), being
+       conversational telephone speech in 17 African and Asian languages. This does not
+       include any Danish speech.
+  2. "Finetuning" on 373 hours of labelled Danish publicly available data. "Finetuning"
+     indicates that this stage of training was supervised, i.e. the model was trained on
+     both audio and transcriptions to perform the speech-to-text task (also known as
+     automatic speech recognition). The finetuning data is as follows:
+     - The read-aloud training split of the [CoRal
+       dataset](https://huggingface.co/datasets/alexandrainst/coral) (revision
+       fb20199b3966d3373e0d3a5ded2c5920c70de99c), consisting of 361 hours of Danish
+       read-aloud speech, diverse across dialects, accents, ages and genders.
+     - The Danish training split of the [Common Voice 17
+       dataset](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0),
+       consisting of 12 hours of Danish read-aloud speech.
+  3. An n-gram language model has been trained separately, and is used to guide the
+     transcription generation of the finetuned speech recognition model. This n-gram
+     language model has been trained on the following datasets:
+     - [Danish
+       Wikipedia](https://huggingface.co/datasets/alexandrainst/scandi-wiki/viewer/da)
+       (approximately 287,000 articles).
+     - [Danish Common Voice 17 training
+       split](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0/viewer/da)
+       (approximately 3,500 sentences).
+     - [Danish
+       Reddit](https://huggingface.co/datasets/alexandrainst/scandi-reddit/viewer/da)
+       (approximately 5 million comments).
+     Note that all samples from the CoRal test dataset have been removed from all of
+     these datasets, to ensure that the n-gram model has not seen the test data.
+
+The first step was carried out by [Babu et al.
+(2021)](https://doi.org/10.48550/arXiv.2111.09296) and the second and third steps by
+[Nielsen et al. (2024)](https://huggingface.co/alexandrainst/roest-315m).
+
+The final product is the combination of the finetuned model and the n-gram model, and
+this combination is what you get when you load the model as shown in the Quick Start
+section above.
+
+
+## Intended use cases
+
+This model is intended to be used for Danish automatic speech recognition.
+
+Note that biometric identification is not allowed using the CoRal dataset and/or
+derived models. For more information, see addition 4 in our
+[license](https://huggingface.co/datasets/alexandrainst/roest-315m/blob/main/LICENSE).
+
+
+## Why the name Røst?
+
+Røst is both the [Danish word for the human
+voice](https://ordnet.dk/ddo/ordbog?query=r%C3%B8st) and the name of [one of the
+cold-water coral reefs in
+Scandinavia](https://da.wikipedia.org/wiki/Koralrev#Koldtvandskoralrev).
+
+
+## License
+The dataset is licensed under a custom license, adapted from OpenRAIL-M, which allows
+commercial use with a few restrictions (on speech synthesis and biometric
+identification). See the
+[license](https://huggingface.co/datasets/alexandrainst/roest-315m/blob/main/LICENSE).
+
+
+## Creators and Funders
+The CoRal project is funded by the [Danish Innovation
+Fund](https://innovationsfonden.dk/) and consists of the following partners:
+
+- [Alexandra Institute](https://alexandra.dk/)
+- [University of Copenhagen](https://www.ku.dk/)
+- [Agency for Digital Government](https://digst.dk/)
+- [Alvenir](https://www.alvenir.ai/)
+- [Corti](https://www.corti.ai/)
+
+
+## Citation
+
+We will submit a research paper soon, but until then, if you use this model in your
+research or development, please cite it as follows:
+
+```bibtex
+@dataset{coral2024,
+  author    = {Dan Saattrup Nielsen and Sif Bernstorff Lehmann and Simon Leminen Madsen and Anders Jess Pedersen and Anna Katrine van Zee and Anders Søgaard and Torben Blach},
+  title     = {CoRal: A Diverse Danish ASR Dataset Covering Dialects, Accents, Genders, and Age Groups},
+  year      = {2024},
+  url       = {https://hf.co/datasets/alexandrainst/coral},
+}
+```
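
Note on the evaluation section of the README above: it reports mean scores over 1000 bootstrap resamples with a 95% confidence interval, but does not say how the interval is computed. The following is a minimal sketch of one way such numbers could be produced, assuming a corpus-level WER and a normal-approximation interval (1.96 standard deviations). The function names and the default seed are hypothetical and are not taken from the repository.

```python
import random


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            # deletion, insertion, substitution (or match)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(refs: list[str], hyps: list[str]) -> float:
    """Corpus-level word error rate: total word edits / total reference words."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    words = sum(len(r.split()) for r in refs)
    return errors / words


def bootstrap_wer(
    refs: list[str], hyps: list[str], num_samples: int = 1000, seed: int = 0
) -> tuple[float, float]:
    """Return (mean WER, 95% half-width) over bootstrap resamples.

    ASSUMPTION: the half-width is 1.96 * the sample standard deviation of the
    resampled scores; the README does not specify the method actually used.
    """
    rng = random.Random(seed)
    n = len(refs)
    scores = []
    for _ in range(num_samples):
        # Resample utterances with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(wer([refs[i] for i in idx], [hyps[i] for i in idx]))
    mean = sum(scores) / len(scores)
    std = (sum((s - mean) ** 2 for s in scores) / (len(scores) - 1)) ** 0.5
    return mean, 1.96 * std
```

Because insertions count as errors, WER is not bounded by 100%, which is consistent with the 103.7% reported for `openai/whisper-tiny` in the table. The same sketch yields CER if the strings are split into characters instead of words.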
diff --git a/README.md b/README.md
@@ -8,7 +8,7 @@ ______________________________________________________________________
 [![Documentation](https://img.shields.io/badge/docs-passing-green)](https://alexandrainst.github.io/coral/coral.html)
 [![License](https://img.shields.io/github/license/alexandrainst/coral)](https://github.com/alexandrainst/coral/blob/main/LICENSE)
 [![LastCommit](https://img.shields.io/github/last-commit/alexandrainst/coral)](https://github.com/alexandrainst/coral/commits/main)
-[![Code Coverage](https://img.shields.io/badge/Coverage-59%25-orange.svg)](https://github.com/alexandrainst/coral/tree/main/tests)
+[![Code Coverage](https://img.shields.io/badge/Coverage-55%25-orange.svg)](https://github.com/alexandrainst/coral/tree/main/tests)
 
 
 Developers:

diff --git a/config/asr_finetuning.yaml b/config/asr_finetuning.yaml
@@ -2,6 +2,10 @@ defaults:
   - model: wav2vec2-small
   - datasets:
     - coral
+  - decoder_datasets:
+    - wikipedia
+    - common_voice
+    - reddit
   - override hydra/job_logging: custom
   - _self_
 
@@ -10,7 +14,6 @@ seed: 4242
 evaluation_dataset:
   id: alexandrainst/coral
   subset: read_aloud
-  split: test
   val_name: val
   text_column: text
   audio_column: audio
@@ -41,7 +44,8 @@ hub_organisation: alexandrainst
 push_to_hub: false
 create_pr: false
 private: false
-fp16: true
+fp16_allowed: true
+bf16_allowed: true
 
 # Training parameters
 wandb: false
@@ -65,3 +69,6 @@ eval_steps: 500
 save_steps: 500
 early_stopping: false
 early_stopping_patience: 50
+
+# NOTE: This is automatically set to false in a multi-gpu setting
+gradient_checkpointing: true
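
Note on the new `fp16_allowed`, `bf16_allowed` and `gradient_checkpointing` options above: the names suggest that the config now states what is permitted, with the final settings decided at runtime. The sketch below shows one plausible way to resolve them. The helper name, its arguments and the preference for bf16 over fp16 are assumptions for illustration, not the repository's code. Only the rule that gradient checkpointing is switched off in a multi-GPU setting comes from the config comment.

```python
def resolve_training_flags(
    fp16_allowed: bool,
    bf16_allowed: bool,
    gradient_checkpointing: bool,
    *,
    cuda_available: bool,
    bf16_supported: bool,
    num_gpus: int,
) -> dict[str, bool]:
    """Turn 'allowed' config switches into concrete training flags.

    ASSUMPTION: bf16 is preferred when the hardware supports it, and fp16 is the
    fallback on CUDA devices. The actual selection logic is not shown in the diff.
    """
    use_bf16 = bf16_allowed and cuda_available and bf16_supported
    use_fp16 = fp16_allowed and cuda_available and not use_bf16
    return {
        "bf16": use_bf16,
        "fp16": use_fp16,
        # From the config comment: disabled automatically in a multi-GPU setting.
        "gradient_checkpointing": gradient_checkpointing and num_gpus <= 1,
    }
```

In a real run the two hardware arguments could be filled from `torch.cuda.is_available()` and `torch.cuda.is_bf16_supported()`; they are plain booleans here so the sketch has no dependencies.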
diff --git a/config/datasets/common_voice_17.yaml b/config/datasets/common_voice_17.yaml
@@ -4,5 +4,5 @@ common_voice_17:
   train_name: train
   text_column: sentence
   audio_column: audio
-  filter_dataset: true
+  filter_dataset: false
   process_dataset: true
diff --git a/config/decoder_datasets/common_voice.yaml b/config/decoder_datasets/common_voice.yaml
@@ -0,0 +1,6 @@
+common_voice:
+  id: mozilla-foundation/common_voice_17_0
+  subset: da
+  split: train
+  text_column: sentence
+  audio_column: audio
diff --git a/config/decoder_datasets/reddit.yaml b/config/decoder_datasets/reddit.yaml
@@ -0,0 +1,6 @@
+reddit:
+  id: alexandrainst/scandi-reddit
+  subset: da
+  split: train
+  text_column: doc
+  audio_column: null
diff --git a/config/decoder_datasets/wikipedia.yaml b/config/decoder_datasets/wikipedia.yaml
@@ -0,0 +1,6 @@
+wikipedia:
+  id: alexandrainst/scandi-wiki
+  subset: da
+  split: train
+  text_column: text
+  audio_column: null
diff --git a/config/model/test-wav2vec2.yaml b/config/model/test-wav2vec2.yaml
@@ -1,6 +1,6 @@
 name: test-wav2vec2
 type: wav2vec2
-pretrained_model_id: chcaa/xls-r-300m-danish
+pretrained_model_id: facebook/wav2vec2-xls-r-300m
 freeze_feature_encoder: true
 
 # Data hyperparameters
@@ -20,7 +20,7 @@ mask_time_length: 10
 mask_feature_prob: 0.5
 mask_feature_length: 64
 layerdrop: 0.1  # NOTE: This will automatically be set to 0 in a multi-gpu setting
-ctc_loss_reduction: sum
+ctc_loss_reduction: mean
 
-# Decoder hyperparameters
-decoder: null
+# Decoder hyperparameters
+use_decoder: false
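
Note on the `ctc_loss_reduction: sum` to `mean` change, which recurs in every model config in this diff: in `transformers`, this setting is passed on as the `reduction` argument of PyTorch's CTC loss. With `mean`, PyTorch divides each sample's loss by its target length and then averages over the batch. The sketch below reproduces only that arithmetic on made-up per-sample losses, to show why `mean` keeps the loss scale independent of batch size and transcript length. The function name and the numbers are hypothetical.

```python
def reduce_ctc_losses(
    losses: list[float], target_lengths: list[int], reduction: str
) -> float:
    """Combine per-sample CTC losses the way PyTorch's `reduction` modes do."""
    if reduction == "sum":
        return sum(losses)
    if reduction == "mean":
        # Normalise each loss by its target length, then average over the batch.
        normalised = [
            loss / max(length, 1) for loss, length in zip(losses, target_lengths)
        ]
        return sum(normalised) / len(normalised)
    raise ValueError(f"Unsupported reduction: {reduction!r}")


# Illustrative values only: two utterances with 10 and 30 target tokens.
losses = [30.0, 90.0]
lengths = [10, 30]
print(reduce_ctc_losses(losses, lengths, "sum"))   # grows with batch size and length
print(reduce_ctc_losses(losses, lengths, "mean"))  # stays on a per-token scale
```

A practical consequence is that a learning rate tuned under `sum` may need revisiting under `mean`, since the gradient magnitude changes with it.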
diff --git a/config/model/wav2vec2-large.yaml b/config/model/wav2vec2-large.yaml
@@ -20,12 +20,8 @@ mask_time_length: 10
 mask_feature_prob: 0.5
 mask_feature_length: 64
 layerdrop: 0.1  # NOTE: This will automatically be set to 0 in a multi-gpu setting
-ctc_loss_reduction: sum
+ctc_loss_reduction: mean
 
-# Decoder hyperparameters
-decoder:
-  dataset_id: alexandrainst/scandi-wiki
-  dataset_subset: da
-  dataset_split: train
-  text_column: text
-  n: 5
+# Decoder hyperparameters
+use_decoder: true
+decoder_num_ngrams: 3
diff --git a/config/model/wav2vec2-medium.yaml b/config/model/wav2vec2-medium.yaml
@@ -20,12 +20,8 @@ mask_time_length: 10
 mask_feature_prob: 0.5
 mask_feature_length: 64
 layerdrop: 0.1  # NOTE: This will automatically be set to 0 in a multi-gpu setting
-ctc_loss_reduction: sum
+ctc_loss_reduction: mean
 
-# Decoder hyperparameters
-decoder:
-  dataset_id: alexandrainst/scandi-wiki
-  dataset_subset: da
-  dataset_split: train
-  text_column: text
-  n: 5
+# Decoder hyperparameters
+use_decoder: true
+decoder_num_ngrams: 3
diff --git a/config/model/wav2vec2-small.yaml b/config/model/wav2vec2-small.yaml
@@ -1,6 +1,6 @@
 name: wav2vec2-small
 type: wav2vec2
-pretrained_model_id: chcaa/xls-r-300m-danish
+pretrained_model_id: facebook/wav2vec2-xls-r-300m
 freeze_feature_encoder: false
 
 # Data hyperparameters
@@ -20,12 +20,8 @@ mask_time_length: 10
 mask_feature_prob: 0.5
 mask_feature_length: 64
 layerdrop: 0.1  # NOTE: This will automatically be set to 0 in a multi-gpu setting
-ctc_loss_reduction: sum
+ctc_loss_reduction: mean
 
-# Decoder hyperparameters
-decoder:
-  dataset_id: alexandrainst/scandi-wiki
-  dataset_subset: da
-  dataset_split: train
-  text_column: text
-  n: 5
+# Decoder hyperparameters
+use_decoder: true
+decoder_num_ngrams: 3
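
Note on `decoder_num_ngrams: 3`: the model configs above replace the inline `decoder:` block (with `n: 5`) by `use_decoder` and `decoder_num_ngrams`, lowering the n-gram order from 5 to 3 while the corpus is now built from the three `decoder_datasets`. To illustrate what an order-3 model conditions on, here is a toy trigram counter with unsmoothed maximum-likelihood estimates. This is a simplification for illustration: KenLM, which the README installs, estimates smoothed (modified Kneser-Ney) models, and the function names and example sentences below are hypothetical.

```python
from collections import Counter


def count_ngrams(sentences: list[str], n: int) -> Counter:
    """Count n-grams over whitespace tokens, padding each sentence with markers."""
    counts: Counter = Counter()
    for sentence in sentences:
        tokens = ["<s>"] * (n - 1) + sentence.lower().split() + ["</s>"]
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i : i + n])] += 1
    return counts


def next_word_probability(
    counts: Counter, context: tuple[str, ...], word: str
) -> float:
    """Unsmoothed P(word | context), where context holds the previous n-1 words."""
    total = sum(c for ngram, c in counts.items() if ngram[:-1] == context)
    if total == 0:
        return 0.0  # unseen context; a real model would back off instead
    return counts[context + (word,)] / total


corpus = ["jeg bor i danmark", "jeg bor i aarhus"]
trigrams = count_ngrams(corpus, n=3)
print(next_word_probability(trigrams, ("bor", "i"), "danmark"))
```

With order 3, each word is scored given only the two preceding words, so the model is smaller and faster to query than an order-5 model at the cost of shorter context.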
diff --git a/makefile b/makefile
@@ -117,3 +117,16 @@ type-check:  ## Run type checking
 		--ignore-missing-imports \
 		--show-error-codes \
 		--check-untyped-defs
+
+roest-315m:  ## Train the Røst-315M model
+	@accelerate launch \
+		--use-deepspeed \
+		src/scripts/finetune_asr_model.py \
+		model=wav2vec2-small \
+		datasets=[coral] \
+		decoder_datasets=[wikipedia,common_voice,reddit] \
+		push_to_hub=true \
+		dataloader_num_workers=4 \
+		model_id=roest-315m-xlsr \
+		private=true \
+		per_device_batch_size=64
diff --git a/src/coral/finetune.py b/src/coral/finetune.py
@@ -5,13 +5,13 @@
 
 from omegaconf import DictConfig
 from transformers import EarlyStoppingCallback, TrainerCallback
+from wandb import finish as wandb_finish
 from wandb.sdk.wandb_init import init as wandb_init
-from wandb.sdk.wandb_run import finish as wandb_finish
 
 from .data import load_data_for_finetuning
 from .data_models import ModelSetup
 from .model_setup import load_model_setup
-from .ngram import train_ngram_model
+from .ngram import train_and_store_ngram_model
 from .utils import block_terminal_output, disable_tqdm, push_model_to_hub
 
 logger = logging.getLogger(__package__)
@@ -58,14 +58,13 @@ def finetune(config: DictConfig) -> None:
     block_terminal_output()
     with disable_tqdm():
         trainer.train(resume_from_checkpoint=config.resume_from_checkpoint)
-
     if config.wandb and is_main_process:
         wandb_finish()
 
     model.save_pretrained(save_directory=config.model_dir)
 
-    if hasattr(config.model, "decoder") and config.model.decoder is not None:
-        train_ngram_model(config=config)
+    if hasattr(config.model, "use_decoder") and config.model.use_decoder:
+        train_and_store_ngram_model(config=config)
 
     if config.push_to_hub:
         push_model_to_hub(