Feat/allow multiple decoder datasets #97

Merged 73 commits on Sep 27, 2024
9d7d6b6
feat: Allow multiple decoder datasets
saattrupdan Sep 24, 2024
1313c8b
fix: Use asr_finetuning instead of finetuning
saattrupdan Sep 24, 2024
7f0733a
fix: Add decoder_num_ngrams to config
saattrupdan Sep 24, 2024
9776ffa
fix: Convert to dataset before filtering
saattrupdan Sep 24, 2024
0ae4f78
fix: Specify "text" being the text_column in ngram
saattrupdan Sep 24, 2024
e3204ff
fix: Use concatenate_datasets instead of interleaving
saattrupdan Sep 24, 2024
4da97a2
docs: Add logging
saattrupdan Sep 24, 2024
5c843f9
chore: Concatenate and shuffle datasets seperately
saattrupdan Sep 24, 2024
46fbfe6
debug
saattrupdan Sep 24, 2024
8603f52
feat: Remove evaluation sentences in parallel
saattrupdan Sep 24, 2024
1b2f18f
docs: Improve comments
saattrupdan Sep 24, 2024
63f3aba
feat: Change default order of kenlm to 3
saattrupdan Sep 24, 2024
2d2b324
debug
saattrupdan Sep 24, 2024
285698b
fix: Do not filter eval dataset when loading during ngram training
saattrupdan Sep 24, 2024
7fa1a5e
debug
saattrupdan Sep 24, 2024
da81818
fix: Pruning arguments
saattrupdan Sep 24, 2024
3d4d35d
chore: Remove breakpoints
saattrupdan Sep 24, 2024
b8a3584
fix: Use config seed
saattrupdan Sep 24, 2024
604bad8
fix: Specify kenlm memory limit and temp file location
saattrupdan Sep 24, 2024
60459a2
chore: Add logging regarding number of sentences removed
saattrupdan Sep 24, 2024
4909e3b
chore: Add dedup logging
saattrupdan Sep 24, 2024
f9c12a6
chore: Log number of sentences
saattrupdan Sep 24, 2024
f82b392
chore: More logging
saattrupdan Sep 24, 2024
b67f54b
fix: Set kenlm temp dir to cache dir
saattrupdan Sep 24, 2024
e0f3f39
fix: Add del statements in ngram
saattrupdan Sep 25, 2024
deaad38
refactor: Move `get_sentence_corpus_path` into separate function
saattrupdan Sep 25, 2024
bda2c9d
refactor: Separate functions for training and storing the ngram models
saattrupdan Sep 25, 2024
6a7320d
refactor: More refactoring of ngram
saattrupdan Sep 25, 2024
f398954
chore: Change pretrained_model_id for wav2vec2-small to xls-r-300m
saattrupdan Sep 25, 2024
11c89d6
fix: Use train_and_store_ngram_model rather than train_ngram_model
saattrupdan Sep 25, 2024
66f9eba
fix: Use tuple instead of list when hashing
saattrupdan Sep 25, 2024
9209760
refactor: Simplify
saattrupdan Sep 25, 2024
eafe789
fix: Lower n_jobs from -1 to -2
saattrupdan Sep 25, 2024
6549ea7
style: Nicer logging
saattrupdan Sep 25, 2024
856a2a1
chore: Do not filter common_voice_17, as it already satisfies the req…
saattrupdan Sep 25, 2024
77e4e05
fix: Replace train_ngram_model with train_and_store_ngram_model
saattrupdan Sep 26, 2024
0659036
debug: Try disabling gradient_checkpointing
saattrupdan Sep 26, 2024
6436b88
fix: Disable gradient checkpointing in a multi-GPU setup
saattrupdan Sep 26, 2024
a9123ac
fix: Check for multi-GPU using torch.cuda.device_count as well
saattrupdan Sep 26, 2024
5100bf5
debug: Check if layerdrop and padding actually need changing in a mul…
saattrupdan Sep 26, 2024
deff165
fix: Do not change layerdrop and padding in a multi-GPU setup, as it …
saattrupdan Sep 26, 2024
d394a85
chore: Revert
saattrupdan Sep 26, 2024
e31c03b
debug: Check if gradient_checkpointing is needed
saattrupdan Sep 26, 2024
9cb943f
docs: Add multi-GPU usage to module docstring of finetune_asr_model
saattrupdan Sep 26, 2024
4b6295c
feat: Nudge user to use `accelerate` when multiple GPUs are available
saattrupdan Sep 26, 2024
1cb7642
feat: Allow usage of bf16
saattrupdan Sep 27, 2024
03be002
fix: Change ctc_loss_reduction default to mean
saattrupdan Sep 27, 2024
706eda6
chore: Only log if main process
saattrupdan Sep 27, 2024
aa3a1ce
tests: Config name
saattrupdan Sep 27, 2024
d252bed
tests: Configs
saattrupdan Sep 27, 2024
50336a5
chore: Add `make roest-315m`
saattrupdan Sep 27, 2024
0b1f256
docs: Update code coverage
saattrupdan Sep 27, 2024
c1cff81
fix: Ensure cache_dir is not None
saattrupdan Sep 27, 2024
1744ad5
fix: Install kenlm dependencies
saattrupdan Sep 27, 2024
7a39b88
chore: Update gitignore
saattrupdan Sep 27, 2024
8b48e86
chore: Gitignore cleanup
saattrupdan Sep 27, 2024
4398dae
fix: Add --no-upgrade to apt-get install
saattrupdan Sep 27, 2024
3706b81
chore: Temporarily comment out training, to just push to hub
saattrupdan Sep 27, 2024
4e30fc7
debug
saattrupdan Sep 27, 2024
1202776
debug
saattrupdan Sep 27, 2024
d0ce95c
debug
saattrupdan Sep 27, 2024
3088406
debug
saattrupdan Sep 27, 2024
b7a4ed1
debug
saattrupdan Sep 27, 2024
8d1c39e
debug
saattrupdan Sep 27, 2024
9dea54a
debug
saattrupdan Sep 27, 2024
c459e52
debug
saattrupdan Sep 27, 2024
00fffed
debug
saattrupdan Sep 27, 2024
523f999
debug
saattrupdan Sep 27, 2024
1ff96f1
docs: Add XLS-R readme
saattrupdan Sep 27, 2024
6b4a97b
docs: Update model readmes
saattrupdan Sep 27, 2024
505ac0b
docs: Update plots
saattrupdan Sep 27, 2024
098931c
debug
saattrupdan Sep 27, 2024
fe3b0ea
chore: Revert to pre-debugging
saattrupdan Sep 27, 2024
2 changes: 2 additions & 0 deletions MODEL_315M_README.md
@@ -3,6 +3,8 @@
This is a Danish state-of-the-art speech recognition model, trained by [the Alexandra
Institute](https://alexandra.dk/).

Try it out in [our interactive demo](https://huggingface.co/spaces/alexandrainst/roest-demo)!


## Quick Start
Start by installing the required libraries:
160 changes: 160 additions & 0 deletions MODEL_315M_XLSR_README.md
@@ -0,0 +1,160 @@
# Røst-315m

This is a Danish state-of-the-art speech recognition model, trained by [the Alexandra
Institute](https://alexandra.dk/).

Try it out in [our interactive demo](https://huggingface.co/spaces/alexandrainst/roest-demo)!


## Quick Start
Start by installing the required libraries:

```shell
$ pip install transformers kenlm pyctcdecode
```

Next, you can run the model with the `transformers` Python package as follows:

```python
>>> from transformers import pipeline
>>> audio = get_audio() # 16kHz raw audio array
>>> transcriber = pipeline(model="alexandrainst/roest-315m")
>>> transcriber(audio)
{'text': 'your transcription'}
```


## Evaluation Results

We have evaluated both our model and existing models on the CoRal test set as well as the
Danish Common Voice 17 test set. To make the evaluation as robust as possible, we
bootstrapped the results 1,000 times and report the mean scores along with a 95%
confidence interval (lower is better; best scores in **bold**, second-best in
*italics*):

| Model | Number of parameters | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) WER | [Danish Common Voice 17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0/viewer/da/test) CER | [Danish Common Voice 17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0/viewer/da/test) WER |
|:---|---:|---:|---:|---:|---:|
| Røst-315m (this model) | 315M | **6.6%** | **17.0%** | 6.6% ± 0.6% | 16.7% ± 0.8% |
| [chcaa/xls-r-300m-danish-nst-cv9](https://hf.co/chcaa/xls-r-300m-danish-nst-cv9) | 315M | 14.4% ± 0.3% | 36.5% ± 0.6% | **4.1% ± 0.5%** | **12.0% ± 0.8%** |
| [mhenrichsen/hviske](https://hf.co/mhenrichsen/hviske) | 1540M | 14.2% ± 0.5% | 33.2% ± 0.7% | *5.2% ± 0.4%* | *14.2% ± 0.8%* |
| [openai/whisper-large-v3](https://hf.co/openai/whisper-large-v3) | 1540M | *11.4% ± 0.3%* | *28.3% ± 0.6%* | *5.5% ± 0.4%* | *14.8% ± 0.8%* |
| [openai/whisper-large-v2](https://hf.co/openai/whisper-large-v2) | 1540M | 13.9% ± 0.9% | 32.6% ± 1.2% | 7.2% ± 0.5% | 18.5% ± 0.9% |
| [openai/whisper-large](https://hf.co/openai/whisper-large) | 1540M | 14.5% ± 0.3% | 35.4% ± 0.6% | 9.2% ± 0.5% | 22.9% ± 1.0% |
| [openai/whisper-medium](https://hf.co/openai/whisper-medium) | 764M | 17.2% ± 1.3% | 40.5% ± 2.1% | 9.4% ± 0.5% | 24.0% ± 1.0% |
| [openai/whisper-small](https://hf.co/openai/whisper-small) | 242M | 23.4% ± 1.2% | 55.2% ± 2.3% | 15.9% ± 1.0% | 38.9% ± 1.2% |
| [openai/whisper-base](https://hf.co/openai/whisper-base) | 73M | 43.5% ± 3.1% | 89.3% ± 4.6% | 33.4% ± 4.7% | 71.4% ± 7.0% |
| [openai/whisper-tiny](https://hf.co/openai/whisper-tiny) | 38M | 52.0% ± 2.5% | 103.7% ± 3.5% | 42.2% ± 3.9% | 83.6% ± 2.7% |
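
The bootstrapping mentioned above can be illustrated with a small sketch. It assumes that per-utterance error rates are already available and that the interval is reported as the mean ± 1.96 standard deviations of the bootstrap means; the function name `bootstrap_mean_ci` and the toy scores are hypothetical, and the procedure behind the table may differ.

```python
import random
import statistics


def bootstrap_mean_ci(scores, num_bootstraps=1000, seed=4242):
    """Return (mean, half_width) of a bootstrapped 95% confidence interval.

    Assumption: the interval is mean ± 1.96 standard deviations of the
    bootstrap means; the evaluation behind the table may do this differently.
    """
    rng = random.Random(seed)
    n = len(scores)
    # Resample the per-utterance scores with replacement and record each mean
    bootstrap_means = [
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(num_bootstraps)
    ]
    mean = statistics.fmean(bootstrap_means)
    half_width = 1.96 * statistics.stdev(bootstrap_means)
    return mean, half_width


# Hypothetical per-utterance character error rates, not real results
toy_cers = [0.05, 0.10, 0.00, 0.20, 0.07, 0.03, 0.12, 0.08]
mean, half_width = bootstrap_mean_ci(toy_cers)
print(f"{mean:.1%} ± {half_width:.1%}")
```

Fixing the seed keeps the reported interval reproducible between runs.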


### Detailed Evaluation Across Demographics on the CoRal Test Set

![CER comparison plot](https://filedn.com/lRBwPhPxgV74tO0rDoe8SpH/coral/roest-xlsr-comparison-cer-plot.png)
![WER comparison plot](https://filedn.com/lRBwPhPxgV74tO0rDoe8SpH/coral/roest-xlsr-comparison-wer-plot.png)


## Training Data

This model is the result of three different stages of training:

1. "Pretraining" on 436,000 hours of unlabelled multilingual publicly available data,
13,628 hours of which is Danish. Pretraining here means that the model learnt to
"fill in" gaps of raw audio - no transcriptions were used (or available) during
this process. The pretraining data is distributed as follows:
- 372,000 hours from [VoxPopuli](https://aclanthology.org/2021.acl-long.80/), being
speeches from the European Parliament in 23 European languages.
This includes 13,600 hours of Danish speech.
- 51,000 hours from [Multilingual
LibriSpeech](https://doi.org/10.21437/Interspeech.2020-2826), being audiobooks in
8 European languages. This does not include any Danish speech.
- 7,000 hours from [Common Voice 6](https://doi.org/10.48550/arXiv.1912.06670),
being read-aloud speech in 60 diverse languages. This does not include any Danish
speech.
- 6,600 hours from [VoxLingua107](https://doi.org/10.1109/SLT48900.2021.9383459),
being audio from YouTube videos in 107 languages. This includes 28 hours of
Danish speech.
- 1,000 hours from [BABEL](https://eprints.whiterose.ac.uk/152840/), being
conversational telephone speech in 17 African and Asian languages. This does not
include any Danish speech.
2. "Finetuning" on 373 hours of labelled Danish publicly available data. "Finetuning"
indicates that this stage of training was supervised, i.e. the model was trained on
both audio and transcriptions to perform the speech-to-text task (also known as
automatic speech recognition). The finetuning data is as follows:
- The read-aloud training split of the [CoRal
dataset](https://huggingface.co/datasets/alexandrainst/coral) (revision
fb20199b3966d3373e0d3a5ded2c5920c70de99c), consisting of 361 hours of Danish
read-aloud speech, diverse across dialects, accents, ages and genders.
- The Danish training split of the [Common Voice 17
dataset](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0),
consisting of 12 hours of Danish read-aloud speech.
3. An n-gram language model has been trained separately, and is used to guide the
transcription generation of the finetuned speech recognition model. This n-gram
language model has been trained on the following datasets:
- [Danish
Wikipedia](https://huggingface.co/datasets/alexandrainst/scandi-wiki/viewer/da)
(approximately 287,000 articles).
- [Danish Common Voice 17 training
split](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0/viewer/da)
(approximately 3,500 sentences).
- [Danish
Reddit](https://huggingface.co/datasets/alexandrainst/scandi-reddit/viewer/da)
(approximately 5 million comments).
Note that all samples from the CoRal test dataset have been removed from all of
these datasets, to ensure that the n-gram model has not seen the test data.
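
The removal of test sentences described above can be sketched as follows. This is a minimal illustration rather than the project's implementation: the function name `remove_evaluation_sentences` and the lowercase/whitespace normalisation are assumptions.

```python
def normalise(sentence: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalisation)."""
    return " ".join(sentence.lower().split())


def remove_evaluation_sentences(corpus, evaluation_sentences):
    """Drop every corpus sentence that also occurs in the evaluation set."""
    # A set gives constant-time membership checks, which matters for corpora
    # with millions of sentences
    held_out = {normalise(sentence) for sentence in evaluation_sentences}
    return [s for s in corpus if normalise(s) not in held_out]


corpus = ["Jeg bor i Aarhus.", "Det regner i dag.", "jeg  bor i aarhus."]
evaluation_sentences = ["Jeg bor i Aarhus."]
print(remove_evaluation_sentences(corpus, evaluation_sentences))
# → ['Det regner i dag.']
```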

The first stage was carried out by [Babu et al.
(2021)](https://doi.org/10.48550/arXiv.2111.09296) and the second and third stages by
[Nielsen et al. (2024)](https://huggingface.co/alexandrainst/roest-315m).

The final product combines the finetuned model with the n-gram model, and this
combination is what runs when you use the model as described in the Quick Start
section above.


## Intended use cases

This model is intended to be used for Danish automatic speech recognition.

Note that biometric identification is not allowed with the CoRal dataset or with models
derived from it. For more information, see addition 4 in our
[license](https://huggingface.co/datasets/alexandrainst/roest-315m/blob/main/LICENSE).


## Why the name Røst?

Røst is both the [Danish word for the human
voice](https://ordnet.dk/ddo/ordbog?query=r%C3%B8st) and the name of [one
of the cold-water coral reefs in
Scandinavia](https://da.wikipedia.org/wiki/Koralrev#Koldtvandskoralrev).


## License
The dataset is licensed under a custom license, adapted from OpenRAIL-M, which allows
commercial use with a few restrictions (speech synthesis and biometric identification).
See
[license](https://huggingface.co/datasets/alexandrainst/roest-315m/blob/main/LICENSE).


## Creators and Funders
The CoRal project is funded by the [Danish Innovation
Fund](https://innovationsfonden.dk/) and consists of the following partners:

- [Alexandra Institute](https://alexandra.dk/)
- [University of Copenhagen](https://www.ku.dk/)
- [Agency for Digital Government](https://digst.dk/)
- [Alvenir](https://www.alvenir.ai/)
- [Corti](https://www.corti.ai/)


## Citation

We will submit a research paper soon, but until then, if you use this model in your
research or development, please cite it as follows:

```bibtex
@dataset{coral2024,
author = {Dan Saattrup Nielsen and Sif Bernstorff Lehmann and Simon Leminen Madsen and Anders Jess Pedersen and Anna Katrine van Zee and Anders Søgaard and Torben Blach},
title = {CoRal: A Diverse Danish ASR Dataset Covering Dialects, Accents, Genders, and Age Groups},
year = {2024},
url = {https://hf.co/datasets/alexandrainst/coral},
}
```
2 changes: 1 addition & 1 deletion README.md
@@ -8,7 +8,7 @@ ______________________________________________________________________
[![Documentation](https://img.shields.io/badge/docs-passing-green)](https://alexandrainst.github.io/coral/coral.html)
[![License](https://img.shields.io/github/license/alexandrainst/coral)](https://github.com/alexandrainst/coral/blob/main/LICENSE)
[![LastCommit](https://img.shields.io/github/last-commit/alexandrainst/coral)](https://github.com/alexandrainst/coral/commits/main)
[![Code Coverage](https://img.shields.io/badge/Coverage-59%25-orange.svg)](https://github.com/alexandrainst/coral/tree/main/tests)
[![Code Coverage](https://img.shields.io/badge/Coverage-55%25-orange.svg)](https://github.com/alexandrainst/coral/tree/main/tests)


Developers:
11 changes: 9 additions & 2 deletions config/asr_finetuning.yaml
@@ -2,6 +2,10 @@ defaults:
- model: wav2vec2-small
- datasets:
- coral
- decoder_datasets:
- wikipedia
- common_voice
- reddit
- override hydra/job_logging: custom
- _self_

@@ -10,7 +14,6 @@ seed: 4242
evaluation_dataset:
id: alexandrainst/coral
subset: read_aloud
split: test
val_name: val
text_column: text
audio_column: audio
@@ -41,7 +44,8 @@ hub_organisation: alexandrainst
push_to_hub: false
create_pr: false
private: false
fp16: true
fp16_allowed: true
bf16_allowed: true

# Training parameters
wandb: false
@@ -65,3 +69,6 @@ eval_steps: 500
save_steps: 500
early_stopping: false
early_stopping_patience: 50

# NOTE: This is automatically set to false in a multi-gpu setting
gradient_checkpointing: true
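
The `fp16_allowed`/`bf16_allowed` flags and the multi-GPU note on `gradient_checkpointing` suggest that the final settings are resolved at runtime. Below is a minimal sketch of such a resolution, assuming bf16 is preferred when the hardware supports it. The function `resolve_training_flags` and its parameters are hypothetical and not taken from the repository.

```python
from dataclasses import dataclass


@dataclass
class TrainingFlags:
    fp16: bool
    bf16: bool
    gradient_checkpointing: bool


def resolve_training_flags(
    fp16_allowed: bool,
    bf16_allowed: bool,
    gradient_checkpointing: bool,
    num_gpus: int,
    bf16_supported: bool,
) -> TrainingFlags:
    """Turn the "allowed" config flags into concrete training flags."""
    has_gpu = num_gpus > 0
    # Assumption: bf16 wins over fp16 when both are allowed and supported
    bf16 = bf16_allowed and has_gpu and bf16_supported
    fp16 = fp16_allowed and has_gpu and not bf16
    # Mirrors the config note: disabled when more than one GPU is in use
    checkpointing = gradient_checkpointing and num_gpus <= 1
    return TrainingFlags(fp16=fp16, bf16=bf16, gradient_checkpointing=checkpointing)


print(resolve_training_flags(True, True, True, num_gpus=2, bf16_supported=False))
# → TrainingFlags(fp16=True, bf16=False, gradient_checkpointing=False)
```

In practice the two hardware inputs could come from `torch.cuda.device_count()` and `torch.cuda.is_bf16_supported()`; they are passed in as plain arguments here so the sketch runs without a GPU.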
2 changes: 1 addition & 1 deletion config/datasets/common_voice_17.yaml
@@ -4,5 +4,5 @@ common_voice_17:
train_name: train
text_column: sentence
audio_column: audio
filter_dataset: true
filter_dataset: false
process_dataset: true
6 changes: 6 additions & 0 deletions config/decoder_datasets/common_voice.yaml
@@ -0,0 +1,6 @@
common_voice:
id: mozilla-foundation/common_voice_17_0
subset: da
split: train
text_column: sentence
audio_column: audio
6 changes: 6 additions & 0 deletions config/decoder_datasets/reddit.yaml
@@ -0,0 +1,6 @@
reddit:
id: alexandrainst/scandi-reddit
subset: da
split: train
text_column: doc
audio_column: null
6 changes: 6 additions & 0 deletions config/decoder_datasets/wikipedia.yaml
@@ -0,0 +1,6 @@
wikipedia:
id: alexandrainst/scandi-wiki
subset: da
split: train
text_column: text
audio_column: null
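
The three decoder dataset configs differ in their `text_column` (`sentence`, `doc`, `text`), so the corpora need a common column before they can be combined for n-gram training. Here is a rough sketch of that step, using plain Python lists in place of Hugging Face datasets. `build_sentence_corpus` and the toy rows are hypothetical, and the de-duplication and shuffle order are assumptions guided by the commit messages.

```python
import random


def build_sentence_corpus(datasets, text_columns, seed=4242):
    """Merge several datasets into one shuffled, de-duplicated sentence list."""
    sentences = []
    for name, rows in datasets.items():
        column = text_columns[name]
        # Read each dataset through its own text column
        sentences.extend(row[column] for row in rows)
    # dict.fromkeys removes duplicates while keeping first-seen order
    unique_sentences = list(dict.fromkeys(sentences))
    random.Random(seed).shuffle(unique_sentences)
    return unique_sentences


datasets = {
    "wikipedia": [{"text": "Danmark er et land."}],
    "common_voice": [{"sentence": "Hvad hedder du?"}],
    "reddit": [{"doc": "Hvad hedder du?"}, {"doc": "Det lyder godt."}],
}
text_columns = {"wikipedia": "text", "common_voice": "sentence", "reddit": "doc"}
corpus = build_sentence_corpus(datasets, text_columns)
print(len(corpus))
# → 3
```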
8 changes: 4 additions & 4 deletions config/model/test-wav2vec2.yaml
@@ -1,6 +1,6 @@
name: test-wav2vec2
type: wav2vec2
pretrained_model_id: chcaa/xls-r-300m-danish
pretrained_model_id: facebook/wav2vec2-xls-r-300m
freeze_feature_encoder: true

# Data hyperparameters
@@ -20,7 +20,7 @@ mask_time_length: 10
mask_feature_prob: 0.5
mask_feature_length: 64
layerdrop: 0.1 # NOTE: This will automatically be set to 0 in a multi-gpu setting
ctc_loss_reduction: sum
ctc_loss_reduction: mean

# Decoder hyperparameters
decoder: null
# Decoder hyperparameters
use_decoder: false
12 changes: 4 additions & 8 deletions config/model/wav2vec2-large.yaml
@@ -20,12 +20,8 @@ mask_time_length: 10
mask_feature_prob: 0.5
mask_feature_length: 64
layerdrop: 0.1 # NOTE: This will automatically be set to 0 in a multi-gpu setting
ctc_loss_reduction: sum
ctc_loss_reduction: mean

# Decoder hyperparameters
decoder:
dataset_id: alexandrainst/scandi-wiki
dataset_subset: da
dataset_split: train
text_column: text
n: 5
# Decoder hyperparameters
use_decoder: true
decoder_num_ngrams: 3
12 changes: 4 additions & 8 deletions config/model/wav2vec2-medium.yaml
@@ -20,12 +20,8 @@ mask_time_length: 10
mask_feature_prob: 0.5
mask_feature_length: 64
layerdrop: 0.1 # NOTE: This will automatically be set to 0 in a multi-gpu setting
ctc_loss_reduction: sum
ctc_loss_reduction: mean

# Decoder hyperparameters
decoder:
dataset_id: alexandrainst/scandi-wiki
dataset_subset: da
dataset_split: train
text_column: text
n: 5
# Decoder hyperparameters
use_decoder: true
decoder_num_ngrams: 3
14 changes: 5 additions & 9 deletions config/model/wav2vec2-small.yaml
@@ -1,6 +1,6 @@
name: wav2vec2-small
type: wav2vec2
pretrained_model_id: chcaa/xls-r-300m-danish
pretrained_model_id: facebook/wav2vec2-xls-r-300m
freeze_feature_encoder: false

# Data hyperparameters
@@ -20,12 +20,8 @@ mask_time_length: 10
mask_feature_prob: 0.5
mask_feature_length: 64
layerdrop: 0.1 # NOTE: This will automatically be set to 0 in a multi-gpu setting
ctc_loss_reduction: sum
ctc_loss_reduction: mean

# Decoder hyperparameters
decoder:
dataset_id: alexandrainst/scandi-wiki
dataset_subset: da
dataset_split: train
text_column: text
n: 5
# Decoder hyperparameters
use_decoder: true
decoder_num_ngrams: 3
13 changes: 13 additions & 0 deletions makefile
@@ -117,3 +117,16 @@ type-check: ## Run type checking
--ignore-missing-imports \
--show-error-codes \
--check-untyped-defs

roest-315m: ## Train the Røst-315M model
@accelerate launch \
--use-deepspeed \
src/scripts/finetune_asr_model.py \
model=wav2vec2-small \
datasets=[coral] \
decoder_datasets=[wikipedia,common_voice,reddit] \
push_to_hub=true \
dataloader_num_workers=4 \
model_id=roest-315m-xlsr \
private=true \
per_device_batch_size=64
9 changes: 4 additions & 5 deletions src/coral/finetune.py
@@ -5,13 +5,13 @@

from omegaconf import DictConfig
from transformers import EarlyStoppingCallback, TrainerCallback
from wandb import finish as wandb_finish
from wandb.sdk.wandb_init import init as wandb_init
from wandb.sdk.wandb_run import finish as wandb_finish

from .data import load_data_for_finetuning
from .data_models import ModelSetup
from .model_setup import load_model_setup
from .ngram import train_ngram_model
from .ngram import train_and_store_ngram_model
from .utils import block_terminal_output, disable_tqdm, push_model_to_hub

logger = logging.getLogger(__package__)
@@ -58,14 +58,13 @@ def finetune(config: DictConfig) -> None:
block_terminal_output()
with disable_tqdm():
trainer.train(resume_from_checkpoint=config.resume_from_checkpoint)

if config.wandb and is_main_process:
wandb_finish()

model.save_pretrained(save_directory=config.model_dir)

if hasattr(config.model, "decoder") and config.model.decoder is not None:
train_ngram_model(config=config)
if hasattr(config.model, "use_decoder") and config.model.use_decoder:
train_and_store_ngram_model(config=config)

if config.push_to_hub:
push_model_to_hub(