A speech representation model based on BYOL-A, trained in a self-supervised manner that leverages audio augmentation methods to generate robust speech representations invariant to minimal differences in audio.

The BYOL-S model was pretrained on the speech-only subset of AudioSet. We further modified the BYOL-S network to learn from both handcrafted (openSMILE) features and data-driven features, yielding Hybrid BYOL-S. Hybrid BYOL-S outperformed its predecessor BYOL-S on most tasks in the HEAR Competition 2021.

In this repo, we provide the weights for BYOL-S and Hybrid BYOL-S, along with a package called serab-byols to facilitate generating these speech representations.
- A quick demo demonstrating the extraction of Hybrid BYOL-S embeddings is available as a Colab notebook.
Tested with Python 3.7 and 3.8.
Method: pip local source tree
```shell
git clone https://github.com/GasserElbanna/serab-byols.git
python3 -m pip install -e ./serab-byols
```
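To verify the installation, you can run a quick import check (a minimal sketch; the module name `serab_byols` is taken from the usage examples below):

```shell
python3 -c "import serab_byols; print('serab_byols imported successfully')"
```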
The BYOL-S model takes as input log-scaled Mel-frequency spectrograms computed with a 64-band Mel filterbank. Each frame of the spectrogram is then projected to 2048 dimensions using a pretrained encoder. Weights for the projection matrix were generated by training the BYOL-S network and are stored in this repository in the `checkpoints` directory.
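For reference, a log-mel front end along these lines can be sketched with torchaudio (a minimal sketch, not the package's exact preprocessing; the FFT size and hop length shown are assumptions borrowed from typical BYOL-A configurations):

```python
import torch
import torchaudio

# Assumed parameters (common BYOL-A-style settings, not verified against this repo):
# 16 kHz audio, 64 mel bands, 1024-point FFT, 10 ms hop.
to_melspec = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,
    n_fft=1024,
    hop_length=160,
    n_mels=64,
)

waveform = torch.rand(1, 16000 * 2)  # 2 s of dummy audio
mel = to_melspec(waveform)
log_mel = (mel + torch.finfo(torch.float).eps).log()  # log-scaled spectrogram
print(log_mel.shape)  # (1, 64, n_frames)
```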
The (Hybrid) BYOL-S model has been trained with different encoder architectures:
- AudioNTT: Original encoder used in BYOL-A
- Resnetish34: Adapted from this repo
- CLSTM: Inspired by this paper
- CvT: Adapted from this repo
- BYOL-S/AudioNTT: `checkpoints/default2048_BYOLAs64x96-2105311814-e100-bs256-lr0003-rs42.pth`
- BYOL-S/Resnetish34: `checkpoints/resnetish34_BYOLAs64x96-2105271915-e100-bs256-lr0003-rs42.pth`
- Hybrid BYOL-S/CvT (Best Model): `checkpoints/cvt_s1-d1-e64_s2-d1-e256_s3-d1-e512_BYOLAs64x96-osandbyolaloss6373-e100-bs256-lr0003-rs42.pth`
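For convenience, these checkpoints can be paired with the model-name strings expected by `load_model` (a minimal sketch; the `'default'` and `'cvt'` names are taken from the usage examples below, while `'resnetish34'` is an assumption following the same pattern):

```python
# Hypothetical mapping from model name to checkpoint path; 'default' and 'cvt'
# come from the usage examples below, 'resnetish34' is assumed.
CHECKPOINTS = {
    'default': 'serab-byols/checkpoints/default2048_BYOLAs64x96-2105311814-e100-bs256-lr0003-rs42.pth',
    'resnetish34': 'serab-byols/checkpoints/resnetish34_BYOLAs64x96-2105271915-e100-bs256-lr0003-rs42.pth',
    'cvt': 'serab-byols/checkpoints/cvt_s1-d1-e64_s2-d1-e256_s3-d1-e512_BYOLAs64x96-osandbyolaloss6373-e100-bs256-lr0003-rs42.pth',
}
```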
Audio embeddings can be computed using one of two methods: 1) `get_scene_embeddings`, or 2) `get_timestamp_embeddings`.

`get_scene_embeddings` accepts a batch of audio clips (a list of torch tensors) and generates a single embedding for each audio clip. This can be computed as shown below:
```python
import torch
import serab_byols

model_name = 'cvt'
checkpoint_path = "serab-byols/checkpoints/cvt_s1-d1-e64_s2-d1-e256_s3-d1-e512_BYOLAs64x96-osandbyolaloss6373-e100-bs256-lr0003-rs42.pth"

# Load model with weights - located in the root directory of this repo
model = serab_byols.load_model(checkpoint_path, model_name)

# Create a batch of 2 white noise clips that are 2-seconds long as a dummy example
# and compute scene embeddings for each clip
audio = torch.rand((2, model.sample_rate * 2))
embeddings = serab_byols.get_scene_embeddings(audio, model)
```
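The returned `embeddings` tensor has one row per input clip; the embedding dimensionality depends on the chosen encoder (2048 for the AudioNTT encoder, as described above).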
The `get_timestamp_embeddings` method works the same way but returns an array of embeddings computed from audio segments taken every 50 ms (configurable) over the duration of the input audio, along with an array of timestamps corresponding to each embedding.
```python
import torch
import serab_byols

model_name = 'default'
checkpoint_path = 'serab-byols/checkpoints/default2048_BYOLAs64x96-2105311814-e100-bs256-lr0003-rs42.pth'

# Load model with weights - located in the root directory of this repo
model = serab_byols.load_model(checkpoint_path, model_name)

# Create a batch of 2 white noise clips that are 2-seconds long as a dummy example
# and compute timestamp embeddings for each clip
frame_duration = 1000  # ms
hop_size = 50  # ms
audio = torch.rand((2, model.sample_rate * 2))
embeddings, timestamps = serab_byols.get_timestamp_embeddings(audio, model, frame_duration, hop_size)
```
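Here, each clip is split into 1000 ms frames taken every 50 ms, and `embeddings` holds one embedding per frame; `timestamps` (presumably in milliseconds, matching the units of `hop_size`) marks the position of each frame within its clip.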
NOTE: All BYOL-S variants were pretrained on audio sampled at 16 kHz. Make sure to resample your dataset to 16 kHz to be compatible with the model's requirements.
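Resampling can be done with torchaudio, for example (a minimal sketch; the file name is a placeholder):

```python
import torchaudio

# Load an audio file at its native sampling rate ('speech.wav' is a placeholder)
waveform, orig_sr = torchaudio.load('speech.wav')

# Resample to the 16 kHz rate expected by the BYOL-S models
if orig_sr != 16000:
    waveform = torchaudio.functional.resample(waveform, orig_freq=orig_sr, new_freq=16000)
```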
If you are using this package, please cite these papers:
```bibtex
@inproceedings{elbanna2022byol,
  title={Byol-s: Learning self-supervised speech representations by bootstrapping},
  author={Elbanna, Gasser and Scheidwasser-Clow, Neil and Kegler, Mikolaj and Beckmann, Pierre and El Hajal, Karl and Cernak, Milos},
  booktitle={HEAR: Holistic Evaluation of Audio Representations},
  pages={25--47},
  year={2022},
  organization={PMLR}
}

@article{elbanna2022hybrid,
  title={Hybrid Handcrafted and Learnable Audio Representation for Analysis of Speech Under Cognitive and Physical Load},
  author={Elbanna, Gasser and Biryukov, Alice and Scheidwasser-Clow, Neil and Orlandic, Lara and Mainar, Pablo and Kegler, Mikolaj and Beckmann, Pierre and Cernak, Milos},
  journal={arXiv preprint arXiv:2203.16637},
  year={2022}
}
```