A speech representation model based on BYOL-A, trained in a self-supervised manner that leverages audio augmentation methods to generate robust speech representations invariant to minimal differences in audio.

The BYOL-S model was pretrained on the speech-only subset of AudioSet. We further modified the BYOL-S network to learn from both handcrafted (openSMILE) features and data-driven features, yielding Hybrid BYOL-S. Hybrid BYOL-S outperformed its predecessor BYOL-S on most tasks in the HEAR Competition 2021.

In this repo, we provide the weights for BYOL-S and Hybrid BYOL-S, along with a package called serab-byols to facilitate generating these speech representations.
- A quick demo demonstrating the extraction of Hybrid BYOL-S embeddings is available as a Colab notebook.
Tested with Python 3.7 and 3.8.
Method: pip local source tree
```shell
git clone https://github.com/GasserElbanna/serab-byols.git
python3 -m pip install -e ./serab-byols
```
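To verify the installation, you can run a quick import check (a minimal sketch; the module name `serab_byols` is taken from the usage examples below):

```shell
python3 -c "import serab_byols; print('serab_byols imported successfully')"
```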
The BYOL-S model takes as input log-scaled Mel-frequency spectrograms computed with a 64-band Mel filterbank. Each frame of the spectrogram is then projected to 2048 dimensions using a pretrained encoder. Weights for the projection matrix were generated by training the BYOL-S network and are stored in this repository in the `checkpoints` directory.
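For reference, a log-mel front end along these lines can be sketched with torchaudio (a minimal sketch, not the package's exact preprocessing; the FFT size and hop length shown are assumptions borrowed from typical BYOL-A configurations):

```python
import torch
import torchaudio

# Assumed parameters (common BYOL-A-style settings, not verified against this repo):
# 16 kHz audio, 64 mel bands, 1024-point FFT, 10 ms hop.
to_melspec = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,
    n_fft=1024,
    hop_length=160,
    n_mels=64,
)

waveform = torch.rand(1, 16000 * 2)  # 2 s of dummy audio
mel = to_melspec(waveform)
log_mel = (mel + torch.finfo(torch.float).eps).log()  # log-scaled spectrogram
print(log_mel.shape)  # (1, 64, n_frames)
```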
The (Hybrid) BYOL-S model has been trained with different encoder architectures:
- AudioNTT: Original encoder used in BYOL-A
- Resnetish34: Adapted from this repo
- CLSTM: Inspired by this paper
- CvT: Adapted from this repo
- BYOL-S/AudioNTT: `checkpoints/default2048_BYOLAs64x96-2105311814-e100-bs256-lr0003-rs42.pth`
- BYOL-S/Resnetish34: `checkpoints/resnetish34_BYOLAs64x96-2105271915-e100-bs256-lr0003-rs42.pth`
- Hybrid BYOL-S/CvT (Best Model): `checkpoints/cvt_s1-d1-e64_s2-d1-e256_s3-d1-e512_BYOLAs64x96-osandbyolaloss6373-e100-bs256-lr0003-rs42.pth`
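For convenience, these checkpoints can be paired with the model-name strings expected by `load_model` (a minimal sketch; the `'default'` and `'cvt'` names are taken from the usage examples below, while `'resnetish34'` is an assumption following the same pattern):

```python
# Hypothetical mapping from model name to checkpoint path; 'default' and 'cvt'
# come from the usage examples below, 'resnetish34' is assumed.
CHECKPOINTS = {
    'default': 'serab-byols/checkpoints/default2048_BYOLAs64x96-2105311814-e100-bs256-lr0003-rs42.pth',
    'resnetish34': 'serab-byols/checkpoints/resnetish34_BYOLAs64x96-2105271915-e100-bs256-lr0003-rs42.pth',
    'cvt': 'serab-byols/checkpoints/cvt_s1-d1-e64_s2-d1-e256_s3-d1-e512_BYOLAs64x96-osandbyolaloss6373-e100-bs256-lr0003-rs42.pth',
}
```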
Audio embeddings can be computed using one of two methods: 1) `get_scene_embeddings`, or 2) `get_timestamp_embeddings`.

`get_scene_embeddings` accepts a batch of audio clips (a list of torch tensors) and generates a single embedding for each audio clip. This can be computed as shown below:
```python
import torch
import serab_byols

model_name = 'cvt'
checkpoint_path = "serab-byols/checkpoints/cvt_s1-d1-e64_s2-d1-e256_s3-d1-e512_BYOLAs64x96-osandbyolaloss6373-e100-bs256-lr0003-rs42.pth"

# Load model with weights - located in the root directory of this repo
model = serab_byols.load_model(checkpoint_path, model_name)

# Create a batch of 2 white noise clips that are 2-seconds long as a dummy example
# and compute scene embeddings for each clip
audio = torch.rand((2, model.sample_rate * 2))
embeddings = serab_byols.get_scene_embeddings(audio, model)
```
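The returned `embeddings` tensor has one row per input clip; the embedding dimensionality depends on the chosen encoder (2048 for the AudioNTT encoder, as described above).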
The `get_timestamp_embeddings` method works the same way but returns an array of embeddings computed from audio segments taken every 50 ms (configurable) over the duration of the input audio, along with an array of timestamps corresponding to each embedding.
```python
import torch
import serab_byols

model_name = 'default'
checkpoint_path = 'serab-byols/checkpoints/default2048_BYOLAs64x96-2105311814-e100-bs256-lr0003-rs42.pth'

# Load model with weights - located in the root directory of this repo
model = serab_byols.load_model(checkpoint_path, model_name)

# Create a batch of 2 white noise clips that are 2-seconds long as a dummy example
# and compute timestamp embeddings for each clip
frame_duration = 1000  # ms
hop_size = 50  # ms
audio = torch.rand((2, model.sample_rate * 2))
embeddings, timestamps = serab_byols.get_timestamp_embeddings(audio, model, frame_duration, hop_size)
```
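Here, each clip is split into 1000 ms frames taken every 50 ms, and `embeddings` holds one embedding per frame; `timestamps` (presumably in milliseconds, matching the units of `hop_size`) marks the position of each frame within its clip.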
NOTE: All BYOL-S variants were pretrained on audio sampled at 16 kHz. Make sure to resample your dataset to 16 kHz to be compatible with the model's requirements.
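Resampling can be done with torchaudio, for example (a minimal sketch; the file name is a placeholder):

```python
import torchaudio

# Load an audio file at its native sampling rate ('speech.wav' is a placeholder)
waveform, orig_sr = torchaudio.load('speech.wav')

# Resample to the 16 kHz rate expected by the BYOL-S models
if orig_sr != 16000:
    waveform = torchaudio.functional.resample(waveform, orig_freq=orig_sr, new_freq=16000)
```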
If you are using this package, please cite these papers:
```bibtex
@inproceedings{elbanna2022byol,
  title={Byol-s: Learning self-supervised speech representations by bootstrapping},
  author={Elbanna, Gasser and Scheidwasser-Clow, Neil and Kegler, Mikolaj and Beckmann, Pierre and El Hajal, Karl and Cernak, Milos},
  booktitle={HEAR: Holistic Evaluation of Audio Representations},
  pages={25--47},
  year={2022},
  organization={PMLR}
}

@article{elbanna2022hybrid,
  title={Hybrid Handcrafted and Learnable Audio Representation for Analysis of Speech Under Cognitive and Physical Load},
  author={Elbanna, Gasser and Biryukov, Alice and Scheidwasser-Clow, Neil and Orlandic, Lara and Mainar, Pablo and Kegler, Mikolaj and Beckmann, Pierre and Cernak, Milos},
  journal={arXiv preprint arXiv:2203.16637},
  year={2022}
}
```