Python code to replicate the experiments described in "Supervised Acoustic Embeddings And Their Transferability Across Languages" (ICNLSP 2022).
The notebooks include everything needed to train and evaluate the supervised and self-supervised acoustic word embedding (AWE) models. You must provide the input data yourself, pre-processed by segmenting the speech into spoken words and extracting features. We used the LibriSpeech and Multilingual LibriSpeech datasets for English, French, German, and Spanish. We extracted all features with the S3PRL toolkit and obtained word boundaries with the Montreal Forced Aligner.
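As an illustration of the word segmentation step, the sketch below slices an utterance-level feature matrix into one segment per word. It is not the code used in the paper: the function name `segment_words`, the `(word, start_sec, end_sec)` tuple layout, and the 20 ms default frame shift are all assumptions made for this example. It also assumes the aligner output has already been parsed into such tuples and the features are already loaded as a NumPy array.

```python
import numpy as np


def segment_words(features, boundaries, frame_shift=0.02):
    """Slice frame-level features into per-word segments.

    features:    array of shape (num_frames, feat_dim) for one utterance
    boundaries:  iterable of (word, start_sec, end_sec) tuples (hypothetical layout)
    frame_shift: seconds between consecutive frames (assumed; depends on the extractor)
    """
    num_frames = features.shape[0]
    segments = []
    for word, start, end in boundaries:
        # Convert times in seconds to the nearest frame index.
        first = int(round(start / frame_shift))
        last = int(round(end / frame_shift))
        # Clamp both indices to the valid frame range.
        first = max(0, min(first, num_frames))
        last = max(first, min(last, num_frames))
        # Skip words that cover no frames at all.
        if last > first:
            segments.append((word, features[first:last]))
    return segments


if __name__ == "__main__":
    # Toy utterance: 50 frames of 2-dimensional features (1 second at 20 ms per frame).
    feats = np.arange(100, dtype=np.float32).reshape(50, 2)
    words = [("hello", 0.10, 0.30), ("world", 0.50, 0.90)]
    for word, seg in segment_words(feats, words):
        print(word, seg.shape)
```

The frame shift has to match the feature extractor actually used, so treat the default as a placeholder and check it against your own features before segmenting.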
Refer to the paper for more details.