Chicken egg, or Baum Welch

Vaclav Hanzl edited this page Feb 6, 2023 · 5 revisions
  • An acoustic model (AM) looks at a point in speech and says which phone it is. Once we have one, we can compute the time alignment of phones and speech.
  • An alignment of speech and phones lets us collect example sounds of each phone. With enough examples, we can create the acoustic model.

... -> AM -> alignment -> AM -> alignment -> ...

But what if we have neither? Fake it until we make it! Starting with very bogus alignments, we just go around this loop until things work. Then we have both the chicken and the egg, both nice and shiny.
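The loop can be reduced to a toy that runs in a few lines. This is only a sketch under heavy assumptions, not the project's code: the "utterance" is a list of one-dimensional frames containing exactly two phones, the "acoustic model" is just the mean frame value of each phone, and the "alignment" is the single boundary between them. All names (`train_model`, `align`, `bootstrap`) are hypothetical. It uses hard alignments, so it is closer to Viterbi-style training than to soft reestimation.

```python
def train_model(frames, boundary):
    """Toy 'acoustic model': the mean frame value of each of the two phones."""
    first, second = frames[:boundary], frames[boundary:]
    return sum(first) / len(first), sum(second) / len(second)


def align(frames, model):
    """Toy 'alignment': the boundary with the lowest squared error under the model."""
    mean_a, mean_b = model

    def cost(boundary):
        return (sum((x - mean_a) ** 2 for x in frames[:boundary])
                + sum((x - mean_b) ** 2 for x in frames[boundary:]))

    # Both phones must keep at least one frame.
    return min(range(1, len(frames)), key=cost)


def bootstrap(frames, boundary, max_iterations=20):
    """Run ... -> AM -> alignment -> AM -> ... until the boundary stops moving."""
    history = [boundary]
    for _ in range(max_iterations):
        model = train_model(frames, boundary)
        new_boundary = align(frames, model)
        history.append(new_boundary)
        if new_boundary == boundary:
            break
        boundary = new_boundary
    return history


# Seven frames of one phone followed by three frames of another;
# the true boundary is at frame 7, the bogus starting guess is frame 2.
frames = [0.1, -0.2, 0.0, 0.2, -0.1, 0.1, 0.0, 5.1, 4.9, 5.0]
history = bootstrap(frames, boundary=2)
print(history)
```

Even though the first model is trained on a badly wrong split, it is good enough to move the boundary to the right place, and the next pass confirms it. Real speech needs many more rounds because the models and the alignments are far richer.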

In the HMM-GMM world, this is called Baum-Welch reestimation. We do the same thing with an NN AM. At the "fake it" start, we make a quite bogus alignment in which every phone takes exactly 30 ms and the utterance has exactly the same amount of silence before and after. Of course, most phones are badly displaced, and the first model made from this alignment cannot learn much. But it sort of learns the "sound of silence", so the second alignment already roughly knows where the utterance starts and where it ends. After four iterations, phones are mostly in place. After ten iterations, tiny details like the optional presence of a glottal stop start to settle in a meaningful way. After twenty iterations, there is not much left to improve.

Look at the NN_Train_Align Jupyter notebook for technical details. Look at NN model training for wider context on when and why you might do this.
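The "fake it" starting alignment described above is simple enough to sketch directly. This is a minimal illustration, not the notebook's actual code: the function name `flat_start_alignment`, the `"sil"` label, the millisecond tuples and the error raised for too-short utterances are all assumptions made for the example.

```python
def flat_start_alignment(phones, total_ms, phone_ms=30):
    """Return (label, start_ms, end_ms) segments for a bogus initial alignment.

    Every phone gets exactly `phone_ms`, and the remaining time is split
    equally into silence before and after the phones.
    """
    speech_ms = phone_ms * len(phones)
    if speech_ms > total_ms:
        # Assumption: the source does not say what happens in this case.
        raise ValueError("utterance is shorter than the phones at this duration")

    silence_ms = (total_ms - speech_ms) / 2
    segments = [("sil", 0.0, silence_ms)]
    start = silence_ms
    for phone in phones:
        segments.append((phone, start, start + phone_ms))
        start += phone_ms
    segments.append(("sil", start, float(total_ms)))
    return segments


# A one-second utterance with four phones.
segments = flat_start_alignment(["a", "h", "o", "j"], total_ms=1000)
for label, start, end in segments:
    print(f"{label:>3} {start:7.1f} {end:7.1f}")
```

For a one-second utterance with four phones, the phones occupy only 120 ms in the middle and the rest is labelled as silence. That is why the first model mostly learns silence and very little about the phones themselves.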