Is there a way to make an additional float[] output showing individual phoneme times/lengths from the native DLL? #425
-
I'm over in Unity trying to lip-sync visemes to the audio output. I can see where the function that generates audio from the phonemized text appends the phoneme pieces together, but the final length of the audio clip is not the whole story in terms of phoneme/viseme pacing. Is there a way to add an output array of floats to go with the phoneme sequence of a TTS output? No need to edit the tensor; I just need an additional output with the array after it has been rendered. I imagine it would have to be done at the .cpp level and then rebuilt into the DLL?
-
Yes, this is fully possible, but it will require either (1) using the original PyTorch models or (2) re-exporting the voice models and changing the C++ code. An intermediate product of the model is the length of every phoneme. This can be returned with the audio, but requires the changes above.
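For option (1), here's a minimal sketch of what this could look like against the original PyTorch checkpoint. The import path, the `dataset=None` constructor argument, and the `model_g` attribute are assumptions about piper_train's layout, and `infer()` is assumed to have been patched to also return the duration tensor (see the re-export steps further down):

```python
# Hedged sketch, not Piper's actual API: pulling per-phoneme durations
# out of the original PyTorch checkpoint. Import path, dataset=None, and
# model_g are assumptions about piper_train's layout; infer() is assumed
# to have been patched to also return the duration tensor.
import torch

from piper_train.vits.lightning import VitsModel  # assumed module path

model = VitsModel.load_from_checkpoint("voice.ckpt", dataset=None)
gen = model.model_g  # underlying VITS generator (assumed attribute name)
gen.eval()

phoneme_ids = torch.LongTensor([[1, 5, 12, 7, 2]])  # example id sequence
lengths = torch.LongTensor([phoneme_ids.shape[1]])

with torch.no_grad():
    audio, w_ceil = gen.infer(phoneme_ids, lengths)  # patched return value

print(w_ceil)  # per-phoneme lengths in mel frames, shape [1, 1, num_phonemes]
```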
-
Hey, thanks for the reply. How could I re-export with the additional outputs? I've got some Python experience; kind of a generalist. I've got most of a modified overload header written in the cpp files of a fork of the piper.unity repo I made, so I think I have that part worked out.
-
Is it possible to expose this in the Piper Python API? It would help a lot in OVOS to generate mouth movements for the Mark1 device together with the generated audio file. In OVOS we need a list of phonemes plus the duration of each phoneme; for the most part we just use the original mimic1 TTS to generate these, but this is far from perfect, as they often don't match the actual audio. If we could get Piper to output this info natively, it would be awesome!
-
The `w_ceil` variable has the phoneme lengths: `piper/src/python/piper_train/vits/models.py`, line 703 at `e5cb84c`. Multiplying this tensor by 256 will get you the number of audio samples per phoneme.
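For illustration, a small runnable sketch of that arithmetic; the 256 hop length is the multiplier mentioned above, while the 22050 Hz sample rate is an assumption (it varies by voice):

```python
# Runnable sketch of the duration arithmetic: frames * 256 (hop length)
# gives samples per phoneme; cumulative sums give each phoneme's start/end.
import torch

w_ceil = torch.tensor([[[3.0, 7.0, 5.0, 2.0]]])  # example durations in frames
hop_length = 256
sample_rate = 22050  # assumption; varies by voice

samples = (w_ceil.squeeze() * hop_length).long()  # samples per phoneme
ends = torch.cumsum(samples, dim=0)               # end sample of each phoneme
starts = ends - samples                           # start sample of each phoneme
seconds = samples.float() / sample_rate           # phoneme lengths, the float[]

print(samples.tolist())  # [768, 1792, 1280, 512]
print(seconds.tolist())
```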
That `w_ceil` tensor needs to be returned from the `infer` function and then also returned with the audio here: `piper/src/python/piper_train/export_onnx.py`, line 60 at `e5cb84c`.
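A hedged sketch of what that export change might look like, assuming `infer` has already been patched to return `w_ceil` alongside the audio. The wrapper shape and the `phoneme_durations` output name are illustrative, not the file's exact code, and a single-speaker voice (no `sid` input) is assumed:

```python
# Hedged sketch of a two-output ONNX export; names and dynamic axes are
# assumptions, not Piper's actual export code.
import torch

from piper_train.vits.lightning import VitsModel  # assumed module path

model = VitsModel.load_from_checkpoint("voice.ckpt", dataset=None)
model_g = model.model_g
model_g.eval()

def infer_forward(text, text_lengths, scales):
    audio, w_ceil = model_g.infer(  # patched infer returns both
        text,
        text_lengths,
        noise_scale=scales[0],
        length_scale=scales[1],
        noise_scale_w=scales[2],
    )
    return audio, w_ceil  # two ONNX outputs instead of one

model_g.forward = infer_forward

dummy_text = torch.randint(0, 50, (1, 20), dtype=torch.long)
dummy_lengths = torch.LongTensor([dummy_text.shape[1]])
dummy_scales = torch.FloatTensor([0.667, 1.0, 0.8])

torch.onnx.export(
    model_g,
    (dummy_text, dummy_lengths, dummy_scales),
    "voice.onnx",
    input_names=["input", "input_lengths", "scales"],
    output_names=["output", "phoneme_durations"],  # second output is new
    dynamic_axes={
        "input": {0: "batch", 1: "phonemes"},
        "input_lengths": {0: "batch"},
        "output": {0: "batch", 2: "time"},
        "phoneme_durations": {2: "phonemes"},  # assumed [batch, 1, phonemes]
    },
)
```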
On the C++ side, you then need to pick apart the multiple output tensors (one audio, one phoneme samples): `piper/src/cpp/piper.cpp`, line 386 at `e5cb84c`.
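The same unpacking can be prototyped from Python with onnxruntime before touching the C++, which may also cover the OVOS use case above. The `phoneme_durations` name matches the export sketch and is not an official Piper output name; a single-speaker voice is assumed:

```python
# Hedged sketch: reading both outputs of the re-exported model and turning
# the duration frames into per-phoneme lengths in seconds.
import numpy as np
import onnxruntime

session = onnxruntime.InferenceSession("voice.onnx")

phoneme_ids = np.array([[1, 5, 12, 7, 2]], dtype=np.int64)
inputs = {
    "input": phoneme_ids,
    "input_lengths": np.array([phoneme_ids.shape[1]], dtype=np.int64),
    "scales": np.array([0.667, 1.0, 0.8], dtype=np.float32),
}

audio, durations = session.run(["output", "phoneme_durations"], inputs)

samples = (durations.squeeze() * 256).astype(np.int64)  # frames -> samples
seconds = samples / 22050.0  # assumes a 22.05 kHz voice; varies per model
print(seconds)  # one float per phoneme: the float[] the question asks for
```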