
Feature request: Record clips into a single voice file for Piper #7

Bugsbane opened this issue Aug 28, 2023 · 2 comments

@Bugsbane

Right now, Piper Recording Studio seems to give you texts to read and record, but there doesn't seem to be any simple path for then turning those recordings into a voice that Piper can actually use. I read the Piper doc on how to train a new voice and... well, it pretty much made my eyes bleed.

I have some actor friends who could do some really fun voices for Piper, like demon or pixie voices.

I would like to be able to open up Piper Recording Studio, click "New voice", have them read out a bunch of texts and be given a file that I can edit, share or upload to Piper (in my Home Assistant) to use as a new voice.

Currently all samples seem to be uploaded to some mysterious repo somewhere for purposes unknown. I'm happy with the CC0 licensing, but if samples are being used to train TTS then I really don't want to upload a bunch of semi/non-human sounding voice clips there!

Please allow recording a session of audio clips which are then wrapped into a single voice file (eg archive) that Piper/HA can upload and use.

Thank you!


aaronnewsome commented Nov 29, 2023

Hello Bugsbane. I'm not an expert on this topic, but I thought I'd respond to some of your questions. Hopefully you'll find my answers useful.

From what I've been able to figure out after a few weeks of trial and error, there are a few steps to training your own voice for use with Piper:

  1. Record your voice samples
  2. Export the voice samples to ljspeech format
  3. Pre-process the ljspeech data
  4. Train the voice model using piper-train
  5. Export the training to onnx runtime

Piper Recording Studio only handles step 1, and I believe step 2 as well via its command line.
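As I understand it, the LJSpeech layout that step 2 targets is just a `wavs/` directory of audio clips plus a pipe-delimited `metadata.csv` (LJSpeech proper uses `id|transcript|normalized_transcript`; a two-column `id|text` variant is also common). A minimal sketch, with made-up clip IDs and sentences:

```python
import csv
from pathlib import Path

# Build a minimal LJSpeech-style dataset layout:
#   my_dataset/
#     wavs/0001.wav, 0002.wav, ...   (audio clips, not created here)
#     metadata.csv                    ("id|transcript" per line)
root = Path("my_dataset")
(root / "wavs").mkdir(parents=True, exist_ok=True)

clips = [
    ("0001", "The quick brown fox jumps over the lazy dog."),
    ("0002", "Piper reads this sentence aloud."),
]

with open(root / "metadata.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="|")
    for clip_id, text in clips:
        # A real export would also write wavs/<clip_id>.wav next to this.
        writer.writerow([clip_id, text])

print((root / "metadata.csv").read_text(encoding="utf-8"))
```

The preprocessing step then reads this layout and turns it into the training inputs that piper-train expects.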

I too found the instructions for training your own voice really complicated, with important steps missing. The YouTube video at the top of the training guide was definitely helpful, but even it left some things out.

The most important bit of info is that none of this will build correctly if you don't have the right Python version, and even the YouTube video never mentions it. Tucked away in the video's comments is a note that you need Python 3.9-3.11. I can tell you from experience that this is wrong: some of the modules just don't build on 3.11 (or at least they didn't for me). I settled on Python 3.10 and got everything built. I completely missed the note because I watch YouTube videos on my TV, which doesn't show comments.
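A small check like this at the top of a setup script would catch the interpreter mismatch early. The accepted range below reflects my experience (3.9-3.10), not an official requirement:

```python
import sys

def version_ok(version_info=sys.version_info):
    """True if the interpreter is in the range that built cleanly for me (3.9-3.10).

    3.11 failed to build some of piper-train's dependencies in my case;
    adjust the bounds if your mileage differs.
    """
    return (3, 9) <= tuple(version_info[:2]) <= (3, 10)

# Example checks against hypothetical interpreter versions:
print(version_ok((3, 10, 4)))   # True: 3.10 built everything for me
print(version_ok((3, 11, 0)))   # False: some modules failed to build on 3.11
```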

The other bit of information I'd consider important is benchmarking: anyone who endeavors to do this will be curious about how long it will take. The YouTube video glosses over this too.

Hold on to your hat, because it's going to take a very long time. If you don't have a really fast GPU with a lot of VRAM, don't bother trying to train on your local system. Training on a fast system using CPU only is also not worth attempting. If you want your training done quickly, use a cloud system built for this purpose.

I have one system with a supported GPU: 6GB of VRAM and 4864 graphics cores (the host is an 8-core/16-thread i7 with 64GB of RAM). It's not a real powerhouse compared to the huge gamer GPUs available these days. With this GPU I got the following performance:

  • Training with 100 voice samples, I was able to resume from the ryan-High checkpoint to 10,000 epochs in about 28 hours
  • Training with 500 voice samples, I was able to resume from the ryan-High checkpoint to 10,000 epochs in about 4-5 days
  • I'm currently training with 600 voice samples from scratch and getting around 900 epochs per day. At this rate, it'll take about 11 days to reach 10,000 epochs
  • For CPU-only training, I'm using a 24-core dual-Xeon Dell R720 with 376GB of RAM and getting ~400 epochs per day with 500 voice samples. Training to 10,000 epochs would take so long that it's really not worth it
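The arithmetic behind those estimates is simple; the throughput figures below are the ones I observed above:

```python
def days_to_target(target_epochs, epochs_per_day, start_epoch=0):
    """Estimate days to reach a target epoch count at an observed throughput."""
    return (target_epochs - start_epoch) / epochs_per_day

# Throughput observed above: ~900 epochs/day on my mid-range GPU,
# ~400 epochs/day CPU-only on the dual-Xeon box.
print(round(days_to_target(10_000, 900), 1))   # ~11.1 days on the GPU
print(round(days_to_target(10_000, 400), 1))   # 25.0 days CPU-only
```

Resuming from an existing checkpoint shortens this, since `start_epoch` is already well above zero.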

I've been doing spot checks on my currently running 600-sample training and the results aren't great. It's clear I probably need to record the full 1,150 voice samples suggested by piper-recording-studio. Actually, I feel like a high-quality voice needs even more, maybe closer to 2,000 samples. It's all trial and error.

Either way, I'll keep plugging away, as I'm learning a lot along the way. Hope my experience was helpful.

@aaronnewsome

I'll add one more comment: at least with the Docker config, your voice samples aren't sent to some mysterious repo. They're sent to the local directory you specify when you start the container.
