
Feature request: Record clips into a single voice file for Piper #7

Bugsbane opened this issue Aug 28, 2023 · 2 comments

@Bugsbane

Right now, Piper Recording Studio seems to give you texts to read and record, but there doesn't seem to be any simple path for then turning those recordings into a voice that Piper can actually use. I read the Piper doc on how to train a new voice and... well, it pretty much made my eyes bleed.

I have some actor friends who could do some really fun voices for Piper, like demon or pixie voices.

I would like to be able to open up Piper Recording Studio, click "New voice", have them read out a bunch of texts and be given a file that I can edit, share or upload to Piper (in my Home Assistant) to use as a new voice.

Currently all samples seem to be uploaded to some mysterious repo somewhere for purposes unknown. I'm happy with the CC0 licensing, but if samples are being used to train TTS then I really don't want to upload a bunch of semi/non-human sounding voice clips there!

Please allow recording a session of audio clips which are then wrapped into a single voice file (eg archive) that Piper/HA can upload and use.

Thank you!


aaronnewsome commented Nov 29, 2023

Hello Bugsbane. I'm not an expert on this topic, but I thought I'd respond to some of your questions. Hopefully you'll find my answers useful.

From what I've been able to figure out after a few weeks of trial and error, there are a few steps to training your own voice for use with Piper:

  1. Record your voice samples
  2. Export the voice samples to ljspeech format
  3. Pre-process the ljspeech data
  4. Train the voice model using piper-train
  5. Export the training to onnx runtime

Piper Recording Studio only handles step 1, and I believe step 2 as well via its command line.
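As I understand it, the LJSpeech layout that step 2 targets is just a `wavs/` directory of audio clips plus a pipe-delimited `metadata.csv` (LJSpeech proper uses `id|transcript|normalized_transcript`; a two-column `id|text` variant is also common). A minimal sketch, with made-up clip IDs and sentences:

```python
import csv
from pathlib import Path

# Build a minimal LJSpeech-style dataset layout:
#   my_dataset/
#     wavs/0001.wav, 0002.wav, ...   (audio clips, not created here)
#     metadata.csv                    ("id|transcript" per line)
root = Path("my_dataset")
(root / "wavs").mkdir(parents=True, exist_ok=True)

clips = [
    ("0001", "The quick brown fox jumps over the lazy dog."),
    ("0002", "Piper reads this sentence aloud."),
]

with open(root / "metadata.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="|")
    for clip_id, text in clips:
        # A real export would also write wavs/<clip_id>.wav next to this.
        writer.writerow([clip_id, text])

print((root / "metadata.csv").read_text(encoding="utf-8"))
```

The preprocessing step then reads this layout and turns it into the training inputs that piper-train expects.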

I too found the instructions for training your own voice really complicated, with important steps missing. The YouTube video at the top of the training guide was definitely helpful, but even it left some things out.

The most important bit of info is that none of this will build correctly if you don't have the right Python version, and even the YouTube video never mentions it. Tucked away in the video's comments is a note that you need Python 3.9-3.11. I can tell you from experience that this is wrong: some of the modules just don't build on 3.11 (or at least they didn't for me). I settled on Python 3.10 and got everything built. I completely missed the note because I watch YouTube videos on my TV, which doesn't show comments.
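A small check like this at the top of a setup script would catch the interpreter mismatch early. The accepted range below reflects my experience (3.9-3.10), not an official requirement:

```python
import sys

def version_ok(version_info=sys.version_info):
    """True if the interpreter is in the range that built cleanly for me (3.9-3.10).

    3.11 failed to build some of piper-train's dependencies in my case;
    adjust the bounds if your mileage differs.
    """
    return (3, 9) <= tuple(version_info[:2]) <= (3, 10)

# Example checks against hypothetical interpreter versions:
print(version_ok((3, 10, 4)))   # True: 3.10 built everything for me
print(version_ok((3, 11, 0)))   # False: some modules failed to build on 3.11
```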

The other bit of information I'd consider important is benchmarking: anyone who endeavors to do this will be curious about how long it will take. The YouTube video glosses over this too.

Hold on to your hat, because it's going to take a very long time. If you don't have a really fast GPU with a lot of VRAM, don't bother trying to train on your local system. Training on a fast system using CPU only is also not worth attempting. If you want your training done quickly, use a cloud system built for this purpose.

I have one system with a supported GPU: 6GB of VRAM and 4864 graphics cores (the host is an 8-core/16-thread i7 with 64GB of RAM). It's not a real powerhouse compared to the huge gamer GPUs available these days. With this GPU I got the following performance:

  • Training with 100 voice samples, I was able to resume from the ryan-High checkpoint to 10,000 epochs in about 28 hours
  • Training with 500 voice samples, I was able to resume from the ryan-High checkpoint to 10,000 epochs in about 4-5 days
  • I'm currently training with 600 voice samples from scratch and getting around 900 epochs per day. At this rate, it'll take about 11 days to reach 10,000 epochs
  • For CPU-only training, I'm using a 24-core dual-Xeon Dell R720 with 376GB of RAM and getting ~400 epochs per day with 500 voice samples. Training to 10,000 epochs would take so long that it's really not worth it
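The arithmetic behind those estimates is simple; the throughput figures below are the ones I observed above:

```python
def days_to_target(target_epochs, epochs_per_day, start_epoch=0):
    """Estimate days to reach a target epoch count at an observed throughput."""
    return (target_epochs - start_epoch) / epochs_per_day

# Throughput observed above: ~900 epochs/day on my mid-range GPU,
# ~400 epochs/day CPU-only on the dual-Xeon box.
print(round(days_to_target(10_000, 900), 1))   # ~11.1 days on the GPU
print(round(days_to_target(10_000, 400), 1))   # 25.0 days CPU-only
```

Resuming from an existing checkpoint shortens this, since `start_epoch` is already well above zero.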

I've been doing spot checks on my currently running 600-sample training and the results aren't great. It's clear I probably need to record the full 1,150 voice samples suggested by piper-recording-studio. Actually, I feel like a high-quality voice needs even more, maybe closer to 2,000 samples. It's all trial and error.

Either way, I'll keep plugging away, as I'm learning a lot along the way. Hope my experience was helpful.

@aaronnewsome

I'll add one more comment: at least with the Docker config, your voice samples aren't sent to some mysterious repo. They're sent to the local directory you specify when you start the container.
