Text cleaner from https://github.com/CjangCjengh/vits
Original repo: https://github.com/jaywalnut310/vits
See vits-finetuning
Python 3.7 is recommended.
Only Japanese datasets can be used for fine-tuning in this repo.
git clone https://github.com/SayaSS/vits-finetuning.git
pip install -r requirements.txt
- G_0.pth
- D_0.pth
- Edit "model_dir" (line 152) in utils.py
- Put the pre-trained models in "model_dir"/checkpoints
- Speaker IDs must be between 0 and 803.
- About 50 audio-text pairs are enough, and 100-600 epochs of training can give fairly good results, though more data may help.
- Resample all audio to 22050 Hz, 16-bit, mono WAV files.
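Before training, it can help to confirm that every file really matches the audio format listed above. The following is a minimal sketch using only Python's standard `wave` module; the function name `wav_format_problems` is hypothetical and not part of this repo, and the script only checks files, it does not resample them.

```python
import wave

# Target format taken from the requirement above: 22050 Hz, 16-bit, mono.
EXPECTED_RATE = 22050
EXPECTED_SAMPLE_WIDTH = 2  # bytes per sample, i.e. 16-bit
EXPECTED_CHANNELS = 1


def wav_format_problems(path):
    """Return a list of format mismatches for one WAV file (empty if it conforms).

    Hypothetical helper for illustration; it is not part of this repo.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
        channels = wav.getnchannels()
    if rate != EXPECTED_RATE:
        problems.append(f"sample rate is {rate} Hz, expected {EXPECTED_RATE} Hz")
    if width != EXPECTED_SAMPLE_WIDTH:
        problems.append(f"sample width is {width * 8}-bit, expected 16-bit")
    if channels != EXPECTED_CHANNELS:
        problems.append(f"{channels} channels, expected mono")
    return problems
```

Calling `wav_format_problems("dataset/001.wav")` on each file in the dataset and printing any non-empty result should reveal files that still need resampling.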
path/to/XXX.wav|speaker id|transcript
- Example
dataset/001.wav|10|こんにちは。
For complete examples, please see filelists/miyu_train.txt and filelists/miyu_val.txt.
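A malformed filelist line is easy to miss by eye, so a quick check of the `path|speaker id|transcript` format may be useful. The sketch below is an illustration only: `parse_filelist_line` is a hypothetical helper, not part of this repo, and it assumes each line has exactly three `|`-separated fields and that the speaker ID range is the 0-803 stated above.

```python
MAX_SPEAKER_ID = 803  # upper bound stated in this README


def parse_filelist_line(line):
    """Parse one 'path|speaker id|transcript' line into (path, speaker_id, text).

    Hypothetical helper for illustration. Raises ValueError on malformed input.
    """
    parts = line.rstrip("\r\n").split("|")
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields separated by '|', got {len(parts)}")
    path, speaker, text = parts
    try:
        speaker_id = int(speaker)
    except ValueError:
        raise ValueError(f"speaker id is not an integer: {speaker!r}")
    if not 0 <= speaker_id <= MAX_SPEAKER_ID:
        raise ValueError(f"speaker id {speaker_id} is outside 0-{MAX_SPEAKER_ID}")
    if not path or not text:
        raise ValueError("path and transcript must not be empty")
    return path, speaker_id, text


# The example line from this README parses cleanly.
print(parse_filelist_line("dataset/001.wav|10|こんにちは。"))
```

Running the function over every line of a filelist (opened with `encoding="utf-8"`) should flag missing fields or out-of-range speaker IDs before preprocessing.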
python preprocess.py --filelists path/to/filelist_train.txt path/to/filelist_val.txt
Edit "training_files" and "validation_files" in configs/config.json
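The config edit can also be scripted rather than done by hand. This is a rough sketch, assuming the two keys live under a top-level "data" object as in the upstream VITS configs; check your own configs/config.json first, since that layout is an assumption here. The function `set_filelists` is hypothetical and not part of this repo.

```python
import json


def set_filelists(config_path, training_files, validation_files):
    """Rewrite the two filelist paths in a VITS-style JSON config.

    Hypothetical helper for illustration. Assumes the keys sit under a
    top-level "data" object; raises KeyError if that assumption is wrong.
    """
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)

    data = config["data"]  # assumption: upstream VITS config layout
    data["training_files"] = training_files
    data["validation_files"] = validation_files

    # Write the config back, keeping non-ASCII characters readable.
    with open(config_path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2, ensure_ascii=False)
        f.write("\n")
    return config
```

For example, `set_filelists("configs/config.json", train_path, val_path)` would update both entries in place, where `train_path` and `val_path` are the filelists you intend to train on. Keep a backup of the config, since the file is rewritten with new indentation.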
cd monotonic_align
python setup.py build_ext --inplace
cd ..
# Multiple speakers
python train_ms.py -c configs/config.json -m checkpoints