How to train neural vocoders with ParallelWaveGAN

Please check Getting started with recipes and Overview of NNSVS’s SVS first.

NNSVS v0.0.3 and later supports neural vocoders. This page summarizes how to train neural vocoders.

Warning

This section needs to be rewritten to reflect the recent major updates. It will be updated soon.

Note

The contents of this page are based on recipes/conf/spsvs/run_common_steps_dev.sh. Also, before making your custom recipes, it is recommended to start with the test recipe recipes/nit-song070/dev-test.

Pre-requisites

What is NSF?

NSF is an abbreviation of a neural source-filter model for speech synthesis (Wang et al. [WTY19]).

Input/output of a neural vocoder

  • Input: acoustic features containing only static features

  • Output: waveform

The current implementation of NNSVS does not use dynamic features (i.e., delta and delta-delta features) for neural vocoders. If you enabled dynamic features, you need to extract static features before training neural vocoders.
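For intuition, here is a minimal NumPy sketch of what keeping only static features means, with a hypothetical stream layout where each stream is stored as [static, delta, delta-delta] (the actual layout and dimensions depend on your recipe configuration):

import numpy as np

# Hypothetical example: a 60-dim mgc stream stored as [static, delta, delta-delta].
num_frames = 100
mgc_dim = 60
mgc_with_dynamics = np.random.randn(num_frames, mgc_dim * 3).astype(np.float32)

# Keep only the leading (static) block. Stage 9 below performs the equivalent
# operation for all streams (mgc, lf0, vuv, bap) when preparing vocoder features.
static_mgc = mgc_with_dynamics[:, :mgc_dim]
print(static_mgc.shape)  # (100, 60)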

Install a fork of parallel_wavegan

NNSVS’s neural vocoder integration is done by the external r9y9/parallel_wavegan repository, which provides various GAN-based neural vocoders. To enable neural vocoder support for NNSVS, you need to install the nnsvs branch of the parallel_wavegan repository in advance:

pip install git+https://github.com/r9y9/ParallelWaveGAN@nnsvs

The nnsvs branch includes code for NSF. After installation, please make sure that you can access the HnSincNSF class. The following code should throw no errors if the installation is properly done.

Jupyter:

In [1]: from parallel_wavegan.models.nsf import HnSincNSF
In [2]: HnSincNSF(1,1)

Command-line:

python -c "from parallel_wavegan.models.nsf import HnSincNSF; print(HnSincNSF(1,1))"

Vocoder settings in config.yaml

The following settings are related to neural vocoders:

# NOTE: conf/parallel_wavegan/${vocoder_model}.yaml must exist.
vocoder_model: hn-sinc-nsf_sr48k_pwgD_test
# Pretrained checkpoint path for the vocoder model
# NOTE: if you want to try fine-tuning, please specify the path here
pretrained_vocoder_checkpoint:
# absolute/relative path to the checkpoint
# NOTE: the checkpoint is used for synthesis and packing
# This doesn't have any effect on training
vocoder_eval_checkpoint:

You can manually edit them or set them from the command line, e.g., --vocoder-model hn-sinc-nsf_sr48k_pwgD_test.

Stage 9: Prepare features for neural vocoders

The pre-processing for neural vocoders (i.e., extracting static features from acoustic features) is implemented as stage 9.

CUDA_VISIBLE_DEVICES=0 ./run.sh --stage 9 --stop-stage 9

After running the above command, you will see the following output:

Converting dump/yoko/norm/out_acoustic_scaler.joblib mean/scale npy files
[mean] dim: (206,) -> (67,)
[scale] dim: (206,) -> (67,)
[var] dim: (206,) -> (67,)

If you are going to train NSF-based vocoders, please set the following parameters:

out_lf0_mean: 5.9012025218118325
out_lf0_scale: 0.2378365181913869

NOTE: If you are using the same data for training acoustic/vocoder models, the F0 statistics
for those models should be the same. If you are using different data for training
acoustic/vocoder models (e.g., training a vocoder model on a multiple DBs),
you will likely need to set different F0 statistics for acoustic/vocoder models.
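Conceptually, the conversion shown above takes the statistics of the static dimensions from the acoustic feature scaler and saves them as plain NumPy arrays. A rough sketch, assuming the scaler is a scikit-learn StandardScaler and that static_indices is a hypothetical index array selecting the static dimensions:

import joblib
import numpy as np

# Load the acoustic feature scaler produced by the earlier preprocessing stages.
scaler = joblib.load("dump/yoko/norm/out_acoustic_scaler.joblib")

# Hypothetical indices of the static dimensions; the real ones are derived from
# the recipe's stream configuration (mgc/lf0/vuv/bap sizes and dynamic features).
static_indices = np.arange(67)

np.save("dump/yoko/norm/in_vocoder_scaler_mean.npy", scaler.mean_[static_indices])
np.save("dump/yoko/norm/in_vocoder_scaler_scale.npy", scaler.scale_[static_indices])
np.save("dump/yoko/norm/in_vocoder_scaler_var.npy", scaler.var_[static_indices])

Under this assumed layout, the printed out_lf0_mean / out_lf0_scale values would correspond to the lf0 entry of these statistics.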

After the pre-processing is properly done, you can find all the necessary features for training neural vocoders:

tree -L 3 dump/yoko/

dump/yoko/
├── norm
│   ├── dev
│   │   ├── in_acoustic
│   │   ├── in_duration
│   │   ├── in_timelag
│   │   ├── in_vocoder
│   │   ├── out_acoustic
│   │   ├── out_duration
│   │   ├── out_postfilter
│   │   └── out_timelag
│   ├── eval
│   │   ├── in_acoustic
│   │   ├── in_duration
│   │   ├── in_timelag
│   │   ├── in_vocoder
│   │   ├── out_acoustic
│   │   ├── out_duration
│   │   ├── out_postfilter
│   │   └── out_timelag
│   ├── in_acoustic_scaler.joblib
│   ├── in_duration_scaler.joblib
│   ├── in_timelag_scaler.joblib
│   ├── in_vocoder_scaler_mean.npy
│   ├── in_vocoder_scaler_scale.npy
│   ├── in_vocoder_scaler_var.npy
│   ├── out_acoustic_scaler.joblib
│   ├── out_duration_scaler.joblib
│   ├── out_postfilter_scaler.joblib
│   ├── out_timelag_scaler.joblib
│   └── train_no_dev
│       ├── in_acoustic
│       ├── in_duration
│       ├── in_timelag
│       ├── in_vocoder
│       ├── out_acoustic
│       ├── out_duration
│       ├── out_postfilter
│       └── out_timelag

Some notes:

  • The dump/${spk}/norm/*/in_vocoder directories contain features for neural vocoders. Note that each directory contains both the input and output features: *-feats.npy files contain static features consisting of mgc, lf0, vuv, and bap, while *-wave.npy files contain raw waveforms.

  • The dump/${spk}/norm/in_vocoder_scaler_*.npy files contain statistics used to normalize/de-normalize the input features for neural vocoders (a quick sanity check using them is sketched below).
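As a quick sanity check, you can load one of the normalized input feature files and de-normalize it with the saved statistics. A minimal sketch, assuming standard mean/variance normalization (the file picked below is just the first match in the dev set):

from glob import glob

import numpy as np

dump_dir = "dump/yoko/norm"

# Pick an arbitrary normalized vocoder input feature file from the dev set.
feat_file = sorted(glob(f"{dump_dir}/dev/in_vocoder/*-feats.npy"))[0]
feats = np.load(feat_file)

mean = np.load(f"{dump_dir}/in_vocoder_scaler_mean.npy")
scale = np.load(f"{dump_dir}/in_vocoder_scaler_scale.npy")

# De-normalize: the features were standardized as (x - mean) / scale.
denorm_feats = feats * scale + mean
print(feats.shape, denorm_feats.shape)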

Stage 10: Training a vocoder using parallel_wavegan

Warning

You must configure vocoder configs according to the sampling rate of the waveform and your feature extraction settings. It is strongly recommended to go through the vocoder config before training your model. Vocoder configs for 24kHz and 48kHz are available in the NNSVS repository, but they can be extended to other sampling rates (e.g., 16kHz).

Once the pre-processing is done, you can train a neural vocoder by:

CUDA_VISIBLE_DEVICES=0 ./run.sh --stage 10 --stop-stage 10 \
    --vocoder-model hn-sinc-nsf_sr48k_pwgD_test

You can find the available model configs in the conf/parallel_wavegan directory, or you can create your own model config. Please make sure to set the out_lf0_mean and out_lf0_scale parameters correctly (a small consistency check is sketched after the listing below).

$ tree conf/parallel_wavegan
conf/parallel_wavegan
├── hn-sinc-nsf_sr24k_pwgD.yaml
├── hn-sinc-nsf_sr48k_hifiganD.yaml
├── hn-sinc-nsf_sr48k_pwgD.yaml
└── hn-sinc-nsf_sr48k_pwgD_test.yaml
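If you want to double-check that the F0 statistics in your vocoder config match the values printed at stage 9, a small sketch along the following lines can help. It assumes out_lf0_mean and out_lf0_scale are stored as top-level keys in the YAML config; adjust the lookup if your config nests them differently:

import yaml

config_path = "conf/parallel_wavegan/hn-sinc-nsf_sr48k_pwgD_test.yaml"
with open(config_path) as f:
    config = yaml.safe_load(f)

# Values printed at stage 9 (copy them from your own log; these are from the example above).
expected = {"out_lf0_mean": 5.9012025218118325, "out_lf0_scale": 0.2378365181913869}

for key, value in expected.items():
    print(f"{key}: config={config.get(key)} expected={value}")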

Training progress can be monitored with TensorBoard. During training, you can check generated waveforms in the exp/${speaker name}/${vocoder config name}/predictions directory.

Stage 11: Synthesize waveforms with parallel_wavegan

Stage 11 generates waveforms using the trained neural vocoder. Please make sure to specify your model type explicitly.

CUDA_VISIBLE_DEVICES=0 ./run.sh --stage 11 --stop-stage 11 \
    --vocoder-model hn-sinc-nsf_sr48k_pwgD_test

Generated wav files can be found in the exp/${speaker name}/${vocoder config name}/wav directory. To generate waveforms from a specific checkpoint, specify the checkpoint path by --vocoder-eval-checkpoint /path/to/checkpoint.
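If you want to quickly confirm the sampling rate and duration of the generated files, one option is the soundfile package (not required by NNSVS itself; the directory below matches the example config name used in this page):

from glob import glob

import soundfile as sf

wav_dir = "exp/yoko/hn-sinc-nsf_sr48k_pwgD_test/wav"
for path in sorted(glob(f"{wav_dir}/*.wav"))[:3]:
    wav, sr = sf.read(path)
    print(path, f"sr={sr} Hz, duration={len(wav) / sr:.2f} sec")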

Packing models with neural vocoder

To package all the models together, you can run the following command:

CUDA_VISIBLE_DEVICES=0 ./run.sh --stage 99 --stop-stage 99 \
    --timelag-model timelag_test \
    --duration-model duration_test \
    --acoustic-model acoustic_test \
    --vocoder-model hn-sinc-nsf_sr48k_pwgD_test

Please make sure to add --vocoder-model ${vocoder config name} to package the trained vocoder as well. You can also specify the explicit path of the trained model checkpoint by --vocoder-eval-checkpoint /path/to/checkpoint.

How to use the packed model with the trained vocoder?

Please specify vocoder_type="pwg" with the nnsvs.svs module. An example:

import pysinsy
from nnmnkwii.io import hts
from nnsvs.svs import SPSVS
from nnsvs.util import example_xml_file

model_dir = "/path/to/your/packed/model_dir"
engine = SPSVS(model_dir)

contexts = pysinsy.extract_fullcontext(example_xml_file(key="get_over"))
labels = hts.HTSLabelFile.create_from_contexts(contexts)

wav, sr = engine.svs(labels, vocoder_type="pwg")
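To save the generated waveform to disk, any WAV writer will do; for example, with the soundfile package (not required by NNSVS itself, just one convenient option):

import soundfile as sf

# wav is a NumPy array and sr is the sampling rate returned by engine.svs above.
sf.write("generated.wav", wav, sr)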

Available neural vocoders

In addition to NSF, any model implemented in parallel_wavegan can be used with NNSVS: for example, Parallel WaveGAN, HiFi-GAN, and MelGAN. However, to get the best performance in singing voice synthesis, I’d recommend using the HnSincNSF model (Wang and Yamagishi [WY19]), which is an advanced version of the original NSF (Wang et al. [WTY19]).

How to train universal vocoders?

It is possible to build a universal vocoder that generalizes well to unseen speakers’ data by training a vocoder on a large amount of mixed databases.

Training on mixed singing databases

Suppose you have multiple singing databases to train a neural vocoder on. The steps to train a universal vocoder are as follows:

  • Run NNSVS’s pre-processing for each database and combine the extracted features (a rough sketch of the combining step is shown below)

  • Run vocoder training

That’s it. Please check the recipes in recipes/mixed for examples.
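How exactly to combine the pre-processed features depends on your setup, but the idea is simply to gather the in_vocoder features (and waveforms) from each database’s dump directory into one training set. A rough sketch with hypothetical directory names (the recipes in recipes/mixed handle this for you):

import shutil
from pathlib import Path

# Hypothetical per-database dump directories produced by each recipe's preprocessing.
source_dirs = [
    Path("recipes/db_a/dump/spk_a/norm"),
    Path("recipes/db_b/dump/spk_b/norm"),
]
combined_dir = Path("dump/mixed/norm/train_no_dev/in_vocoder")
combined_dir.mkdir(parents=True, exist_ok=True)

for src in source_dirs:
    for npy_file in (src / "train_no_dev" / "in_vocoder").glob("*.npy"):
        # Prefix with the speaker/database name to avoid file-name collisions.
        shutil.copy(npy_file, combined_dir / f"{src.parent.name}_{npy_file.name}")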

Training on mixed speech and singing databases

This should be easy to implement, but I (r9y9) have not done it yet. I may add code and docs for this in the future.