Update guide
This page summarizes what to check when you upgrade to a newer version of NNSVS.
v0.0.3 to v0.1.0
There are many undocumented new features and changes related to #150. Please check the Namine Ritsu recipes and the relevant source code for details. @r9y9 may add documentation later, but this is not guaranteed.
Breaking changes
train_resf0.py is renamed to train_acoustic.py. Please use train_acoustic.py for training acoustic models.
Hed
NNSVS now uses the rest note information in the preprocessing and synthesis stages. In addition, NNSVS assumes that the rest note (or equivalent phoneme) context is the first feature in your hed file. For example, the first feature of a JP hed file should look like:
QS "C-Phone_Muon" {*-sil+*,*-pau+*}
Please make sure the rest note context is at the top of the hed file.
Models
Diffusion-based acoustic models are now supported. Consider using them if quality matters more than speed. #175
We’ve found that multi-stream models generally work better than single-stream models. Please consider using multi-stream models; a rough sketch is given below. See also nnsvs.acoustic_models and How to choose model.
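As a rough sketch only, switching to a multi-stream model is mostly a matter of pointing the acoustic model config at a class from nnsvs.acoustic_models. The class name and arguments below are placeholders, not copied from a recipe; check nnsvs.acoustic_models and the Namine Ritsu recipes for the real options.

netG:
  # NOTE: "SomeMultistreamAcousticModel" is a placeholder, not a real class name.
  _target_: nnsvs.acoustic_models.SomeMultistreamAcousticModel
  # Per-stream output dimensions (e.g. mgc/lf0/vuv/bap); the values are illustrative.
  stream_sizes: [60, 1, 1, 5]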
train.py: data config
New parameter: batch_max_frames: if specified, the batch size is automatically adjusted so that each mini-batch contains at most the specified number of frames. To make efficient use of GPU memory, please set this value. A sketch is shown below.
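A minimal sketch of the setting, assuming batch_max_frames sits at the top level of the data config; the value is illustrative:

# Each mini-batch holds at most this many frames in total;
# the batch size is adjusted automatically to respect this limit.
batch_max_frames: 8000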
train_acoustic.py: data config
New parameter: batch_max_frames: same as for the train.py data config above; if specified, the batch size is automatically adjusted to fit the specified number of frames. To make efficient use of GPU memory, please set this value.
v0.0.2 to v0.0.3
Hed
If your hed file does not contain QS features that specify voiced/unvoiced phonemes, consider adding them so that NNSVS can generate stable V/UV sounds. For example, add the following for a JP hed:
QS "C-VUV_Voiced" {*-a+*,*-i+*,*-u+*,*-e+*,*-o+*,*-v+*,*-b+*,*-by+*,*-m+*,*-my+*,*-w+*,*-z+*,*-j+*,*-d+*,*-dy+*,*-n+*,*-ny+*,*-N+*,*-r+*,*-ry+*,*-g+*,*-gy+*,*-y+*}
QS "C-VUV_Unvoiced" {*-A+*,*-I+*,*-U+*,*-E+*,*-O+*,*-f+*,*-p+*,*-py+*,*-s+*,*-sh+*,*-ts+*,*-ch+*,*-t+*,*-ty+*,*-k+*,*-ky+*,*-h+*,*-hy+*}
Then NNSVS will check the flags to correct V/UV at synthesis time by default.
If your hed file contains the following QS features,
QS "C-Phone_Yuuseion" {*-a+*,*-i+*,*-u+*,*-e+*,*-o+*,*-v+*,*-b+*,*-by+*,*-m+*,*-my+*,*-w+*,*-z+*,*-j+*,*-d+*,*-dy+*,*-n+*,*-ny+*,*-N+*,*-r+*,*-ry+*,*-g+*,*-gy+*,*-y+*}
QS "C-Phone_Museion" {*-A+*,*-I+*,*-U+*,*-E+*,*-O+*,*-f+*,*-p+*,*-py+*,*-s+*,*-sh+*,*-ts+*,*-ch+*,*-t+*,*-ty+*,*-k+*,*-ky+*,*-h+*,*-hy+*}
please rename them to C-VUV_Voiced and C-VUV_Unvoiced, respectively.
Models
Use MDNv2 (MDN + dropout) instead of MDN.
New parameter: all models now accept a new argument init_type that specifies the initialization method for model parameters. Setting init_type to kaiming_normal or xavier_normal may improve convergence a bit for deep networks. Defaults to none. The implementation was taken from junyanz/pytorch-CycleGAN-and-pix2pix.
Deprecated: FeedForwardNet is renamed to FFN.
Deprecated: ResF0Conv1dResnetMDN is deprecated. You can use ResF0Conv1dMDN with use_mdn=True instead.
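For illustration, a sketch of how init_type might be set in a model config. The _target_ path and the dimensions are assumptions based on typical NNSVS model configs, not copied from a recipe:

netG:
  _target_: nnsvs.model.MDNv2   # MDN + dropout; the module path is an assumption
  in_dim: 86                    # illustrative dimensions
  hidden_dim: 256
  out_dim: 199
  init_type: kaiming_normal     # or xavier_normal; defaults to none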
config.yaml
New parameter: trajectory_smoothing specifies whether to apply the trajectory smoothing proposed in Takamichi et al. [TKT+15]. Default is false. It is likely to have little effect unless you use (very experimental) learned post-filters.
New parameter: trajectory_smoothing_cutoff specifies the cutoff frequency for the trajectory smoothing. Default is 50 Hz. This slide is useful for understanding the effect of the cutoff frequency.
Changed: sample_rate became a mandatory parameter; it was previously optional.
New parameter: *_sweeper_args and *_sweeper_n_trials specify configurations for hyperparameter optimization. See Hyperparameter optimization with Optuna for details.
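For reference, a sketch of how the new options might appear in config.yaml; the values are illustrative and sample_rate must match your corpus:

# Sampling rate of the corpus (now mandatory)
sample_rate: 48000
# Trajectory smoothing (Takamichi et al. [TKT+15]); off by default
trajectory_smoothing: false
trajectory_smoothing_cutoff: 50
# The *_sweeper_args and *_sweeper_n_trials keys are step-specific;
# see Hyperparameter optimization with Optuna and the recipes for the exact names.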
Consider adding the following to enable vocoder training:
# NOTE: conf/parallel_wavegan/${vocoder_model}.yaml must exist.
vocoder_model:
# Pretrained checkpoint path for the vocoder model
# NOTE: if you want to try fine-tuning, please specify the path here
pretrained_vocoder_checkpoint:
# absolute/relative path to the checkpoint
# NOTE: the checkpoint is used for synthesis and packing
# This doesn't have any effect on training
vocoder_eval_checkpoint:
Run.sh
Consider adding the model packing stage (stage 99). See Getting started with recipes for details.
Consider adding post-filter related steps. See How to train post-filters for details.
Consider adding vocoder related steps. See How to train neural vocoders with ParallelWaveGAN for details.
train.py: train config
New parameter: use_amp specifies whether to use mixed precision training. Default is false. If your GPU/CUDA setup supports mixed precision training, you can get a performance gain by setting it to true.
New parameter: max_train_steps specifies the maximum number of training steps (not epochs). Default is -1, which means the maximum number of epochs is used to determine when training finishes.
New parameter: feats_criterion specifies whether to use MSE loss or L1 loss. You can now use L1 loss if you want, whereas it was previously hardcoded to MSE loss.
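A sketch of these options, assuming they sit under the train section of the training config; the values and the exact option strings (e.g. "mse", "l1") are illustrative assumptions:

train:
  use_amp: false        # set to true if your GPU/CUDA supports mixed precision
  max_train_steps: -1   # -1: training length is controlled by the number of epochs
  feats_criterion: mse  # or "l1"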
train.py: data config
New parameter: max_time_frames specifies the maximum number of time frames. You can set a non-negative value to limit the number of time frames used when making a mini-batch. It is useful to work around GPU OOM issues.
New parameter: filter_long_segments specifies whether long segments are filtered out. Consider setting it to True when you have GPU OOM issues. Default is False.
New parameter: filter_num_frames specifies the threshold for filtering long segments. Default is 6000, which means segments longer than 30 seconds will not be used for training.
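A sketch of the data config options; the values are illustrative and the exact nesting may differ per recipe:

data:
  max_time_frames: 2000        # illustrative; limits the frames used for a mini-batch
  filter_long_segments: false  # set to true if you hit GPU OOM errors
  filter_num_frames: 6000      # segments longer than this many frames are dropped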
train_resf0.py: train config
New parameter: use_amp specifies whether to use mixed precision training. Default is false. If your GPU/CUDA setup supports mixed precision training, you can get a performance gain by setting it to true.
New parameter: max_train_steps specifies the maximum number of training steps (not epochs). Default is -1, which means the maximum number of epochs is used to determine when training finishes.
New parameter: feats_criterion specifies whether to use MSE loss or L1 loss. You can now use L1 loss if you want, whereas it was previously hardcoded to MSE loss.
New parameter: pitch_reg_decay_size specifies the decay size for the pitch regularization. The larger the decay size, the smoother the pitch transitions allowed between notes during training. See Hono et al. [HHO+21] for details.
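The shared parameters (use_amp, max_train_steps, feats_criterion) can be set exactly as in the train.py sketch above; only the pitch regularization decay size is new here. The value below is illustrative:

train:
  pitch_reg_decay_size: 25  # illustrative; larger values allow smoother pitch transitions between notes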
train_resf0.py: data config
New parameter: max_time_frames specifies the maximum number of time frames. You can set a non-negative value to limit the number of time frames used when making a mini-batch. It is useful to work around GPU OOM issues.