Update guide

This page summarizes what you need to know when upgrading to a newer version of NNSVS.

v0.0.3 to v0.1.0

There are many undocumented new features and changes related to #150. Please check the Namine Ritsu recipes and the relevant source code for details. @r9y9 may add documentation later, but it is not guaranteed.

Breaking changes

  • train_resf0.py has been renamed to train_acoustic.py. Please use train_acoustic.py for training acoustic models.

Hed

NNSVS now uses rest note information in the preprocessing and synthesis stages. In addition, NNSVS assumes that the rest note (or equivalent phoneme) context is the first feature in your hed file. For example, the first feature of a JP hed file should look like:

QS "C-Phone_Muon"     {*-sil+*,*-pau+*}

Please make sure the rest note context is at the top of your hed file.

Models

  • Diffusion-based acoustic models are now supported. Consider using them if quality matters more than speed. #175

  • We’ve found that multi-stream models generally work better than single-stream models. Please consider using multi-stream models. See also nnsvs.acoustic_models and How to choose model.
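
For example, a model config could select a multi-stream model as in the following minimal sketch. It assumes Hydra-style model configs, and the class name nnsvs.acoustic_models.MultistreamSeparateF0ParametricModel and its arguments are assumptions here; copy a working configuration from the Namine Ritsu recipes rather than from this sketch:

# Hypothetical model config sketch; the class name and arguments are
# illustrative and should be taken from a working recipe.
netG:
  _target_: nnsvs.acoustic_models.MultistreamSeparateF0ParametricModel
  # stream-specific settings (e.g. stream sizes) go here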

train.py: data config

  • New parameter: batch_max_frames. If specified, the batch size will be automatically adjusted to fit the specified number of frames. To make efficient use of GPU memory, please set this value (a config sketch is shown after the next section).

train_acoustic.py: data config

  • New parameter: batch_max_frames. If specified, the batch size will be automatically adjusted to fit the specified number of frames. To make efficient use of GPU memory, please set this value.
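
For example, the data config for either script could look like the following minimal sketch. The value 8192 is an illustrative placeholder to tune to your GPU memory, and the nesting under a data: key is an assumption based on the section names above:

data:
  # Automatically adjust the batch size so that each mini-batch
  # contains at most this many frames (illustrative value)
  batch_max_frames: 8192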

v0.0.2 to v0.0.3

Hed

If your hed file does not contain QS features that specify voiced/unvoiced phonemes, consider adding them so that NNSVS can generate stable V/UV sounds. For example, add the following to a JP hed:

QS "C-VUV_Voiced" {*-a+*,*-i+*,*-u+*,*-e+*,*-o+*,*-v+*,*-b+*,*-by+*,*-m+*,*-my+*,*-w+*,*-z+*,*-j+*,*-d+*,*-dy+*,*-n+*,*-ny+*,*-N+*,*-r+*,*-ry+*,*-g+*,*-gy+*,*-y+*}
QS "C-VUV_Unvoiced"  {*-A+*,*-I+*,*-U+*,*-E+*,*-O+*,*-f+*,*-p+*,*-py+*,*-s+*,*-sh+*,*-ts+*,*-ch+*,*-t+*,*-ty+*,*-k+*,*-ky+*,*-h+*,*-hy+*}

NNSVS will then check these flags to correct V/UV at synthesis time by default.

If your hed file contains the following QS features,

QS "C-Phone_Yuuseion" {*-a+*,*-i+*,*-u+*,*-e+*,*-o+*,*-v+*,*-b+*,*-by+*,*-m+*,*-my+*,*-w+*,*-z+*,*-j+*,*-d+*,*-dy+*,*-n+*,*-ny+*,*-N+*,*-r+*,*-ry+*,*-g+*,*-gy+*,*-y+*}
QS "C-Phone_Museion"  {*-A+*,*-I+*,*-U+*,*-E+*,*-O+*,*-f+*,*-p+*,*-py+*,*-s+*,*-sh+*,*-ts+*,*-ch+*,*-t+*,*-ty+*,*-k+*,*-ky+*,*-h+*,*-hy+*}

please rename them to C-VUV_Voiced and C-VUV_Unvoiced.

Models

  • Use MDNv2 (MDN + dropout) instead of MDN.

  • New parameter: All models now accept a new argument init_type that specifies the initialization method for model parameters. Setting init_type to kaiming_normal or xavier_normal may slightly improve convergence for deep networks. Defaults to none. The implementation was taken from junyanz/pytorch-CycleGAN-and-pix2pix. See the config sketch after this list.

  • Deprecated: FeedForwardNet has been renamed to FFN.

  • Deprecated: ResF0Conv1dResnetMDN is deprecated. Use ResF0Conv1dResnet with use_mdn=True instead.
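
For example, init_type can be passed through a Hydra-style model config as in the following minimal sketch. The dimensions are illustrative placeholders that must match your features, and nnsvs.model.FFN as the target path is an assumption:

netG:
  _target_: nnsvs.model.FFN
  in_dim: 331                 # illustrative; must match your input features
  hidden_dim: 512             # illustrative
  out_dim: 199                # illustrative; must match your output features
  init_type: kaiming_normal   # or xavier_normal; defaults to none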

config.yaml

  • New parameter: trajectory_smoothing specifies whether to apply the trajectory smoothing proposed in Takamichi et al. [TKT+15]. Default is false. It is likely to have little effect unless you use (very experimental) learned post-filters.

  • New parameter: trajectory_smoothing_cutoff specifies the cutoff frequency for trajectory smoothing. Default is 50 Hz. This slide is useful for understanding the effect of the cutoff frequency.

  • Changed: sample_rate is now a mandatory parameter; it was previously optional.

  • New parameter: *_sweeper_args and *_sweeper_n_trials specify configurations for hyperparameter optimization. See Hyperparameter optimization with Optuna for details.
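
For example, the relevant part of config.yaml could look like the following minimal sketch; 48000 is an illustrative sample rate:

sample_rate: 48000               # now mandatory
trajectory_smoothing: false      # default
trajectory_smoothing_cutoff: 50  # in Hz; default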

Consider adding the following to enable vocoder training:

# NOTE: conf/parallel_wavegan/${vocoder_model}.yaml must exist.
vocoder_model:
# Pretrained checkpoint path for the vocoder model
# NOTE: if you want to try fine-tuning, please specify the path here
pretrained_vocoder_checkpoint:
# absolute/relative path to the checkpoint
# NOTE: the checkpoint is used for synthesis and packing
# This doesn't have any effect on training
vocoder_eval_checkpoint:

run.sh

train.py: train config

  • New parameter: use_amp specifies whether to use mixed precision training. Default is false. If your GPU/CUDA setup supports mixed precision training, setting it to true can give a performance gain.

  • New parameter: max_train_steps specifies the maximum number of training steps (not epochs). Default is -1, which means the maximum number of epochs is used to determine when training is finished.

  • New parameter: feats_criterion specifies whether to use MSE loss or L1 loss. You can now use L1 loss if you want; previously MSE loss was hardcoded.
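
For example, a train config using these parameters could look like the following minimal sketch. The nesting under a train: key and the string values for feats_criterion are assumptions, so check the source for the accepted values:

train:
  use_amp: true            # mixed precision training (default: false)
  max_train_steps: 200000  # illustrative; -1 (default) falls back to epochs
  feats_criterion: l1      # assumed value; MSE was the previous behavior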

train.py: data config

  • New parameter: max_time_frames specifies the maximum number of time frames. You can set a non-negative value to limit the number of time frames used when making a mini-batch. This can be useful to work around GPU OOM issues.

  • New parameter: filter_long_segments specifies whether long segments are filtered out. Consider setting it to true when you have GPU OOM issues. Default is false.

  • New parameter: filter_num_frames specifies the threshold for filtering long segments. Default is 6000 frames, which (at a 5 ms frame shift) means segments longer than 30 sec will not be used for training.
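
For example, a data config using these parameters could look like the following minimal sketch; values other than the documented defaults are illustrative, and the nesting under a data: key is an assumption:

data:
  max_time_frames: 2000       # illustrative cap per mini-batch
  filter_long_segments: true  # default: false
  filter_num_frames: 6000     # default; about 30 sec at a 5 ms frame shift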

train_resf0.py: train config

  • New parameter: use_amp specifies whether to use mixed precision training. Default is false. If your GPU/CUDA setup supports mixed precision training, setting it to true can give a performance gain.

  • New parameter: max_train_steps specifies the maximum number of training steps (not epochs). Default is -1, which means the maximum number of epochs is used to determine when training is finished.

  • New parameter: feats_criterion specifies whether to use MSE loss or L1 loss. You can now use L1 loss if you want; previously MSE loss was hardcoded.

  • New parameter: pitch_reg_decay_size specifies the decay size for the pitch regularization. The larger the decay size, the smoother the pitch transitions allowed between notes during training. See Hono et al. [HHO+21] for details.
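
For example, pitch_reg_decay_size could be set as in the following minimal sketch; the value 25 is an illustrative placeholder rather than a recommended setting, and the nesting under a train: key is an assumption:

train:
  pitch_reg_decay_size: 25  # larger values allow smoother note transitions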

train_resf0.py: data config

  • New parameter: max_time_frames specifies the maximum number of time frames. You can set a non-negative value to limit the number of time frames used when making a mini-batch. This can be useful to work around GPU OOM issues.