How to choose a model
There are a number of models available in NNSVS. This page describes how to choose a model if you are unsure which to use.
The best models for Namine Ritsu’s database are listed for reference.
Time-lag model
Use nnsvs.model.MDNv2 or nnsvs.model.VariancePredictor (with use_mdn=True). The latter works better, at least for Namine Ritsu’s database.
Note
The best model for Namine Ritsu’s database: nnsvs.model.VariancePredictor
Duration model
Use nnsvs.model.MDNv2 or nnsvs.model.VariancePredictor (with use_mdn=True). The latter works better, at least for Namine Ritsu’s database.
Note
The best model for Namine Ritsu’s database: nnsvs.model.VariancePredictor
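Both the time-lag and duration models above share the same recommendation. Below is a minimal instantiation sketch; the dimensions are placeholders that depend on your feature configuration (see the recipe configs for actual values):

    # Minimal sketch: the recommended time-lag/duration model with an
    # MDN output layer. Dimensions are placeholders, not values from
    # this guide; check your recipe's model config for real ones.
    from nnsvs.model import VariancePredictor

    timelag_model = VariancePredictor(
        in_dim=331,    # placeholder: linguistic feature dimensionality
        out_dim=1,     # placeholder: one time-lag value per note
        use_mdn=True,  # recommended in this guide
    )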
Acoustic model
Use static features only. We found that adding dynamic features brought little benefit (at least for Namine Ritsu’s database).
Use nnsvs.model.Conv1dResnet (with use_mdn=True) if you prefer the old-style MDN-based acoustic model that was used in Namine Ritsu’s V2 model.
If you have a large amount of data, use multi-stream models (e.g., nnsvs.acoustic_models.NPSSMultistreamParametricModel), where every feature stream except the V/UV stream is modeled by an autoregressive decoder. Specifically, use autoregressive models for the MGC, log-F0, and BAP features. Please refer to Namine Ritsu’s recipes for example model configurations. Note that autoregressive models tend to require a larger amount of training data; consider fine-tuning if you have less data.
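For the single-stream option above, here is a minimal sketch of instantiating the MDN-based Conv1dResnet. The dimensions are placeholders, and hidden_dim is an assumed constructor argument that may differ in your NNSVS version:

    # Minimal sketch: old-style MDN-based acoustic model. All
    # dimensions are placeholders; `hidden_dim` is an assumed
    # argument name, so verify against your NNSVS version.
    from nnsvs.model import Conv1dResnet

    acoustic_model = Conv1dResnet(
        in_dim=331,      # placeholder: linguistic feature dimensionality
        hidden_dim=256,  # assumed argument name; placeholder width
        out_dim=199,     # placeholder: static acoustic feature dimensionality
        use_mdn=True,    # old-style MDN output as described above
    )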
Multi-stream models
For the F0 feature stream: use nnsvs.acoustic_models.BiLSTMNonAttentiveDecoder. We found that autoregressive F0 models worked better than non-autoregressive alternatives. There are a number of model parameters, but be aware that reduction_factor has a great impact on modeling capability: smaller values let the model capture finer-grained temporal information at the cost of training instability, whereas larger values restrict it to coarser-grained temporal information. We recommend reduction_factor=4 for the F0 feature stream; you may also try reduction_factor=2.
For the MGC/BAP feature streams: use non-MDN autoregressive models (e.g., nnsvs.acoustic_models.BiLSTMNonAttentiveDecoder) over MDN-based autoregressive models (e.g., nnsvs.acoustic_models.BiLSTMMDNNonAttentiveDecoder). As reported in the Tacotron 2 paper (Shen et al. [SPW+18]), we empirically found that the non-MDN version worked better than the MDN version.
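A sketch of configuring the F0-stream decoder with the recommended reduction_factor follows. Only reduction_factor is discussed in this guide; the other argument names are assumptions that may differ from the actual constructor signature:

    # Sketch: autoregressive F0 decoder. `reduction_factor` controls
    # how many frames are predicted per decoder step; 4 is the
    # recommended starting point, and 2 trades stability for finer
    # temporal detail. Other arguments are assumptions/placeholders.
    from nnsvs.acoustic_models import BiLSTMNonAttentiveDecoder

    f0_decoder = BiLSTMNonAttentiveDecoder(
        in_dim=331,          # assumed name; linguistic feature dim
        out_dim=1,           # assumed name; one log-F0 value per frame
        reduction_factor=4,  # recommended; also try 2
    )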
Note
The best model for Namine Ritsu’s database: nnsvs.acoustic_models.NPSSMultistreamParametricModel
Vocoder
Use WORLD first. WORLD achieves reasonably good synthesis quality with robust pitch control, and it generalizes well to unseen speakers (singers) without any training.
If you want to maximize quality, use uSFGAN.
Note
The best model for Namine Ritsu’s database: uSFGAN
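To illustrate the WORLD option above, here is a minimal synthesis sketch using the pyworld bindings (an assumption; NNSVS performs this step internally). The feature values are dummy placeholders; in practice they come from the acoustic model’s MGC/log-F0/BAP predictions converted back to WORLD-domain features:

    # Minimal sketch of WORLD-based waveform synthesis via pyworld.
    # Feature arrays are dummy placeholders; real ones come from the
    # acoustic model's predictions converted to WORLD-domain features.
    import numpy as np
    import pyworld

    fs = 48000          # placeholder sample rate
    frame_period = 5.0  # placeholder frame shift in milliseconds
    T, fft_size = 200, 2048

    f0 = np.full(T, 220.0)                      # Hz; 0 means unvoiced
    sp = np.full((T, fft_size // 2 + 1), 1e-6)  # spectral envelope
    ap = np.full((T, fft_size // 2 + 1), 0.5)   # aperiodicity in [0, 1]

    wav = pyworld.synthesize(f0, sp, ap, fs, frame_period)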