nnsvs.acoustic_models

Sinsy-based models

ResSkipF0FFConvLSTM

class nnsvs.acoustic_models.ResSkipF0FFConvLSTM(in_dim, ff_hidden_dim=2048, conv_hidden_dim=1024, lstm_hidden_dim=256, out_dim=199, dropout=0.0, num_lstm_layers=2, bidirectional=True, in_lf0_idx=300, in_lf0_min=5.3936276, in_lf0_max=6.491111, out_lf0_idx=180, out_lf0_mean=5.953093881972361, out_lf0_scale=0.23435173188961034, skip_inputs=False, init_type='none', use_mdn=False, num_gaussians=8, dim_wise=False)[source]

FFN + Conv1d + LSTM + residual/skip connections

A model proposed in Hono et al. [HHO+21].

Parameters:
  • in_dim (int) – input dimension

  • ff_hidden_dim (int) – hidden dimension of feed-forward layer

  • conv_hidden_dim (int) – hidden dimension of convolutional layer

  • lstm_hidden_dim (int) – hidden dimension of LSTM layer

  • out_dim (int) – output dimension

  • dropout (float) – dropout rate

  • num_lstm_layers (int) – number of LSTM layers

  • bidirectional (bool) – whether to use bidirectional LSTM or not

  • in_lf0_idx (int) – index of lf0 in input features

  • in_lf0_min (float) – minimum of lf0 in the training data of input features

  • in_lf0_max (float) – maximum of lf0 in the training data of input features

  • out_lf0_idx (int) – index of lf0 in output features

  • out_lf0_mean (float) – mean of lf0 in the training data of output features

  • out_lf0_scale (float) – scale of lf0 in the training data of output features

  • skip_inputs (bool) – whether to use skip connection for the input features

  • init_type (str) – initialization type

  • use_mdn (bool) – whether to use MDN or not

  • num_gaussians (int) – number of gaussians in MDN

  • dim_wise (bool) – whether to use dimension-wise MDN or not
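
Example

A minimal usage sketch with dummy inputs. The dimensions below are illustrative and chosen to be consistent with the default lf0 indices (in_dim > in_lf0_idx=300, out_dim > out_lf0_idx=180); the forward(x, lengths) calling convention and the exact return structure are assumptions based on other nnsvs models, so verify them against your installed version.

    import torch
    from nnsvs.acoustic_models import ResSkipF0FFConvLSTM

    # Illustrative dimensions: in_dim must exceed the default in_lf0_idx (300)
    # and out_dim must exceed the default out_lf0_idx (180).
    model = ResSkipF0FFConvLSTM(in_dim=331, out_dim=199)

    # Dummy frame-level linguistic features: (batch, frames, in_dim)
    x = torch.randn(2, 100, 331)
    lengths = torch.tensor([100, 80])  # sorted in descending order

    # Assumed forward(x, lengths) convention; residual-F0 models may return
    # additional values (e.g., the F0 residual), so treat the output as opaque here.
    outs = model(x, lengths)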

Tacotron-based models

Duration-informed Tacotron (Okamoto et al. [OTSK19]) based models.

NonAttentiveDecoder

class nnsvs.acoustic_models.NonAttentiveDecoder(in_dim=512, out_dim=80, layers=2, hidden_dim=1024, prenet_layers=2, prenet_hidden_dim=256, prenet_dropout=0.5, zoneout=0.1, reduction_factor=1, downsample_by_conv=False, postnet_layers=0, postnet_channels=512, postnet_kernel_size=5, postnet_dropout=0.0, init_type='none', eval_dropout=True, prenet_noise_std=0.0, initial_value=0.0)[source]

Non-attentive autoregressive model based on the duration-informed Tacotron.

Duration-informed Tacotron: Okamoto et al. [OTSK19].

Note

If the target features of the decoder are normalized to N(0, 1), set the initial value carefully so that it roughly matches the value of silence, e.g., -4 to -10. initial_value=0 works okay for large databases, but -4 or lower was found to work better for smaller databases such as nit-song070. See the example below the parameter list.

Parameters:
  • in_dim (int) – Input dimension.

  • out_dim (int) – Output dimension.

  • layers (int) – Number of LSTM layers.

  • hidden_dim (int) – Hidden dimension of LSTM.

  • prenet_layers (int) – Number of prenet layers.

  • prenet_hidden_dim (int) – Hidden dimension of prenet.

  • prenet_dropout (float) – Dropout rate of prenet.

  • zoneout (float) – Zoneout rate.

  • reduction_factor (int) – Reduction factor.

  • downsample_by_conv (bool) – If True, downsampling is performed by convolution.

  • postnet_layers (int) – Number of postnet layers.

  • postnet_channels (int) – Number of postnet channels.

  • postnet_kernel_size (int) – Kernel size of postnet.

  • postnet_dropout (float) – Dropout rate of postnet.

  • init_type (str) – Initialization type.

  • eval_dropout (bool) – If True, dropout is applied in evaluation.

  • initial_value (float) – Initial value for the autoregressive decoder.
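
Example

As the note above suggests, when the decoder targets are standardized, initial_value should roughly match the normalized silence level. A constructor-only sketch with illustrative settings; none of the values below are prescribed by nnsvs beyond the documented keyword arguments.

    from nnsvs.acoustic_models import NonAttentiveDecoder

    # For a small database with N(0, 1)-normalized targets, start the
    # autoregressive loop near the silence level rather than at 0.0.
    decoder = NonAttentiveDecoder(
        in_dim=512,
        out_dim=80,
        reduction_factor=2,   # illustrative; the default is 1
        postnet_layers=5,     # enable the postnet (disabled by default)
        initial_value=-4.0,   # see the note above
    )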

MDNNonAttentiveDecoder

class nnsvs.acoustic_models.MDNNonAttentiveDecoder(in_dim=512, out_dim=80, layers=2, hidden_dim=1024, prenet_layers=2, prenet_hidden_dim=256, prenet_dropout=0.5, zoneout=0.1, reduction_factor=1, downsample_by_conv=False, num_gaussians=8, sampling_mode='mean', init_type='none', eval_dropout=True, prenet_noise_std=0.0, initial_value=0.0)[source]

Non-attentive decoder with MDN

Each decoder step outputs the parameters of an MDN.

Parameters:
  • in_dim (int) – input dimension

  • out_dim (int) – output dimension

  • layers (int) – number of LSTM layers

  • hidden_dim (int) – hidden dimension

  • prenet_layers (int) – number of prenet layers

  • prenet_hidden_dim (int) – prenet hidden dimension

  • prenet_dropout (float) – prenet dropout rate

  • zoneout (float) – zoneout rate

  • reduction_factor (int) – reduction factor

  • downsample_by_conv (bool) – if True, use conv1d to downsample the input

  • num_gaussians (int) – number of Gaussians

  • sampling_mode (str) – sampling mode

  • init_type (str) – initialization type

  • eval_dropout (bool) – if True, use dropout in evaluation

  • initial_value (float) – initial value for the autoregressive decoder.
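
Example

A constructor-only sketch. sampling_mode="mean" presumably selects the mixture mean at inference time; the BiLSTM variants below document "mean" or "random", so treat other values as version-dependent.

    from nnsvs.acoustic_models import MDNNonAttentiveDecoder

    decoder = MDNNonAttentiveDecoder(
        in_dim=512,
        out_dim=80,
        num_gaussians=8,
        sampling_mode="mean",  # deterministic generation from the MDN outputs
    )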

BiLSTMNonAttentiveDecoder

class nnsvs.acoustic_models.BiLSTMNonAttentiveDecoder(in_dim=512, ff_hidden_dim=2048, conv_hidden_dim=1024, lstm_hidden_dim=256, num_lstm_layers=2, out_dim=80, decoder_layers=2, decoder_hidden_dim=1024, prenet_layers=2, prenet_hidden_dim=256, prenet_dropout=0.5, zoneout=0.1, reduction_factor=1, downsample_by_conv=False, postnet_layers=0, postnet_channels=512, postnet_kernel_size=5, postnet_dropout=0.0, in_ph_start_idx: int = 1, in_ph_end_idx: int = 50, embed_dim=None, init_type='none', eval_dropout=True, prenet_noise_std=0.0, initial_value=0.0)[source]

BiLSTM-based encoder + NonAttentiveDecoder

The encoder is based on the architecture of the Sinsy acoustic model.

Parameters:
  • in_dim (int) – Input dimension.

  • ff_hidden_dim (int) – Hidden dimension of feed-forward layers in the encoder.

  • conv_hidden_dim (int) – Hidden dimension of convolution layers in the encoder.

  • lstm_hidden_dim (int) – Hidden dimension of LSTM layers in the encoder.

  • num_lstm_layers (int) – Number of LSTM layers in the encoder.

  • out_dim (int) – Output dimension.

  • decoder_layers (int) – Number of decoder LSTM layers.

  • decoder_hidden_dim (int) – Hidden dimension of the decoder LSTM.

  • prenet_layers (int) – Number of prenet layers.

  • prenet_hidden_dim (int) – Hidden dimension of prenet.

  • prenet_dropout (float) – Dropout rate of prenet.

  • zoneout (float) – Zoneout rate.

  • reduction_factor (int) – Reduction factor.

  • downsample_by_conv (bool) – If True, downsampling is performed by convolution.

  • postnet_layers (int) – Number of postnet layers.

  • postnet_channels (int) – Number of postnet channels.

  • postnet_kernel_size (int) – Kernel size of postnet.

  • postnet_dropout (float) – Dropout rate of postnet.

  • in_ph_start_idx (int) – Start index of phoneme features.

  • in_ph_end_idx (int) – End index of phoneme features.

  • embed_dim (int) – Embedding dimension.

  • init_type (str) – Initialization type.

  • eval_dropout (bool) – If True, dropout is applied in evaluation.

  • initial_value (float) – Initial value for the autoregressive decoder.
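
Example

A constructor-only sketch with illustrative dimensions. in_ph_start_idx and in_ph_end_idx delimit the phoneme-related block of the input features; interpreting embed_dim as the size of a learned phoneme embedding is an assumption beyond the one-line description above.

    from nnsvs.acoustic_models import BiLSTMNonAttentiveDecoder

    model = BiLSTMNonAttentiveDecoder(
        in_dim=331,          # illustrative linguistic-feature dimension
        out_dim=80,          # e.g., mel-spectrogram
        in_ph_start_idx=1,   # start of the phoneme block (default)
        in_ph_end_idx=50,    # end of the phoneme block (default)
        embed_dim=256,       # assumed: learned phoneme embedding size
    )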

BiLSTMMDNNonAttentiveDecoder

class nnsvs.acoustic_models.BiLSTMMDNNonAttentiveDecoder(in_dim=512, ff_hidden_dim=2048, conv_hidden_dim=1024, lstm_hidden_dim=256, num_lstm_layers=2, out_dim=80, decoder_layers=2, decoder_hidden_dim=1024, prenet_layers=2, prenet_hidden_dim=256, prenet_dropout=0.5, zoneout=0.1, reduction_factor=1, downsample_by_conv=False, num_gaussians=8, sampling_mode='mean', in_ph_start_idx: int = 1, in_ph_end_idx: int = 50, embed_dim=None, init_type='none', eval_dropout=True, prenet_noise_std=0, initial_value=0.0)[source]

BiLSTM-based encoder + NonAttentiveDecoder (MDN version)

The encoder is based on the architecture of the Sinsy acoustic model.

Parameters:
  • in_dim (int) – Input dimension.

  • ff_hidden_dim (int) – Hidden dimension of feed-forward layers in the encoder.

  • conv_hidden_dim (int) – Hidden dimension of convolution layers in the encoder.

  • lstm_hidden_dim (int) – Hidden dimension of LSTM layers in the encoder.

  • num_lstm_layers (int) – Number of LSTM layers in the encoder.

  • out_dim (int) – Output dimension.

  • decoder_layers (int) – Number of decoder LSTM layers.

  • decoder_hidden_dim (int) – Hidden dimension of the decoder LSTM.

  • prenet_layers (int) – Number of prenet layers.

  • prenet_hidden_dim (int) – Hidden dimension of prenet.

  • prenet_dropout (float) – Dropout rate of prenet.

  • zoneout (float) – Zoneout rate.

  • reduction_factor (int) – Reduction factor.

  • downsample_by_conv (bool) – If True, downsampling is performed by convolution.

  • num_gaussians (int) – Number of Gaussians.

  • sampling_mode (str) – Sampling mode.

  • in_ph_start_idx (int) – Start index of phoneme features.

  • in_ph_end_idx (int) – End index of phoneme features.

  • embed_dim (int) – Embedding dimension.

  • init_type (str) – Initialization type.

  • eval_dropout (bool) – If True, dropout is applied in evaluation.

  • initial_value (float) – Initial value for the autoregressive decoder.

Tacotron-based F0 models

ResF0NonAttentiveDecoder

class nnsvs.acoustic_models.ResF0NonAttentiveDecoder(in_dim=512, out_dim=1, layers=2, hidden_dim=1024, prenet_layers=2, prenet_hidden_dim=256, prenet_dropout=0.5, zoneout=0.1, reduction_factor=1, downsample_by_conv=False, scaled_tanh=True, in_lf0_idx=300, in_lf0_min=5.3936276, in_lf0_max=6.491111, out_lf0_idx=180, out_lf0_mean=5.953093881972361, out_lf0_scale=0.23435173188961034, init_type='none', eval_dropout=True)[source]

Duration-informed Tacotron with residual F0 prediction.

Parameters:
  • in_dim (int) – dimension of encoder hidden layer

  • out_dim (int) – dimension of output

  • layers (int) – number of LSTM layers

  • hidden_dim (int) – dimension of hidden layer

  • prenet_layers (int) – number of pre-net layers

  • prenet_hidden_dim (int) – dimension of pre-net hidden layer

  • prenet_dropout (float) – dropout rate of pre-net

  • zoneout (float) – zoneout rate

  • reduction_factor (int) – reduction factor

  • downsample_by_conv (bool) – if True, downsample by convolution

  • scaled_tanh (bool) – if True, use scaled tanh for residual F0 prediction

  • in_lf0_idx (int) – index of lf0 in input features

  • in_lf0_min (float) – minimum of lf0 in the training data of input features

  • in_lf0_max (float) – maximum of lf0 in the training data of input features

  • out_lf0_idx (int) – index of lf0 in output features

  • out_lf0_mean (float) – mean of lf0 in the training data of output features

  • out_lf0_scale (float) – scale of lf0 in the training data of output features

  • init_type (str) – initialization type

  • eval_dropout (bool) – if True, use dropout in evaluation
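
Example

The lf0 statistics must match the normalization of your training data; the values below are placeholders (they happen to be the library defaults), not universally valid constants.

    from nnsvs.acoustic_models import ResF0NonAttentiveDecoder

    # Placeholders: compute these statistics from your own training data.
    model = ResF0NonAttentiveDecoder(
        in_dim=331,
        out_dim=1,                        # this decoder predicts log-F0 only
        in_lf0_idx=300,                   # column of lf0 in the input features
        in_lf0_min=5.3936276,             # min of lf0 over the training inputs
        in_lf0_max=6.491111,              # max of lf0 over the training inputs
        out_lf0_idx=0,                    # lf0 is the only output dimension here
        out_lf0_mean=5.953093881972361,   # mean of lf0 over the training outputs
        out_lf0_scale=0.23435173188961034,
    )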

MDNResF0NonAttentiveDecoder

class nnsvs.acoustic_models.MDNResF0NonAttentiveDecoder(in_dim=512, out_dim=80, layers=2, hidden_dim=1024, prenet_layers=2, prenet_hidden_dim=256, prenet_dropout=0.5, zoneout=0.1, reduction_factor=1, downsample_by_conv=False, scaled_tanh=True, num_gaussians=4, sampling_mode='mean', in_lf0_idx=300, in_lf0_min=5.3936276, in_lf0_max=6.491111, out_lf0_idx=180, out_lf0_mean=5.953093881972361, out_lf0_scale=0.23435173188961034, init_type='none', eval_dropout=True)[source]

Duration-informed Tacotron with residual F0 prediction (MDN-version)

Parameters:
  • in_dim (int) – dimension of encoder hidden layer

  • out_dim (int) – dimension of output

  • layers (int) – number of LSTM layers

  • hidden_dim (int) – dimension of hidden layer

  • prenet_layers (int) – number of pre-net layers

  • prenet_hidden_dim (int) – dimension of pre-net hidden layer

  • prenet_dropout (float) – dropout rate of pre-net

  • zoneout (float) – zoneout rate

  • reduction_factor (int) – reduction factor

  • downsample_by_conv (bool) – if True, downsample by convolution

  • scaled_tanh (bool) – if True, use scaled tanh for residual F0 prediction

  • num_gaussians (int) – number of Gaussians

  • sampling_mode (str) – sampling mode

  • in_lf0_idx (int) – index of lf0 in input features

  • in_lf0_min (float) – minimum of lf0 in the training data of input features

  • in_lf0_max (float) – maximum of lf0 in the training data of input features

  • out_lf0_idx (int) – index of lf0 in output features

  • out_lf0_mean (float) – mean of lf0 in the training data of output features

  • out_lf0_scale (float) – scale of lf0 in the training data of output features

  • init_type (str) – initialization type

  • eval_dropout (bool) – if True, use dropout in evaluation

BiLSTMResF0NonAttentiveDecoder

class nnsvs.acoustic_models.BiLSTMResF0NonAttentiveDecoder(in_dim=512, ff_hidden_dim=2048, conv_hidden_dim=1024, lstm_hidden_dim=256, num_lstm_layers=2, dropout=0.0, out_dim=80, decoder_layers=2, decoder_hidden_dim=1024, prenet_layers=2, prenet_hidden_dim=256, prenet_dropout=0.5, zoneout=0.1, reduction_factor=1, downsample_by_conv=False, scaled_tanh=True, in_lf0_idx=300, in_lf0_min=5.3936276, in_lf0_max=6.491111, out_lf0_idx=180, out_lf0_mean=5.953093881972361, out_lf0_scale=0.23435173188961034, use_mdn=False, num_gaussians=4, sampling_mode='mean', in_ph_start_idx: int = 1, in_ph_end_idx: int = 50, embed_dim=None, init_type='none')[source]

BiLSTM-based encoder + duration-informed Tacotron with residual F0 prediction.

Parameters:
  • in_dim (int) – dimension of encoder hidden layer

  • ff_hidden_dim (int) – Hidden dimension of feed-forward layers in the encoder.

  • conv_hidden_dim (int) – Hidden dimension of convolution layers in the encoder.

  • lstm_hidden_dim (int) – Hidden dimension of LSTM layers in the encoder.

  • num_lstm_layers (int) – Number of LSTM layers in the encoder.

  • out_dim (int) – dimension of output

  • decoder_layers (int) – number of decoder LSTM layers

  • decoder_hidden_dim (int) – dimension of the decoder hidden layer

  • prenet_layers (int) – number of pre-net layers

  • prenet_hidden_dim (int) – dimension of pre-net hidden layer

  • prenet_dropout (float) – dropout rate of pre-net

  • zoneout (float) – zoneout rate

  • reduction_factor (int) – reduction factor

  • downsample_by_conv (bool) – if True, downsample by convolution

  • scaled_tanh (bool) – if True, use scaled tanh for residual F0 prediction

  • in_lf0_idx (int) – index of lf0 in input features

  • in_lf0_min (float) – minimum of lf0 in the training data of input features

  • in_lf0_max (float) – maximum of lf0 in the training data of input features

  • out_lf0_idx (int) – index of lf0 in output features

  • out_lf0_mean (float) – mean of lf0 in the training data of output features

  • out_lf0_scale (float) – scale of lf0 in the training data of output features

  • use_mdn (bool) – if True, use mixture density network for F0 prediction

  • num_gaussians (int) – number of gaussians in MDN

  • sampling_mode (str) – sampling mode in inference. “mean” or “random”

  • in_ph_start_idx (int) – Start index of phoneme features.

  • in_ph_end_idx (int) – End index of phoneme features.

  • embed_dim (int) – Embedding dimension.

  • init_type (str) – initialization type

Multi-stream models

MultistreamSeparateF0ParametricModel

class nnsvs.acoustic_models.MultistreamSeparateF0ParametricModel(in_dim: int, out_dim: int, stream_sizes: list, reduction_factor: int, encoder: Module, mgc_model: Module, lf0_model: Module, vuv_model: Module, bap_model: Module, vib_model: Module | None = None, vib_flags_model: Module | None = None, in_rest_idx=1, in_lf0_idx=300, in_lf0_min=5.3936276, in_lf0_max=6.491111, out_lf0_idx=180, out_lf0_mean=5.953093881972361, out_lf0_scale=0.23435173188961034, lf0_teacher_forcing=True)[source]

Multi-stream model with a separate F0 prediction model

acoustic features: [MGC, LF0, VUV, BAP]

vib_model and vib_flags_model are optional and are likely to be removed in the future.

Conditional dependency: p(MGC, LF0, VUV, BAP |C) = p(LF0|C) p(MGC|LF0, C) p(BAP|LF0, C) p(VUV|LF0, C)

Parameters:
  • in_dim (int) – Input dimension.

  • out_dim (int) – Output dimension.

  • stream_sizes (list) – List of stream sizes.

  • reduction_factor (int) – Reduction factor.

  • encoder (nn.Module) – A shared encoder.

  • mgc_model (nn.Module) – MGC prediction model.

  • lf0_model (nn.Module) – log-F0 prediction model.

  • vuv_model (nn.Module) – V/UV prediction model.

  • bap_model (nn.Module) – BAP prediction model.

  • in_rest_idx (int) – Index of the rest symbol in the input features.

  • in_lf0_idx (int) – index of lf0 in input features

  • in_lf0_min (float) – minimum value of lf0 in the training data of input features

  • in_lf0_max (float) – maximum value of lf0 in the training data of input features

  • out_lf0_idx (int) – index of lf0 in output features. Typically 180.

  • out_lf0_mean (float) – mean of lf0 in the training data of output features

  • out_lf0_scale (float) – scale of lf0 in the training data of output features

  • lf0_teacher_forcing (bool) – Whether to use teacher forcing for F0 prediction.
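
Example

The stream_sizes list partitions the stacked acoustic features [MGC, LF0, VUV, BAP] along the feature axis, and its entries should sum to out_dim. A library-agnostic sketch of that bookkeeping with illustrative sizes (the actual sizes depend on your feature extraction settings):

    import torch

    stream_sizes = [60, 1, 1, 5]   # MGC, LF0, VUV, BAP (illustrative sizes)
    out_dim = sum(stream_sizes)    # the multi-stream model's out_dim

    # A dummy batch of stacked acoustic features: (batch, frames, out_dim)
    y = torch.randn(2, 100, out_dim)

    # Recover the individual streams by splitting along the feature axis
    mgc, lf0, vuv, bap = torch.split(y, stream_sizes, dim=-1)
    print(mgc.shape, lf0.shape, vuv.shape, bap.shape)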

NPSSMultistreamParametricModel

class nnsvs.acoustic_models.NPSSMultistreamParametricModel(in_dim: int, out_dim: int, stream_sizes: list, reduction_factor: int, lf0_model: Module, mgc_model: Module, bap_model: Module, vuv_model: Module, in_rest_idx=0, in_lf0_idx=51, in_lf0_min=5.3936276, in_lf0_max=6.491111, out_lf0_idx=60, out_lf0_mean=5.953093881972361, out_lf0_scale=0.23435173188961034, npss_style_conditioning=False, vuv_model_bap_conditioning=True, vuv_model_bap0_conditioning=False, vuv_model_lf0_conditioning=True, vuv_model_mgc_conditioning=False)[source]

NPSS-like cascaded multi-stream model without mixture density networks.

NPSS: Blaauw and Bonada [BB17]

Unlike the original NPSS, we do not use spectral parameters as inputs to the aperiodicity and V/UV prediction models, because (1) D4C does not use spectral parameters as input for aperiodicity estimation, and (2) WORLD performs V/UV detection from the aperiodicity at 0-3 kHz. In addition, the F0 and V/UV models do not use MDNs.

Empirically, we found that this configuration works better than the original one.

Conditional dependency: p(MGC, LF0, VUV, BAP |C) = p(LF0|C) p(MGC|LF0, C) p(BAP|LF0, C) p(VUV|LF0, BAP, C)

Parameters:
  • in_dim (int) – Input dimension.

  • out_dim (int) – Output dimension.

  • stream_sizes (list) – List of stream sizes.

  • lf0_model (BaseModel) – Model for predicting log-F0.

  • mgc_model (BaseModel) – Model for predicting MGC.

  • bap_model (BaseModel) – Model for predicting BAP.

  • vuv_model (BaseModel) – Model for predicting V/UV.

  • in_rest_idx (int) – Index of the rest symbol in the input features.

  • in_lf0_idx (int) – index of lf0 in input features

  • in_lf0_min (float) – minimum value of lf0 in the training data of input features

  • in_lf0_max (float) – maximum value of lf0 in the training data of input features

  • out_lf0_idx (int) – index of lf0 in output features. Typically 180.

  • out_lf0_mean (float) – mean of lf0 in the training data of output features

  • out_lf0_scale (float) – scale of lf0 in the training data of output features

  • vuv_model_bap_conditioning (bool) – If True, use BAP features for V/UV prediction.

  • vuv_model_bap0_conditioning (bool) – If True, use only 0-th coef. of BAP for V/UV prediction.

  • vuv_model_lf0_conditioning (bool) – If True, use log-F0 features for V/UV prediction.

  • vuv_model_mgc_conditioning (bool) – If True, use MGC features for V/UV prediction.

NPSSMDNMultistreamParametricModel

class nnsvs.acoustic_models.NPSSMDNMultistreamParametricModel(in_dim: int, out_dim: int, stream_sizes: list, reduction_factor: int, lf0_model: Module, mgc_model: Module, bap_model: Module, vuv_model: Module, in_rest_idx=0, in_lf0_idx=51, in_lf0_min=5.3936276, in_lf0_max=6.491111, out_lf0_idx=60, out_lf0_mean=5.953093881972361, out_lf0_scale=0.23435173188961034, vuv_model_bap_conditioning=True, vuv_model_bap0_conditioning=False, vuv_model_lf0_conditioning=True, vuv_model_mgc_conditioning=False)[source]

NPSS-like cascaded multi-stream parametric model with mixture density networks.

Note

This class was originally designed to be used with MDNs. However, the internal design was changed to make it work with non-MDN and diffusion models. For example, you can use non-MDN models for MGC prediction.

NPSS: Blaauw and Bonada [BB17]

acoustic features: [MGC, LF0, VUV, BAP]

Conditional dependency: p(MGC, LF0, VUV, BAP |C) = p(LF0|C) p(MGC|LF0, C) p(BAP|LF0, C) p(VUV|LF0, BAP, C)

Parameters:
  • in_dim (int) – Input dimension.

  • out_dim (int) – Output dimension.

  • stream_sizes (list) – List of stream sizes.

  • lf0_model (BaseModel) – Model for predicting log-F0.

  • mgc_model (BaseModel) – Model for predicting MGC.

  • bap_model (BaseModel) – Model for predicting BAP.

  • vuv_model (BaseModel) – Model for predicting V/UV.

  • in_rest_idx (int) – Index of the rest symbol in the input features.

  • in_lf0_idx (int) – index of lf0 in input features

  • in_lf0_min (float) – minimum value of lf0 in the training data of input features

  • in_lf0_max (float) – maximum value of lf0 in the training data of input features

  • out_lf0_idx (int) – index of lf0 in output features. Typically 180.

  • out_lf0_mean (float) – mean of lf0 in the training data of output features

  • out_lf0_scale (float) – scale of lf0 in the training data of output features

  • vuv_model_bap_conditioning (bool) – If True, use BAP features for V/UV prediction.

  • vuv_model_bap0_conditioning (bool) – If True, use only 0-th coef. of BAP for V/UV prediction.

  • vuv_model_lf0_conditioning (bool) – If True, use log-F0 features for V/UV prediction.

  • vuv_model_mgc_conditioning (bool) – If True, use MGC features for V/UV prediction.

MultistreamSeparateF0MelModel

class nnsvs.acoustic_models.MultistreamSeparateF0MelModel(in_dim: int, out_dim: int, stream_sizes: list, reduction_factor: int, encoder: Module, mel_model: Module, lf0_model: Module, vuv_model: Module, in_rest_idx=1, in_lf0_idx=300, in_lf0_min=5.3936276, in_lf0_max=6.491111, out_lf0_idx=180, out_lf0_mean=5.953093881972361, out_lf0_scale=0.23435173188961034)[source]

Multi-stream model with a separate F0 prediction model (mel-version)

Conditional dependency: p(MEL, LF0, VUV|C) = p(LF0|C) p(MEL|LF0, C) p(VUV|LF0, C)

Parameters:
  • in_dim (int) – Input dimension.

  • out_dim (int) – Output dimension.

  • stream_sizes (list) – List of stream sizes.

  • reduction_factor (int) – Reduction factor.

  • encoder (nn.Module) – A shared encoder.

  • mel_model (nn.Module) – MEL prediction model.

  • lf0_model (nn.Module) – log-F0 prediction model.

  • vuv_model (nn.Module) – V/UV prediction model.

  • in_rest_idx (int) – Index of the rest symbol in the input features.

  • in_lf0_idx (int) – index of lf0 in input features

  • in_lf0_min (float) – minimum value of lf0 in the training data of input features

  • in_lf0_max (float) – maximum value of lf0 in the training data of input features

  • out_lf0_idx (int) – index of lf0 in output features. Typically 180.

  • out_lf0_mean (float) – mean of lf0 in the training data of output features

  • out_lf0_scale (float) – scale of lf0 in the training data of output features

MDNMultistreamSeparateF0MelModel

class nnsvs.acoustic_models.MDNMultistreamSeparateF0MelModel(in_dim: int, out_dim: int, stream_sizes: list, reduction_factor: int, mel_model: Module, lf0_model: Module, vuv_model: Module, in_rest_idx=0, in_lf0_idx=51, in_lf0_min=5.3936276, in_lf0_max=6.491111, out_lf0_idx=60, out_lf0_mean=5.953093881972361, out_lf0_scale=0.23435173188961034, vuv_model_lf0_conditioning=True, vuv_model_mel_conditioning=True)[source]

Multi-stream model with a separate F0 model (mel-version) and MDN

V/UV prediction is performed given a mel-spectrogram.

Conditional dependency: p(MEL, LF0, VUV|C) = p(LF0|C) p(MEL|LF0, C) p(VUV|LF0, MEL, C)

Note

This class was originally designed to be used with MDNs. However, the internal design was changed to make it work with non-MDN and diffusion models. For example, you can use non-MDN models for mel prediction.

Parameters:
  • in_dim (int) – Input dimension.

  • out_dim (int) – Output dimension.

  • stream_sizes (list) – List of stream sizes.

  • reduction_factor (int) – Reduction factor.

  • mel_model (nn.Module) – MEL prediction model.

  • lf0_model (nn.Module) – log-F0 prediction model.

  • vuv_model (nn.Module) – V/UV prediction model.

  • in_rest_idx (int) – Index of the rest symbol in the input features.

  • in_lf0_idx (int) – index of lf0 in input features

  • in_lf0_min (float) – minimum value of lf0 in the training data of input features

  • in_lf0_max (float) – maximum value of lf0 in the training data of input features

  • out_lf0_idx (int) – index of lf0 in output features. Typically 180.

  • out_lf0_mean (float) – mean of lf0 in the training data of output features

  • out_lf0_scale (float) – scale of lf0 in the training data of output features

  • vuv_model_lf0_conditioning (bool) – If True, use log-F0 features for V/UV prediction.

  • vuv_model_mel_conditioning (bool) – If True, use mel features for V/UV prediction.

Other single-stream models

ResF0Conv1dResnet

class nnsvs.acoustic_models.ResF0Conv1dResnet(in_dim, hidden_dim, out_dim, num_layers=4, in_lf0_idx=300, in_lf0_min=5.3936276, in_lf0_max=6.491111, out_lf0_idx=180, out_lf0_mean=5.953093881972361, out_lf0_scale=0.23435173188961034, init_type='none', use_mdn=False, num_gaussians=8, dim_wise=False)[source]

Conv1d + Resnet + Residual F0 prediction

Residual F0 prediction is inspired by Hono et al. [HHO+21].

Parameters:
  • in_dim (int) – input dimension

  • hidden_dim (int) – hidden dimension

  • out_dim (int) – output dimension

  • num_layers (int) – number of layers

  • in_lf0_idx (int) – index of lf0 in input features

  • in_lf0_min (float) – minimum value of lf0 in the training data of input features

  • in_lf0_max (float) – maximum value of lf0 in the training data of input features

  • out_lf0_idx (int) – index of lf0 in output features. Typically 180.

  • out_lf0_mean (float) – mean of lf0 in the training data of output features

  • out_lf0_scale (float) – scale of lf0 in the training data of output features

  • init_type (str) – initialization type

  • use_mdn (bool) – whether to use MDN or not

  • num_gaussians (int) – number of gaussians in MDN

  • dim_wise (bool) – whether to use dimension-wise MDN or not
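
Example

A constructor-only sketch. With use_mdn=True the model predicts the parameters of a mixture density network instead of point estimates, so the training loss and inference code must expect MDN parameters; this reading of the flag follows general MDN usage and is not spelled out above.

    from nnsvs.acoustic_models import ResF0Conv1dResnet

    # Illustrative dimensions consistent with the default lf0 indices
    # (in_dim > 300, out_dim > 180).
    model = ResF0Conv1dResnet(
        in_dim=331,
        hidden_dim=256,
        out_dim=199,
        num_layers=4,
        use_mdn=True,      # predict MDN parameters rather than point estimates
        num_gaussians=4,
        dim_wise=True,     # dimension-wise MDN (see the parameter list above)
    )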

ResF0VariancePredictor

class nnsvs.acoustic_models.ResF0VariancePredictor(in_dim, out_dim, num_layers=5, hidden_dim=256, kernel_size=5, dropout=0.5, init_type='none', use_mdn=False, num_gaussians=1, dim_wise=False, in_lf0_idx=300, in_lf0_min=5.3936276, in_lf0_max=6.491111, out_lf0_idx=180, out_lf0_mean=5.953093881972361, out_lf0_scale=0.23435173188961034)[source]

Variance predictor in Ren et al. [RHQ+21] with residual F0 prediction

Parameters:
  • in_dim (int) – the input dimension

  • out_dim (int) – the output dimension

  • num_layers (int) – the number of layers

  • hidden_dim (int) – the hidden dimension

  • kernel_size (int) – the kernel size

  • dropout (float) – the dropout rate

  • in_lf0_idx (int) – the index of the input LF0

  • in_lf0_min (float) – the minimum value of the input LF0

  • in_lf0_max (float) – the maximum value of the input LF0

  • out_lf0_idx (int) – the index of the output LF0

  • out_lf0_mean (float) – the mean value of the output LF0

  • out_lf0_scale (float) – the scale value of the output LF0

  • init_type (str) – the initialization type

  • use_mdn (bool) – whether to use MDN or not

  • num_gaussians (int) – the number of gaussians

  • dim_wise (bool) – whether to use dimension-wise MDN or not