nnsvs.acoustic_models

Sinsy-based models

ResSkipF0FFConvLSTM

class nnsvs.acoustic_models.ResSkipF0FFConvLSTM(in_dim, ff_hidden_dim=2048, conv_hidden_dim=1024, lstm_hidden_dim=256, out_dim=199, dropout=0.0, num_lstm_layers=2, bidirectional=True, in_lf0_idx=300, in_lf0_min=5.3936276, in_lf0_max=6.491111, out_lf0_idx=180, out_lf0_mean=5.953093881972361, out_lf0_scale=0.23435173188961034, skip_inputs=False, init_type='none', use_mdn=False, num_gaussians=8, dim_wise=False)[source]

FFN + Conv1d + LSTM + residual/skip connections

A model proposed in Hono et al. [HHO+21].

Parameters:

in_dim (int) – input dimension
ff_hidden_dim (int) – hidden dimension of feed-forward layer
conv_hidden_dim (int) – hidden dimension of convolutional layer
lstm_hidden_dim (int) – hidden dimension of LSTM layer
out_dim (int) – output dimension
dropout (float) – dropout rate
num_lstm_layers (int) – number of layers of LSTM
bidirectional (bool) – whether to use bidirectional LSTM or not
in_lf0_idx (int) – index of lf0 in input features
in_lf0_min (float) – minimum of lf0 in the training data of input features
in_lf0_max (float) – maximum of lf0 in the training data of input features
out_lf0_idx (int) – index of lf0 in output features
out_lf0_mean (float) – mean of lf0 in the training data of output features
out_lf0_scale (float) – scale of lf0 in the training data of output features
skip_inputs (bool) – whether to use skip connection for the input features
init_type (str) – initialization type
use_mdn (bool) – whether to use MDN or not
num_gaussians (int) – number of gaussians in MDN
dim_wise (bool) – whether to use MDN with dim-wise or not
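
The lf0-related parameters describe two different normalizations: statistics for the input (min and max) and statistics for the output (mean and scale). The sketch below shows how a residual prediction could be moved between the two scales. It assumes the input lf0 is min-max scaled to [0, 1] and the output lf0 is mean-variance normalized; the helper name, both scaling conventions, and the sample residual are illustrative assumptions, not the library's code.

```python
# Defaults copied from the class signature above.
IN_LF0_MIN = 5.3936276
IN_LF0_MAX = 6.491111
OUT_LF0_MEAN = 5.953093881972361
OUT_LF0_SCALE = 0.23435173188961034


def residual_lf0(score_lf0_scaled, residual):
    """Combine a note-level lf0 with a predicted residual (hypothetical helper).

    Assumes the input lf0 was min-max scaled to [0, 1] and that the output
    lf0 is normalized as (value - mean) / scale.
    """
    # Undo the assumed min-max scaling of the input feature.
    score_lf0 = score_lf0_scaled * (IN_LF0_MAX - IN_LF0_MIN) + IN_LF0_MIN
    # Add the residual in the log-F0 domain.
    pred_lf0 = score_lf0 + residual
    # Re-normalize with the output statistics.
    return (pred_lf0 - OUT_LF0_MEAN) / OUT_LF0_SCALE


# A note halfway through the input range with a small upward deviation.
print(residual_lf0(0.5, 0.01))
```

With a zero residual the model output is simply the note pitch expressed in the output normalization, so the network only has to learn deviations from the score.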

Tacotron-based models

Duration-informed Tacotron (Okamoto et al. [OTSK19]) based models.

NonAttentiveDecoder

class nnsvs.acoustic_models.NonAttentiveDecoder(in_dim=512, out_dim=80, layers=2, hidden_dim=1024, prenet_layers=2, prenet_hidden_dim=256, prenet_dropout=0.5, zoneout=0.1, reduction_factor=1, downsample_by_conv=False, postnet_layers=0, postnet_channels=512, postnet_kernel_size=5, postnet_dropout=0.0, init_type='none', eval_dropout=True, prenet_noise_std=0.0, initial_value=0.0)[source]

Non-attentive autoregressive model based on the duration-informed Tacotron

Duration-informed Tacotron: Okamoto et al. [OTSK19].

Note

If the target features of the decoder are normalized to N(0, 1), consider setting the initial value carefully so that it roughly matches the value of silence, e.g., -4 to -10. initial_value=0 works okay for large databases, but I found that -4 or lower worked better for smaller databases such as nit-song070.
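
To see why a negative value can be a better starting frame, consider where silence lands after normalization. The sketch below uses made-up log-mel statistics (all three numbers are hypothetical) and only illustrates the arithmetic.

```python
def normalized_silence(silence_value, mean, std):
    """Return where a raw silence value falls after N(0, 1) normalization."""
    return (silence_value - mean) / std


# Hypothetical statistics for one log-mel bin; real values depend on the
# database and the feature extraction settings.
mean, std = -5.0, 2.0
silence = -14.0  # assumed floor value for silent frames

initial_value = normalized_silence(silence, mean, std)
print(initial_value)  # -4.5
```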

Parameters:

in_dim (int) – Input dimension.
out_dim (int) – Output dimension.
layers (int) – Number of LSTM layers.
hidden_dim (int) – Hidden dimension of LSTM.
prenet_layers (int) – Number of prenet layers.
prenet_hidden_dim (int) – Hidden dimension of prenet.
prenet_dropout (float) – Dropout rate of prenet.
zoneout (float) – Zoneout rate.
reduction_factor (int) – Reduction factor.
downsample_by_conv (bool) – If True, downsampling is performed by convolution.
postnet_layers (int) – Number of postnet layers.
postnet_channels (int) – Number of postnet channels.
postnet_kernel_size (int) – Kernel size of postnet.
postnet_dropout (float) – Dropout rate of postnet.
init_type (str) – Initialization type.
eval_dropout (bool) – If True, dropout is applied in evaluation.
initial_value (float) – initial value for the autoregressive decoder.

MDNNonAttentiveDecoder

class nnsvs.acoustic_models.MDNNonAttentiveDecoder(in_dim=512, out_dim=80, layers=2, hidden_dim=1024, prenet_layers=2, prenet_hidden_dim=256, prenet_dropout=0.5, zoneout=0.1, reduction_factor=1, downsample_by_conv=False, num_gaussians=8, sampling_mode='mean', init_type='none', eval_dropout=True, prenet_noise_std=0.0, initial_value=0.0)[source]

Non-attentive decoder with MDN

Each decoder step outputs the parameters of MDN.

Parameters:

in_dim (int) – input dimension
out_dim (int) – output dimension
layers (int) – number of LSTM layers
hidden_dim (int) – hidden dimension
prenet_layers (int) – number of prenet layers
prenet_hidden_dim (int) – prenet hidden dimension
prenet_dropout (float) – prenet dropout rate
zoneout (float) – zoneout rate
reduction_factor (int) – reduction factor
downsample_by_conv (bool) – if True, use conv1d to downsample the input
num_gaussians (int) – number of Gaussians
sampling_mode (str) – sampling mode
init_type (str) – initialization type
eval_dropout (bool) – if True, use dropout in evaluation
initial_value (float) – initial value for the autoregressive decoder.
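
A mixture density network predicts mixture weights, means, and standard deviations of several Gaussians instead of a single value. The sketch below shows one plausible reading of sampling_mode for a single scalar output: "mean" returns the mean of the most probable component, and "random" draws a sample. The helper and its exact behavior are assumptions for illustration, not the library's implementation.

```python
import math
import random


def sample_from_mdn(log_pi, mu, sigma, mode="mean", rng=None):
    """Pick one value from a 1-D Gaussian mixture (hypothetical helper).

    log_pi: unnormalized log mixture weights, one per Gaussian
    mu, sigma: mean and standard deviation of each Gaussian
    """
    # Softmax over the log weights to obtain mixture probabilities.
    m = max(log_pi)
    weights = [math.exp(v - m) for v in log_pi]
    total = sum(weights)
    pi = [w / total for w in weights]

    if mode == "mean":
        # Deterministic: mean of the most probable component.
        k = max(range(len(pi)), key=lambda i: pi[i])
        return mu[k]
    if mode == "random":
        # Stochastic: choose a component, then sample from it.
        rng = rng or random.Random()
        k = rng.choices(range(len(pi)), weights=pi)[0]
        return rng.gauss(mu[k], sigma[k])
    raise ValueError(f"unknown mode: {mode}")


print(sample_from_mdn([0.1, 2.0, -1.0], [0.0, 1.5, 3.0], [0.1, 0.2, 0.3]))  # 1.5
```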

BiLSTMNonAttentiveDecoder

class nnsvs.acoustic_models.BiLSTMNonAttentiveDecoder(in_dim=512, ff_hidden_dim=2048, conv_hidden_dim=1024, lstm_hidden_dim=256, num_lstm_layers=2, out_dim=80, decoder_layers=2, decoder_hidden_dim=1024, prenet_layers=2, prenet_hidden_dim=256, prenet_dropout=0.5, zoneout=0.1, reduction_factor=1, downsample_by_conv=False, postnet_layers=0, postnet_channels=512, postnet_kernel_size=5, postnet_dropout=0.0, in_ph_start_idx: int = 1, in_ph_end_idx: int = 50, embed_dim=None, init_type='none', eval_dropout=True, prenet_noise_std=0.0, initial_value=0.0)[source]

BiLSTM-based encoder + NonAttentiveDecoder

The encoder is based on the architecture of the Sinsy acoustic model.

Parameters:

in_dim (int) – Input dimension.
ff_hidden_dim (int) – Hidden dimension of feed-forward layers in the encoder.
conv_hidden_dim (int) – Hidden dimension of convolution layers in the encoder.
lstm_hidden_dim (int) – Hidden dimension of LSTM layers in the encoder.
num_lstm_layers (int) – Number of LSTM layers in the encoder.
out_dim (int) – Output dimension.
decoder_layers (int) – Number of LSTM layers in the decoder.
decoder_hidden_dim (int) – Hidden dimension of LSTM in the decoder.
prenet_layers (int) – Number of prenet layers.
prenet_hidden_dim (int) – Hidden dimension of prenet.
prenet_dropout (float) – Dropout rate of prenet.
zoneout (float) – Zoneout rate.
reduction_factor (int) – Reduction factor.
downsample_by_conv (bool) – If True, downsampling is performed by convolution.
postnet_layers (int) – Number of postnet layers.
postnet_channels (int) – Number of postnet channels.
postnet_kernel_size (int) – Kernel size of postnet.
postnet_dropout (float) – Dropout rate of postnet.
in_ph_start_idx (int) – Start index of phoneme features.
in_ph_end_idx (int) – End index of phoneme features.
embed_dim (int) – Embedding dimension.
init_type (str) – Initialization type.
eval_dropout (bool) – If True, dropout is applied in evaluation.
initial_value (float) – initial value for the autoregressive decoder.
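
in_ph_start_idx and in_ph_end_idx mark a slice of the input vector. Assuming that slice holds a one-hot phoneme encoding and that the end index is exclusive (both are assumptions, as is the helper name), a phoneme ID suitable for an embedding lookup could be recovered like this:

```python
def split_phoneme_features(frame, ph_start, ph_end):
    """Split one input frame into (before, phoneme_id, after).

    Hypothetical helper: assumes frame[ph_start:ph_end] is one-hot.
    """
    one_hot = frame[ph_start:ph_end]
    # The index of the largest entry is the phoneme ID within the slice.
    phoneme_id = max(range(len(one_hot)), key=lambda i: one_hot[i])
    return frame[:ph_start], phoneme_id, frame[ph_end:]


# Toy frame: 1 leading feature, 4 phoneme slots, 2 trailing features.
frame = [0.3, 0.0, 0.0, 1.0, 0.0, 0.7, 0.9]
before, ph_id, after = split_phoneme_features(frame, ph_start=1, ph_end=5)
print(before, ph_id, after)  # [0.3] 2 [0.7, 0.9]
```

Under this reading, the embedding table would need in_ph_end_idx - in_ph_start_idx rows, one per phoneme slot.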

BiLSTMMDNNonAttentiveDecoder

class nnsvs.acoustic_models.BiLSTMMDNNonAttentiveDecoder(in_dim=512, ff_hidden_dim=2048, conv_hidden_dim=1024, lstm_hidden_dim=256, num_lstm_layers=2, out_dim=80, decoder_layers=2, decoder_hidden_dim=1024, prenet_layers=2, prenet_hidden_dim=256, prenet_dropout=0.5, zoneout=0.1, reduction_factor=1, downsample_by_conv=False, num_gaussians=8, sampling_mode='mean', in_ph_start_idx: int = 1, in_ph_end_idx: int = 50, embed_dim=None, init_type='none', eval_dropout=True, prenet_noise_std=0, initial_value=0.0)[source]

BiLSTM-based encoder + NonAttentiveDecoder (MDN version)

The encoder is based on the architecture of the Sinsy acoustic model.

Parameters:

in_dim (int) – Input dimension.
ff_hidden_dim (int) – Hidden dimension of feed-forward layers in the encoder.
conv_hidden_dim (int) – Hidden dimension of convolution layers in the encoder.
lstm_hidden_dim (int) – Hidden dimension of LSTM layers in the encoder.
num_lstm_layers (int) – Number of LSTM layers in the encoder.
out_dim (int) – Output dimension.
decoder_layers (int) – Number of LSTM layers in the decoder.
decoder_hidden_dim (int) – Hidden dimension of LSTM in the decoder.
prenet_layers (int) – Number of prenet layers.
prenet_hidden_dim (int) – Hidden dimension of prenet.
prenet_dropout (float) – Dropout rate of prenet.
zoneout (float) – Zoneout rate.
reduction_factor (int) – Reduction factor.
downsample_by_conv (bool) – If True, downsampling is performed by convolution.
num_gaussians (int) – Number of Gaussians.
sampling_mode (str) – Sampling mode.
postnet_layers (int) – Number of postnet layers.
postnet_channels (int) – Number of postnet channels.
postnet_kernel_size (int) – Kernel size of postnet.
postnet_dropout (float) – Dropout rate of postnet.
in_ph_start_idx (int) – Start index of phoneme features.
in_ph_end_idx (int) – End index of phoneme features.
embed_dim (int) – Embedding dimension.
init_type (str) – Initialization type.
eval_dropout (bool) – If True, dropout is applied in evaluation.
initial_value (float) – initial value for the autoregressive decoder.

Tacotron-based F0 models

ResF0NonAttentiveDecoder

class nnsvs.acoustic_models.ResF0NonAttentiveDecoder(in_dim=512, out_dim=1, layers=2, hidden_dim=1024, prenet_layers=2, prenet_hidden_dim=256, prenet_dropout=0.5, zoneout=0.1, reduction_factor=1, downsample_by_conv=False, scaled_tanh=True, in_lf0_idx=300, in_lf0_min=5.3936276, in_lf0_max=6.491111, out_lf0_idx=180, out_lf0_mean=5.953093881972361, out_lf0_scale=0.23435173188961034, init_type='none', eval_dropout=True)[source]

Duration-informed Tacotron with residual F0 prediction.

Parameters:

in_dim (int) – dimension of encoder hidden layer
out_dim (int) – dimension of output
layers (int) – number of LSTM layers
hidden_dim (int) – dimension of hidden layer
prenet_layers (int) – number of pre-net layers
prenet_hidden_dim (int) – dimension of pre-net hidden layer
prenet_dropout (float) – dropout rate of pre-net
zoneout (float) – zoneout rate
reduction_factor (int) – reduction factor
downsample_by_conv (bool) – if True, downsample by convolution
scaled_tanh (bool) – if True, use scaled tanh for residual F0 prediction
in_lf0_idx (int) – index of lf0 in input features
in_lf0_min (float) – minimum of lf0 in the training data of input features
in_lf0_max (float) – maximum of lf0 in the training data of input features
out_lf0_idx (int) – index of lf0 in output features
out_lf0_mean (float) – mean of lf0 in the training data of output features
out_lf0_scale (float) – scale of lf0 in the training data of output features
init_type (str) – initialization type
eval_dropout (bool) – if True, use dropout in evaluation
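
A scaled tanh presumably keeps the predicted residual within a fixed range so that the output cannot drift far from the note pitch. A minimal sketch follows; the parameter name max_lf0_ratio and its value 0.035 are hypothetical and do not appear on this page.

```python
import math


def bounded_residual(raw, max_lf0_ratio=0.035):
    """Squash an unbounded network output into [-max, +max] (hypothetical)."""
    return max_lf0_ratio * math.tanh(raw)


# Large inputs saturate at the bound; small inputs pass through almost linearly.
for raw in (-10.0, 0.0, 0.5, 10.0):
    print(raw, round(bounded_residual(raw), 4))
```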

MDNResF0NonAttentiveDecoder

class nnsvs.acoustic_models.MDNResF0NonAttentiveDecoder(in_dim=512, out_dim=80, layers=2, hidden_dim=1024, prenet_layers=2, prenet_hidden_dim=256, prenet_dropout=0.5, zoneout=0.1, reduction_factor=1, downsample_by_conv=False, scaled_tanh=True, num_gaussians=4, sampling_mode='mean', in_lf0_idx=300, in_lf0_min=5.3936276, in_lf0_max=6.491111, out_lf0_idx=180, out_lf0_mean=5.953093881972361, out_lf0_scale=0.23435173188961034, init_type='none', eval_dropout=True)[source]

Duration-informed Tacotron with residual F0 prediction (MDN-version)

Parameters:

in_dim (int) – dimension of encoder hidden layer
out_dim (int) – dimension of output
layers (int) – number of LSTM layers
hidden_dim (int) – dimension of hidden layer
prenet_layers (int) – number of pre-net layers
prenet_hidden_dim (int) – dimension of pre-net hidden layer
prenet_dropout (float) – dropout rate of pre-net
zoneout (float) – zoneout rate
reduction_factor (int) – reduction factor
downsample_by_conv (bool) – if True, downsample by convolution
scaled_tanh (bool) – if True, use scaled tanh for residual F0 prediction
num_gaussians (int) – number of Gaussians
sampling_mode (str) – sampling mode
in_lf0_idx (int) – index of lf0 in input features
in_lf0_min (float) – minimum of lf0 in the training data of input features
in_lf0_max (float) – maximum of lf0 in the training data of input features
out_lf0_idx (int) – index of lf0 in output features
out_lf0_mean (float) – mean of lf0 in the training data of output features
out_lf0_scale (float) – scale of lf0 in the training data of output features
init_type (str) – initialization type
eval_dropout (bool) – if True, use dropout in evaluation

BiLSTMResF0NonAttentiveDecoder

class nnsvs.acoustic_models.BiLSTMResF0NonAttentiveDecoder(in_dim=512, ff_hidden_dim=2048, conv_hidden_dim=1024, lstm_hidden_dim=256, num_lstm_layers=2, dropout=0.0, out_dim=80, decoder_layers=2, decoder_hidden_dim=1024, prenet_layers=2, prenet_hidden_dim=256, prenet_dropout=0.5, zoneout=0.1, reduction_factor=1, downsample_by_conv=False, scaled_tanh=True, in_lf0_idx=300, in_lf0_min=5.3936276, in_lf0_max=6.491111, out_lf0_idx=180, out_lf0_mean=5.953093881972361, out_lf0_scale=0.23435173188961034, use_mdn=False, num_gaussians=4, sampling_mode='mean', in_ph_start_idx: int = 1, in_ph_end_idx: int = 50, embed_dim=None, init_type='none')[source]

BiLSTM-based encoder + duration-informed Tacotron with residual F0 prediction.

Parameters:

in_dim (int) – dimension of encoder hidden layer
ff_hidden_dim (int) – Hidden dimension of feed-forward layers in the encoder.
conv_hidden_dim (int) – Hidden dimension of convolution layers in the encoder.
lstm_hidden_dim (int) – Hidden dimension of LSTM layers in the encoder.
num_lstm_layers (int) – Number of LSTM layers in the encoder.
out_dim (int) – dimension of output
decoder_layers (int) – number of LSTM layers in the decoder
decoder_hidden_dim (int) – dimension of hidden layer in the decoder
prenet_layers (int) – number of pre-net layers
prenet_hidden_dim (int) – dimension of pre-net hidden layer
prenet_dropout (float) – dropout rate of pre-net
zoneout (float) – zoneout rate
reduction_factor (int) – reduction factor
downsample_by_conv (bool) – if True, downsample by convolution
scaled_tanh (bool) – if True, use scaled tanh for residual F0 prediction
in_lf0_idx (int) – index of lf0 in input features
in_lf0_min (float) – minimum of lf0 in the training data of input features
in_lf0_max (float) – maximum of lf0 in the training data of input features
out_lf0_idx (int) – index of lf0 in output features
out_lf0_mean (float) – mean of lf0 in the training data of output features
out_lf0_scale (float) – scale of lf0 in the training data of output features
use_mdn (bool) – if True, use mixture density network for F0 prediction
num_gaussians (int) – number of gaussians in MDN
sampling_mode (str) – sampling mode in inference. “mean” or “random”
in_ph_start_idx (int) – Start index of phoneme features.
in_ph_end_idx (int) – End index of phoneme features.
embed_dim (int) – Embedding dimension.
init_type (str) – initialization type

Multi-stream models

MultistreamSeparateF0ParametricModel

class nnsvs.acoustic_models.MultistreamSeparateF0ParametricModel(in_dim: int, out_dim: int, stream_sizes: list, reduction_factor: int, encoder: Module, mgc_model: Module, lf0_model: Module, vuv_model: Module, bap_model: Module, vib_model: Module | None = None, vib_flags_model: Module | None = None, in_rest_idx=1, in_lf0_idx=300, in_lf0_min=5.3936276, in_lf0_max=6.491111, out_lf0_idx=180, out_lf0_mean=5.953093881972361, out_lf0_scale=0.23435173188961034, lf0_teacher_forcing=True)[source]

Multi-stream model with a separate F0 prediction model

acoustic features: [MGC, LF0, VUV, BAP]

vib_model and vib_flags_model are optional and are likely to be removed.

Parameters:

in_dim (int) – Input dimension.
out_dim (int) – Output dimension.
stream_sizes (list) – List of stream sizes.
reduction_factor (int) – Reduction factor.
encoder (nn.Module) – A shared encoder.
mgc_model (nn.Module) – MGC prediction model.
lf0_model (nn.Module) – log-F0 prediction model.
vuv_model (nn.Module) – V/UV prediction model.
bap_model (nn.Module) – BAP prediction model.
in_rest_idx (int) – Index of the rest symbol in the input features.
in_lf0_idx (int) – index of lf0 in input features
in_lf0_min (float) – minimum value of lf0 in the training data of input features
in_lf0_max (float) – maximum value of lf0 in the training data of input features
out_lf0_idx (int) – index of lf0 in output features. Typically 180.
out_lf0_mean (float) – mean of lf0 in the training data of output features
out_lf0_scale (float) – scale of lf0 in the training data of output features
lf0_teacher_forcing (bool) – Whether to use teacher forcing for F0 prediction.
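
stream_sizes determines how the flat output vector is divided among the streams. The sketch below assumes the stream order [MGC, LF0, VUV, BAP] given above and hypothetical sizes [180, 3, 1, 15]. With those sizes the LF0 stream starts at offset 180, which matches the default out_lf0_idx.

```python
from itertools import accumulate


def stream_offsets(stream_sizes):
    """Return the start offset of each stream in the flat feature vector."""
    return [0] + list(accumulate(stream_sizes))[:-1]


def split_streams(frame, stream_sizes):
    """Slice one output frame into per-stream chunks."""
    assert len(frame) == sum(stream_sizes)
    offsets = stream_offsets(stream_sizes)
    return [frame[o:o + s] for o, s in zip(offsets, stream_sizes)]


# Hypothetical sizes for [MGC, LF0, VUV, BAP].
stream_sizes = [180, 3, 1, 15]
frame = list(range(sum(stream_sizes)))

mgc, lf0, vuv, bap = split_streams(frame, stream_sizes)
print(stream_offsets(stream_sizes))  # [0, 180, 183, 184]
print(lf0[0])  # 180
```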

NPSSMultistreamParametricModel

class nnsvs.acoustic_models.NPSSMultistreamParametricModel(in_dim: int, out_dim: int, stream_sizes: list, reduction_factor: int, lf0_model: Module, mgc_model: Module, bap_model: Module, vuv_model: Module, in_rest_idx=0, in_lf0_idx=51, in_lf0_min=5.3936276, in_lf0_max=6.491111, out_lf0_idx=60, out_lf0_mean=5.953093881972361, out_lf0_scale=0.23435173188961034, npss_style_conditioning=False, vuv_model_bap_conditioning=True, vuv_model_bap0_conditioning=False, vuv_model_lf0_conditioning=True, vuv_model_mgc_conditioning=False)[source]

NPSS-like cascaded multi-stream model with no mixture density networks.

NPSS: Blaauw and Bonada [BB17]

Unlike the original NPSS, we don't use spectral parameters as inputs to the aperiodicity and V/UV prediction models. This is because (1) D4C does not use spectral parameters as input for aperiodicity estimation, and (2) V/UV detection is done from aperiodicity at 0-3 kHz in WORLD. In addition, the F0 and V/UV models don't use MDNs.

Empirically, we found the above configuration works better than the original one.

Parameters:

in_dim (int) – Input dimension.
out_dim (int) – Output dimension.
stream_sizes (list) – List of stream sizes.
lf0_model (BaseModel) – Model for predicting log-F0.
mgc_model (BaseModel) – Model for predicting MGC.
bap_model (BaseModel) – Model for predicting BAP.
vuv_model (BaseModel) – Model for predicting V/UV.
in_rest_idx (int) – Index of the rest symbol in the input features.
in_lf0_idx (int) – index of lf0 in input features
in_lf0_min (float) – minimum value of lf0 in the training data of input features
in_lf0_max (float) – maximum value of lf0 in the training data of input features
out_lf0_idx (int) – index of lf0 in output features. Typically 180.
out_lf0_mean (float) – mean of lf0 in the training data of output features
out_lf0_scale (float) – scale of lf0 in the training data of output features
vuv_model_bap_conditioning (bool) – If True, use BAP features for V/UV prediction.
vuv_model_bap0_conditioning (bool) – If True, use only 0-th coef. of BAP for V/UV prediction.
vuv_model_lf0_conditioning (bool) – If True, use log-F0 features for V/UV prediction.
vuv_model_mgc_conditioning (bool) – If True, use MGC features for V/UV prediction.
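
The vuv_model_*_conditioning flags select which predicted streams are passed to the V/UV model alongside the input features. A rough sketch of that selection for a single frame follows. The concatenation order, the way bap0 conditioning narrows bap conditioning, and the helper name are assumptions rather than documented behavior.

```python
def build_vuv_input(in_feats, lf0, bap, mgc,
                    bap_conditioning=True, bap0_conditioning=False,
                    lf0_conditioning=True, mgc_conditioning=False):
    """Assemble the V/UV model input for one frame (hypothetical helper)."""
    vuv_input = list(in_feats)
    if lf0_conditioning:
        vuv_input += lf0
    if bap_conditioning:
        # With bap0 conditioning, keep only the 0-th BAP coefficient.
        vuv_input += bap[:1] if bap0_conditioning else bap
    if mgc_conditioning:
        vuv_input += mgc
    return vuv_input


in_feats = [0.1, 0.2]
lf0, bap, mgc = [5.9], [-3.0, -2.0], [1.0, 0.5, 0.25]

print(build_vuv_input(in_feats, lf0, bap, mgc))
# [0.1, 0.2, 5.9, -3.0, -2.0]
print(build_vuv_input(in_feats, lf0, bap, mgc, bap0_conditioning=True))
# [0.1, 0.2, 5.9, -3.0]
```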

NPSSMDNMultistreamParametricModel

class nnsvs.acoustic_models.NPSSMDNMultistreamParametricModel(in_dim: int, out_dim: int, stream_sizes: list, reduction_factor: int, lf0_model: Module, mgc_model: Module, bap_model: Module, vuv_model: Module, in_rest_idx=0, in_lf0_idx=51, in_lf0_min=5.3936276, in_lf0_max=6.491111, out_lf0_idx=60, out_lf0_mean=5.953093881972361, out_lf0_scale=0.23435173188961034, vuv_model_bap_conditioning=True, vuv_model_bap0_conditioning=False, vuv_model_lf0_conditioning=True, vuv_model_mgc_conditioning=False)[source]

NPSS-like cascaded multi-stream parametric model with mixture density networks.

Note

This class was originally designed to be used with MDNs. However, the internal design was changed to make it work with non-MDN and diffusion models. For example, you can use non-MDN models for MGC prediction.

NPSS: Blaauw and Bonada [BB17]

acoustic features: [MGC, LF0, VUV, BAP]

Parameters:

in_dim (int) – Input dimension.
out_dim (int) – Output dimension.
stream_sizes (list) – List of stream sizes.
lf0_model (BaseModel) – Model for predicting log-F0.
mgc_model (BaseModel) – Model for predicting MGC.
bap_model (BaseModel) – Model for predicting BAP.
vuv_model (BaseModel) – Model for predicting V/UV.
in_rest_idx (int) – Index of the rest symbol in the input features.
in_lf0_idx (int) – index of lf0 in input features
in_lf0_min (float) – minimum value of lf0 in the training data of input features
in_lf0_max (float) – maximum value of lf0 in the training data of input features
out_lf0_idx (int) – index of lf0 in output features. Typically 180.
out_lf0_mean (float) – mean of lf0 in the training data of output features
out_lf0_scale (float) – scale of lf0 in the training data of output features
vuv_model_bap_conditioning (bool) – If True, use BAP features for V/UV prediction.
vuv_model_bap0_conditioning (bool) – If True, use only 0-th coef. of BAP for V/UV prediction.
vuv_model_lf0_conditioning (bool) – If True, use log-F0 features for V/UV prediction.
vuv_model_mgc_conditioning (bool) – If True, use MGC features for V/UV prediction.

MultistreamSeparateF0MelModel

class nnsvs.acoustic_models.MultistreamSeparateF0MelModel(in_dim: int, out_dim: int, stream_sizes: list, reduction_factor: int, encoder: Module, mel_model: Module, lf0_model: Module, vuv_model: Module, in_rest_idx=1, in_lf0_idx=300, in_lf0_min=5.3936276, in_lf0_max=6.491111, out_lf0_idx=180, out_lf0_mean=5.953093881972361, out_lf0_scale=0.23435173188961034)[source]

Multi-stream model with a separate F0 prediction model (mel-version)

Conditional dependency: p(MEL, LF0, VUV|C) = p(LF0|C) p(MEL|LF0, C) p(VUV|LF0, C)

Parameters:

in_dim (int) – Input dimension.
out_dim (int) – Output dimension.
stream_sizes (list) – List of stream sizes.
reduction_factor (int) – Reduction factor.
encoder (nn.Module) – A shared encoder.
mel_model (nn.Module) – MEL prediction model.
lf0_model (nn.Module) – log-F0 prediction model.
vuv_model (nn.Module) – V/UV prediction model.
in_rest_idx (int) – Index of the rest symbol in the input features.
in_lf0_idx (int) – index of lf0 in input features
in_lf0_min (float) – minimum value of lf0 in the training data of input features
in_lf0_max (float) – maximum value of lf0 in the training data of input features
out_lf0_idx (int) – index of lf0 in output features. Typically 180.
out_lf0_mean (float) – mean of lf0 in the training data of output features
out_lf0_scale (float) – scale of lf0 in the training data of output features
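
The conditional dependency above implies an order of evaluation: log-F0 is predicted first, then mel and V/UV are each predicted from the conditioning features plus log-F0. The sketch below uses toy callables in place of real networks; every function here is a hypothetical stand-in.

```python
def cascade_predict(c, lf0_model, mel_model, vuv_model):
    """Predict streams in the order implied by the factorization."""
    lf0 = lf0_model(c)        # p(LF0 | C)
    mel = mel_model(c + lf0)  # p(MEL | LF0, C)
    vuv = vuv_model(c + lf0)  # p(VUV | LF0, C)
    return mel, lf0, vuv


# Toy stand-ins that operate on plain lists of floats.
def toy_lf0(c):
    return [sum(c)]


def toy_mel(h):
    return [2.0 * v for v in h]


def toy_vuv(h):
    # Voiced if the appended lf0 value is positive.
    return [1.0 if h[-1] > 0 else 0.0]


mel, lf0, vuv = cascade_predict([0.5, 0.25], toy_lf0, toy_mel, toy_vuv)
print(mel, lf0, vuv)  # [1.0, 0.5, 1.5] [0.75] [1.0]
```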

MDNMultistreamSeparateF0MelModel

class nnsvs.acoustic_models.MDNMultistreamSeparateF0MelModel(in_dim: int, out_dim: int, stream_sizes: list, reduction_factor: int, mel_model: Module, lf0_model: Module, vuv_model: Module, in_rest_idx=0, in_lf0_idx=51, in_lf0_min=5.3936276, in_lf0_max=6.491111, out_lf0_idx=60, out_lf0_mean=5.953093881972361, out_lf0_scale=0.23435173188961034, vuv_model_lf0_conditioning=True, vuv_model_mel_conditioning=True)[source]

Multi-stream model with a separate F0 model (mel-version) and MDN

V/UV prediction is performed given a mel-spectrogram.

Conditional dependency: p(MEL, LF0, VUV|C) = p(LF0|C) p(MEL|LF0, C) p(VUV|LF0, MEL, C)

Note

This class was originally designed to be used with MDNs. However, the internal design was changed to make it work with non-MDN and diffusion models. For example, you can use non-MDN models for mel prediction.

Parameters:

in_dim (int) – Input dimension.
out_dim (int) – Output dimension.
stream_sizes (list) – List of stream sizes.
reduction_factor (int) – Reduction factor.
encoder (nn.Module) – A shared encoder.
mel_model (nn.Module) – MEL prediction model.
lf0_model (nn.Module) – log-F0 prediction model.
vuv_model (nn.Module) – V/UV prediction model.
in_rest_idx (int) – Index of the rest symbol in the input features.
in_lf0_idx (int) – index of lf0 in input features
in_lf0_min (float) – minimum value of lf0 in the training data of input features
in_lf0_max (float) – maximum value of lf0 in the training data of input features
out_lf0_idx (int) – index of lf0 in output features. Typically 180.
out_lf0_mean (float) – mean of lf0 in the training data of output features
out_lf0_scale (float) – scale of lf0 in the training data of output features
vuv_model_lf0_conditioning (bool) – If True, use log-F0 features for V/UV prediction.
vuv_model_mel_conditioning (bool) – If True, use mel features for V/UV prediction.

Other single-stream models

ResF0Conv1dResnet

class nnsvs.acoustic_models.ResF0Conv1dResnet(in_dim, hidden_dim, out_dim, num_layers=4, in_lf0_idx=300, in_lf0_min=5.3936276, in_lf0_max=6.491111, out_lf0_idx=180, out_lf0_mean=5.953093881972361, out_lf0_scale=0.23435173188961034, init_type='none', use_mdn=False, num_gaussians=8, dim_wise=False)[source]

Conv1d + Resnet + Residual F0 prediction

Residual F0 prediction is inspired by Hono et al. [HHO+21].

Parameters:

in_dim (int) – input dimension
hidden_dim (int) – hidden dimension
out_dim (int) – output dimension
num_layers (int) – number of layers
in_lf0_idx (int) – index of lf0 in input features
in_lf0_min (float) – minimum value of lf0 in the training data of input features
in_lf0_max (float) – maximum value of lf0 in the training data of input features
out_lf0_idx (int) – index of lf0 in output features. Typically 180.
out_lf0_mean (float) – mean of lf0 in the training data of output features
out_lf0_scale (float) – scale of lf0 in the training data of output features
init_type (str) – initialization type
use_mdn (bool) – whether to use MDN or not
num_gaussians (int) – number of gaussians in MDN
dim_wise (bool) – whether to use dimension-wise MDN or not

ResF0VariancePredictor

class nnsvs.acoustic_models.ResF0VariancePredictor(in_dim, out_dim, num_layers=5, hidden_dim=256, kernel_size=5, dropout=0.5, init_type='none', use_mdn=False, num_gaussians=1, dim_wise=False, in_lf0_idx=300, in_lf0_min=5.3936276, in_lf0_max=6.491111, out_lf0_idx=180, out_lf0_mean=5.953093881972361, out_lf0_scale=0.23435173188961034)[source]

Variance predictor in Ren et al. [RHQ+21] with residual F0 prediction

Parameters:

in_dim (int) – the input dimension
out_dim (int) – the output dimension
num_layers (int) – the number of layers
hidden_dim (int) – the hidden dimension
kernel_size (int) – the kernel size
dropout (float) – the dropout rate
in_lf0_idx (int) – the index of the input LF0
in_lf0_min (float) – the minimum value of the input LF0
in_lf0_max (float) – the maximum value of the input LF0
out_lf0_idx (int) – the index of the output LF0
out_lf0_mean (float) – the mean value of the output LF0
out_lf0_scale (float) – the scale value of the output LF0
init_type (str) – the initialization type
use_mdn (bool) – whether to use MDN or not
num_gaussians (int) – the number of gaussians
dim_wise (bool) – whether to use dim-wise or not