nnsvs.model

Generic models that can be used as time-lag, duration, or acoustic models.

FFN

class nnsvs.model.FFN(in_dim, hidden_dim, out_dim, num_layers=2, dropout=0.0, init_type='none', last_sigmoid=False)[source]

Feed-forward network

Parameters:
  • in_dim (int) – the dimension of the input

  • hidden_dim (int) – the dimension of the hidden state

  • out_dim (int) – the dimension of the output

  • num_layers (int) – the number of layers

  • dropout (float) – dropout rate

  • init_type (str) – the type of weight initialization

  • last_sigmoid (bool) – whether to apply sigmoid on the output
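As an illustration of the architecture described above, here is a minimal feed-forward stack in the same spirit (a sketch, not the nnsvs implementation; the helper name `make_ffn` is invented for this example):

```python
import torch
from torch import nn

# Sketch of an FFN-style stack: num_layers Linear+ReLU+Dropout blocks,
# a final projection, and an optional sigmoid (last_sigmoid=True).
def make_ffn(in_dim, hidden_dim, out_dim, num_layers=2, dropout=0.0, last_sigmoid=False):
    layers = [nn.Linear(in_dim, hidden_dim), nn.ReLU(), nn.Dropout(dropout)]
    for _ in range(num_layers - 1):
        layers += [nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Dropout(dropout)]
    layers.append(nn.Linear(hidden_dim, out_dim))
    if last_sigmoid:
        layers.append(nn.Sigmoid())
    return nn.Sequential(*layers)

net = make_ffn(in_dim=331, hidden_dim=64, out_dim=1, num_layers=2, last_sigmoid=True)
x = torch.randn(8, 100, 331)  # (batch, frames, features); 331 is an example feature size
y = net(x)
```

With `last_sigmoid=True` the output is squashed into (0, 1), which is useful when the target is a probability-like quantity such as voiced/unvoiced flags.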

LSTMRNN

class nnsvs.model.LSTMRNN(in_dim, hidden_dim, out_dim, num_layers=1, bidirectional=True, dropout=0.0, init_type='none')[source]

LSTM-based recurrent neural network

Parameters:
  • in_dim (int) – the dimension of the input

  • hidden_dim (int) – the dimension of the hidden state

  • out_dim (int) – the dimension of the output

  • num_layers (int) – the number of layers

  • bidirectional (bool) – whether to use bidirectional LSTM

  • dropout (float) – dropout rate

  • init_type (str) – the type of weight initialization
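The shape of this model can be sketched with plain PyTorch as a (bi)directional LSTM followed by a linear projection (an illustrative sketch under assumed conventions, not the nnsvs source):

```python
import torch
from torch import nn

# Minimal sketch of the LSTMRNN idea: a (bi)directional LSTM over
# frame-level features, then a linear projection to out_dim.
class TinyLSTMRNN(nn.Module):
    def __init__(self, in_dim, hidden_dim, out_dim, num_layers=1,
                 bidirectional=True, dropout=0.0):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden_dim, num_layers,
                            batch_first=True, bidirectional=bidirectional,
                            dropout=dropout)
        num_dirs = 2 if bidirectional else 1
        self.proj = nn.Linear(hidden_dim * num_dirs, out_dim)

    def forward(self, x):  # x: (batch, frames, in_dim)
        out, _ = self.lstm(x)
        return self.proj(out)

model = TinyLSTMRNN(in_dim=331, hidden_dim=64, out_dim=67)
y = model(torch.randn(4, 120, 331))
```

Note that with `bidirectional=True` the LSTM output has `2 * hidden_dim` channels, so the projection layer must account for both directions.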

LSTMRNNSAR

class nnsvs.model.LSTMRNNSAR(in_dim, hidden_dim, out_dim, num_layers=1, bidirectional=True, dropout=0.0, stream_sizes=None, ar_orders=None, init_type='none')[source]

LSTM-RNN with shallow AR structure

Parameters:
  • in_dim (int) – the dimension of the input

  • hidden_dim (int) – the dimension of the hidden state

  • out_dim (int) – the dimension of the output

  • num_layers (int) – the number of layers

  • bidirectional (bool) – whether to use bidirectional LSTM

  • dropout (float) – dropout rate

  • stream_sizes (list) – Stream sizes

  • ar_orders (list) – Filter dimensions for each stream.

  • init_type (str) – the type of weight initialization

Conv1dResnet

class nnsvs.model.Conv1dResnet(in_dim, hidden_dim, out_dim, num_layers=4, init_type='none', use_mdn=False, num_gaussians=8, dim_wise=False, in_ph_start_idx: int = 1, in_ph_end_idx: int = 50, embed_dim=None, **kwargs)[source]

Conv1d + Resnet

The model is inspired by MelGAN's architecture (Kumar et al. [KKdB+19]). An MDN output layer is added if use_mdn is True.

Parameters:
  • in_dim (int) – the dimension of the input

  • hidden_dim (int) – the dimension of the hidden state

  • out_dim (int) – the dimension of the output

  • num_layers (int) – the number of layers

  • init_type (str) – the type of weight initialization

  • use_mdn (bool) – whether to use MDN or not

  • num_gaussians (int) – the number of gaussians in MDN

  • dim_wise (bool) – whether to use dimension-wise MDN or not

  • in_ph_start_idx (int) – the start index of phoneme identity in a hed file

  • in_ph_end_idx (int) – the end index of phoneme identity in a hed file

  • embed_dim (int) – the dimension of the phoneme embedding
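The residual Conv1d blocks that give the model its name can be sketched as follows (an illustrative block in the spirit of MelGAN's residual stack; not the nnsvs source, and the class name is invented for this example):

```python
import torch
from torch import nn

# A minimal residual Conv1d block: a dilated convolution sandwiched
# between LeakyReLU activations, with a skip connection around it.
class ResConv1dBlock(nn.Module):
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        padding = (kernel_size - 1) // 2 * dilation  # keep frame length unchanged
        self.block = nn.Sequential(
            nn.LeakyReLU(0.2),
            nn.Conv1d(channels, channels, kernel_size,
                      dilation=dilation, padding=padding),
            nn.LeakyReLU(0.2),
            nn.Conv1d(channels, channels, 1),
        )

    def forward(self, x):  # x: (batch, channels, frames)
        return x + self.block(x)

block = ResConv1dBlock(channels=64, dilation=3)
y = block(torch.randn(2, 64, 100))
```

Increasing the dilation across stacked blocks grows the receptive field without increasing the parameter count per layer.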

Conv1dResnetMDN

class nnsvs.model.Conv1dResnetMDN(in_dim, hidden_dim, out_dim, num_layers=4, num_gaussians=8, dim_wise=False, init_type='none', **kwargs)[source]

Conv1dResnet with MDN output layer

Conv1dResnetSAR

class nnsvs.model.Conv1dResnetSAR(in_dim, hidden_dim, out_dim, num_layers=4, stream_sizes=None, ar_orders=None, init_type='none', **kwargs)[source]

Conv1dResnet with shallow AR structure

Parameters:
  • in_dim (int) – the dimension of the input

  • hidden_dim (int) – the dimension of the hidden state

  • out_dim (int) – the dimension of the output

  • num_layers (int) – the number of layers

  • stream_sizes (list) – Stream sizes

  • ar_orders (list) – Filter dimensions for each stream.

  • init_type (str) – the type of weight initialization

MDN

class nnsvs.model.MDN(in_dim, hidden_dim, out_dim, num_layers=1, num_gaussians=8, dim_wise=False, init_type='none', **kwargs)[source]

Mixture density networks (MDN) with FFN

Warning

It is recommended to use MDNv2 instead, unless you want to fine-tune from an old checkpoint of MDN.

Parameters:
  • in_dim (int) – the dimension of the input

  • hidden_dim (int) – the dimension of the hidden state

  • out_dim (int) – the dimension of the output

  • num_layers (int) – the number of layers

  • num_gaussians (int) – the number of gaussians

  • dim_wise (bool) – whether to use dimension-wise or not

  • init_type (str) – the type of weight initialization

MDNv2

class nnsvs.model.MDNv2(in_dim, hidden_dim, out_dim, num_layers=1, dropout=0.5, num_gaussians=8, dim_wise=False, init_type='none')[source]

Mixture density networks (MDN) with FFN

MDN (v1) + Dropout

Parameters:
  • in_dim (int) – the dimension of the input

  • hidden_dim (int) – the dimension of the hidden state

  • out_dim (int) – the dimension of the output

  • num_layers (int) – the number of layers

  • dropout (float) – dropout rate

  • num_gaussians (int) – the number of gaussians

  • dim_wise (bool) – whether to use dimension-wise or not

  • init_type (str) – the type of weight initialization
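The MDN output layer shared by the MDN-family models can be sketched as follows. For each frame the network emits mixture weights, means, and scales of `num_gaussians` components; with `dim_wise=True` a separate mixture is fitted per output dimension. This is an illustrative sketch under assumed shape conventions, not the nnsvs implementation:

```python
import torch
from torch import nn

# Sketch of an MDN head: predicts log mixture weights (log_pi),
# means (mu) and scales (sigma) per frame. dim_wise=True gives each
# output dimension its own mixture weights.
class MDNHead(nn.Module):
    def __init__(self, hidden_dim, out_dim, num_gaussians=8, dim_wise=False):
        super().__init__()
        self.out_dim, self.G, self.dim_wise = out_dim, num_gaussians, dim_wise
        pi_dim = num_gaussians * (out_dim if dim_wise else 1)
        self.pi = nn.Linear(hidden_dim, pi_dim)
        self.mu = nn.Linear(hidden_dim, num_gaussians * out_dim)
        self.log_sigma = nn.Linear(hidden_dim, num_gaussians * out_dim)

    def forward(self, h):  # h: (batch, frames, hidden_dim)
        B, T, _ = h.shape
        if self.dim_wise:
            log_pi = torch.log_softmax(
                self.pi(h).view(B, T, self.G, self.out_dim), dim=2)
        else:
            log_pi = torch.log_softmax(self.pi(h), dim=-1)
        mu = self.mu(h).view(B, T, self.G, self.out_dim)
        sigma = self.log_sigma(h).view(B, T, self.G, self.out_dim).exp()
        return log_pi, mu, sigma

head = MDNHead(hidden_dim=32, out_dim=4, num_gaussians=8, dim_wise=True)
log_pi, mu, sigma = head(torch.randn(2, 10, 32))
```

Predicting `log_sigma` and exponentiating keeps the scales strictly positive without clamping.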

RMDN

class nnsvs.model.RMDN(in_dim, hidden_dim, out_dim, num_layers=1, bidirectional=True, dropout=0.0, num_gaussians=8, dim_wise=False, init_type='none')[source]

RNN-based mixture density networks (MDN)

Parameters:
  • in_dim (int) – the dimension of the input

  • hidden_dim (int) – the dimension of the hidden state

  • out_dim (int) – the dimension of the output

  • num_layers (int) – the number of layers

  • bidirectional (bool) – whether to use bidirectional LSTM

  • dropout (float) – dropout rate

  • num_gaussians (int) – the number of gaussians

  • dim_wise (bool) – whether to use dimension-wise or not

  • init_type (str) – the type of weight initialization

FFConvLSTM

class nnsvs.model.FFConvLSTM(in_dim, ff_hidden_dim=2048, conv_hidden_dim=1024, lstm_hidden_dim=256, out_dim=67, dropout=0.0, num_lstm_layers=2, bidirectional=True, init_type='none', use_mdn=False, dim_wise=True, num_gaussians=4, in_ph_start_idx: int = 1, in_ph_end_idx: int = 50, embed_dim=None, enforce_sorted=True)[source]

FFN + Conv1d + LSTM

A model proposed in Hono et al. [HHO+21] without residual F0 prediction.

Parameters:
  • in_dim (int) – the dimension of the input

  • ff_hidden_dim (int) – the dimension of the hidden state of the FFN

  • conv_hidden_dim (int) – the dimension of the hidden state of the conv1d

  • lstm_hidden_dim (int) – the dimension of the hidden state of the LSTM

  • out_dim (int) – the dimension of the output

  • dropout (float) – dropout rate

  • num_lstm_layers (int) – the number of layers of the LSTM

  • bidirectional (bool) – whether to use bidirectional LSTM

  • init_type (str) – the type of weight initialization

  • use_mdn (bool) – whether to use MDN or not

  • dim_wise (bool) – whether to use dimension-wise or not

  • num_gaussians (int) – the number of gaussians

  • in_ph_start_idx (int) – the start index of phoneme identity in a hed file

  • in_ph_end_idx (int) – the end index of phoneme identity in a hed file

  • embed_dim (int) – the dimension of the phoneme embedding

LSTMEncoder

class nnsvs.model.LSTMEncoder(in_dim: int, hidden_dim: int, out_dim: int, num_layers: int = 1, bidirectional: bool = True, dropout: float = 0.0, init_type: str = 'none', in_ph_start_idx: int = 1, in_ph_end_idx: int = 50, embed_dim=None, enforce_sorted=True)[source]

LSTM encoder

A simple LSTM-based encoder

Parameters:
  • in_dim (int) – the input dimension

  • hidden_dim (int) – the hidden dimension

  • out_dim (int) – the output dimension

  • num_layers (int) – the number of layers

  • bidirectional (bool) – whether to use bidirectional or not

  • dropout (float) – the dropout rate

  • init_type (str) – the initialization type

  • in_ph_start_idx (int) – the start index of phonetic context in a hed file

  • in_ph_end_idx (int) – the end index of phonetic context in a hed file

  • embed_dim (int) – the embedding dimension
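The `in_ph_start_idx` / `in_ph_end_idx` / `embed_dim` parameters concern the phoneme-identity block of the linguistic feature vector: when `embed_dim` is set, the one-hot phoneme columns can be replaced by a learned embedding. A hypothetical sketch of that preprocessing (all index values and dimensions here are invented for illustration):

```python
import torch
from torch import nn

# Illustrative: swap the one-hot phoneme-identity slice of a linguistic
# feature vector for a learned embedding. Indices/dims are example values.
in_ph_start_idx, in_ph_end_idx, embed_dim = 1, 50, 32
num_ph = in_ph_end_idx - in_ph_start_idx  # width of the one-hot block
embed = nn.Embedding(num_ph, embed_dim)

x = torch.zeros(2, 10, 60)          # (batch, frames, linguistic features)
x[..., in_ph_start_idx + 3] = 1.0   # pretend every frame is phoneme id 3

ph_onehot = x[..., in_ph_start_idx:in_ph_end_idx]  # (2, 10, 49)
ph_ids = ph_onehot.argmax(dim=-1)                  # (2, 10)
ph_emb = embed(ph_ids)                             # (2, 10, 32)
rest = torch.cat([x[..., :in_ph_start_idx], x[..., in_ph_end_idx:]], dim=-1)
x_new = torch.cat([rest, ph_emb], dim=-1)          # non-phoneme features + embedding
```

A dense embedding typically generalizes better than a wide one-hot block when the phoneme inventory is large.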

VariancePredictor

class nnsvs.model.VariancePredictor(in_dim, out_dim, num_layers=5, hidden_dim=256, kernel_size=5, dropout=0.5, init_type='none', use_mdn=False, num_gaussians=1, dim_wise=False, in_ph_start_idx: int = 1, in_ph_end_idx: int = 50, embed_dim=None, mask_indices=None)[source]

Variance predictor in Ren et al. [RHQ+21].

The model is composed of stacks of Conv1d + ReLU + LayerNorm layers. The model can be used for duration or pitch prediction.

Parameters:
  • in_dim (int) – the input dimension

  • out_dim (int) – the output dimension

  • num_layers (int) – the number of layers

  • hidden_dim (int) – the hidden dimension

  • kernel_size (int) – the kernel size

  • dropout (float) – the dropout rate

  • init_type (str) – the initialization type

  • use_mdn (bool) – whether to use MDN or not

  • num_gaussians (int) – the number of gaussians

  • dim_wise (bool) – whether to use dimension-wise MDN or not

  • in_ph_start_idx (int) – the start index of phoneme identity in a hed file

  • in_ph_end_idx (int) – the end index of phoneme identity in a hed file

  • embed_dim (int) – the dimension of the phoneme embedding

  • mask_indices (list) – the input feature indices to be masked. e.g., specify pitch_idx to mask pitch features.
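One layer of the Conv1d + ReLU + LayerNorm stack described above can be sketched like this (an illustrative sketch, not the nnsvs source; the class name is invented for this example):

```python
import torch
from torch import nn

# One VariancePredictor-style layer: Conv1d over time, ReLU, LayerNorm
# over the channel dimension, then Dropout.
class VarPredLayer(nn.Module):
    def __init__(self, in_dim, hidden_dim=256, kernel_size=5, dropout=0.5):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, hidden_dim, kernel_size,
                              padding=(kernel_size - 1) // 2)
        self.norm = nn.LayerNorm(hidden_dim)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):  # x: (batch, frames, in_dim)
        # Conv1d expects (batch, channels, frames), so transpose around it
        h = torch.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)
        return self.drop(self.norm(h))

layer = VarPredLayer(in_dim=331)
h = layer(torch.randn(2, 50, 331))
```

Note that LayerNorm is applied over the feature (channel) axis, which is why the tensor is transposed back to `(batch, frames, channels)` before normalization.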

TransformerEncoder

class nnsvs.model.TransformerEncoder(in_dim, out_dim, hidden_dim, attention_dim, num_heads=2, num_layers=2, kernel_size=3, dropout=0.1, reduction_factor=1, init_type='none', downsample_by_conv=False, in_ph_start_idx: int = 1, in_ph_end_idx: int = 50, embed_dim=None)[source]

Transformer encoder

Warning

So far this is not well tested and may be removed in the future.

Parameters:
  • in_dim (int) – the input dimension

  • out_dim (int) – the output dimension

  • hidden_dim (int) – the hidden dimension

  • attention_dim (int) – the attention dimension

  • num_heads (int) – the number of heads

  • num_layers (int) – the number of layers

  • kernel_size (int) – the kernel size

  • dropout (float) – the dropout rate

  • reduction_factor (int) – the reduction factor

  • init_type (str) – the initialization type

  • downsample_by_conv (bool) – whether to use convolutional downsampling or not

  • in_ph_start_idx (int) – the start index of phonetic context in a hed file

  • in_ph_end_idx (int) – the end index of phonetic context in a hed file

  • embed_dim (int) – the embedding dimension