nnsvs.model

Generic models that can be used as time-lag, duration, or acoustic models.

FFN

class nnsvs.model.FFN(in_dim, hidden_dim, out_dim, num_layers=2, dropout=0.0, init_type='none', last_sigmoid=False)[source]

Feed-forward network

Parameters:
  • in_dim (int) – the dimension of the input

  • hidden_dim (int) – the dimension of the hidden state

  • out_dim (int) – the dimension of the output

  • num_layers (int) – the number of layers

  • dropout (float) – dropout rate

  • init_type (str) – the type of weight initialization

  • last_sigmoid (bool) – whether to apply sigmoid on the output
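As an illustration of the architecture described above, here is a minimal feed-forward stack in the same spirit (a sketch, not the nnsvs implementation; the helper name `make_ffn` is invented for this example):

```python
import torch
from torch import nn

# Sketch of an FFN-style stack: num_layers Linear+ReLU+Dropout blocks,
# a final projection, and an optional sigmoid (last_sigmoid=True).
def make_ffn(in_dim, hidden_dim, out_dim, num_layers=2, dropout=0.0, last_sigmoid=False):
    layers = [nn.Linear(in_dim, hidden_dim), nn.ReLU(), nn.Dropout(dropout)]
    for _ in range(num_layers - 1):
        layers += [nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Dropout(dropout)]
    layers.append(nn.Linear(hidden_dim, out_dim))
    if last_sigmoid:
        layers.append(nn.Sigmoid())
    return nn.Sequential(*layers)

net = make_ffn(in_dim=331, hidden_dim=64, out_dim=1, num_layers=2, last_sigmoid=True)
x = torch.randn(8, 100, 331)  # (batch, frames, features); 331 is an example feature size
y = net(x)
```

With `last_sigmoid=True` the output is squashed into (0, 1), which is useful when the target is a probability-like quantity such as voiced/unvoiced flags.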

LSTMRNN

class nnsvs.model.LSTMRNN(in_dim, hidden_dim, out_dim, num_layers=1, bidirectional=True, dropout=0.0, init_type='none')[source]

LSTM-based recurrent neural network

Parameters:
  • in_dim (int) – the dimension of the input

  • hidden_dim (int) – the dimension of the hidden state

  • out_dim (int) – the dimension of the output

  • num_layers (int) – the number of layers

  • bidirectional (bool) – whether to use bidirectional LSTM

  • dropout (float) – dropout rate

  • init_type (str) – the type of weight initialization
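The shape of this model can be sketched with plain PyTorch as a (bi)directional LSTM followed by a linear projection (an illustrative sketch under assumed conventions, not the nnsvs source):

```python
import torch
from torch import nn

# Minimal sketch of the LSTMRNN idea: a (bi)directional LSTM over
# frame-level features, then a linear projection to out_dim.
class TinyLSTMRNN(nn.Module):
    def __init__(self, in_dim, hidden_dim, out_dim, num_layers=1,
                 bidirectional=True, dropout=0.0):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden_dim, num_layers,
                            batch_first=True, bidirectional=bidirectional,
                            dropout=dropout)
        num_dirs = 2 if bidirectional else 1
        self.proj = nn.Linear(hidden_dim * num_dirs, out_dim)

    def forward(self, x):  # x: (batch, frames, in_dim)
        out, _ = self.lstm(x)
        return self.proj(out)

model = TinyLSTMRNN(in_dim=331, hidden_dim=64, out_dim=67)
y = model(torch.randn(4, 120, 331))
```

Note that with `bidirectional=True` the LSTM output has `2 * hidden_dim` channels, so the projection layer must account for both directions.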

LSTMRNNSAR

class nnsvs.model.LSTMRNNSAR(in_dim, hidden_dim, out_dim, num_layers=1, bidirectional=True, dropout=0.0, stream_sizes=None, ar_orders=None, init_type='none')[source]

LSTM-RNN with shallow AR structure

Parameters:
  • in_dim (int) – the dimension of the input

  • hidden_dim (int) – the dimension of the hidden state

  • out_dim (int) – the dimension of the output

  • num_layers (int) – the number of layers

  • bidirectional (bool) – whether to use bidirectional LSTM

  • dropout (float) – dropout rate

  • stream_sizes (list) – Stream sizes

  • ar_orders (list) – Filter dimensions for each stream.

  • init_type (str) – the type of weight initialization

Conv1dResnet

class nnsvs.model.Conv1dResnet(in_dim, hidden_dim, out_dim, num_layers=4, init_type='none', use_mdn=False, num_gaussians=8, dim_wise=False, in_ph_start_idx: int = 1, in_ph_end_idx: int = 50, embed_dim=None, **kwargs)[source]

Conv1d + Resnet

The model is inspired by MelGAN's architecture (Kumar et al. [KKdB+19]). An MDN output layer is added if use_mdn is True.

Parameters:
  • in_dim (int) – the dimension of the input

  • hidden_dim (int) – the dimension of the hidden state

  • out_dim (int) – the dimension of the output

  • num_layers (int) – the number of layers

  • init_type (str) – the type of weight initialization

  • use_mdn (bool) – whether to use MDN or not

  • num_gaussians (int) – the number of gaussians in MDN

  • dim_wise (bool) – whether to use dimension-wise MDN or not

  • in_ph_start_idx (int) – the start index of phoneme identity in a hed file

  • in_ph_end_idx (int) – the end index of phoneme identity in a hed file

  • embed_dim (int) – the dimension of the phoneme embedding
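The residual Conv1d blocks that give the model its name can be sketched as follows (an illustrative block in the spirit of MelGAN's residual stack; not the nnsvs source, and the class name is invented for this example):

```python
import torch
from torch import nn

# A minimal residual Conv1d block: a dilated convolution sandwiched
# between LeakyReLU activations, with a skip connection around it.
class ResConv1dBlock(nn.Module):
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        padding = (kernel_size - 1) // 2 * dilation  # keep frame length unchanged
        self.block = nn.Sequential(
            nn.LeakyReLU(0.2),
            nn.Conv1d(channels, channels, kernel_size,
                      dilation=dilation, padding=padding),
            nn.LeakyReLU(0.2),
            nn.Conv1d(channels, channels, 1),
        )

    def forward(self, x):  # x: (batch, channels, frames)
        return x + self.block(x)

block = ResConv1dBlock(channels=64, dilation=3)
y = block(torch.randn(2, 64, 100))
```

Increasing the dilation across stacked blocks grows the receptive field without increasing the parameter count per layer.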

Conv1dResnetMDN

class nnsvs.model.Conv1dResnetMDN(in_dim, hidden_dim, out_dim, num_layers=4, num_gaussians=8, dim_wise=False, init_type='none', **kwargs)[source]

Conv1dResnet with MDN output layer

Conv1dResnetSAR

class nnsvs.model.Conv1dResnetSAR(in_dim, hidden_dim, out_dim, num_layers=4, stream_sizes=None, ar_orders=None, init_type='none', **kwargs)[source]

Conv1dResnet with shallow AR structure

Parameters:
  • in_dim (int) – the dimension of the input

  • hidden_dim (int) – the dimension of the hidden state

  • out_dim (int) – the dimension of the output

  • num_layers (int) – the number of layers

  • stream_sizes (list) – Stream sizes

  • ar_orders (list) – Filter dimensions for each stream.

  • init_type (str) – the type of weight initialization

MDN

class nnsvs.model.MDN(in_dim, hidden_dim, out_dim, num_layers=1, num_gaussians=8, dim_wise=False, init_type='none', **kwargs)[source]

Mixture density networks (MDN) with FFN

Warning

It is recommended to use MDNv2 instead, unless you want to fine-tune from an old checkpoint of MDN.

Parameters:
  • in_dim (int) – the dimension of the input

  • hidden_dim (int) – the dimension of the hidden state

  • out_dim (int) – the dimension of the output

  • num_layers (int) – the number of layers

  • num_gaussians (int) – the number of gaussians

  • dim_wise (bool) – whether to use dimension-wise or not

  • init_type (str) – the type of weight initialization

MDNv2

class nnsvs.model.MDNv2(in_dim, hidden_dim, out_dim, num_layers=1, dropout=0.5, num_gaussians=8, dim_wise=False, init_type='none')[source]

Mixture density networks (MDN) with FFN

MDN (v1) + Dropout

Parameters:
  • in_dim (int) – the dimension of the input

  • hidden_dim (int) – the dimension of the hidden state

  • out_dim (int) – the dimension of the output

  • num_layers (int) – the number of layers

  • dropout (float) – dropout rate

  • num_gaussians (int) – the number of gaussians

  • dim_wise (bool) – whether to use dimension-wise or not

  • init_type (str) – the type of weight initialization
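The MDN output layer shared by the MDN-family models can be sketched as follows. For each frame the network emits mixture weights, means, and scales of `num_gaussians` components; with `dim_wise=True` a separate mixture is fitted per output dimension. This is an illustrative sketch under assumed shape conventions, not the nnsvs implementation:

```python
import torch
from torch import nn

# Sketch of an MDN head: predicts log mixture weights (log_pi),
# means (mu) and scales (sigma) per frame. dim_wise=True gives each
# output dimension its own mixture weights.
class MDNHead(nn.Module):
    def __init__(self, hidden_dim, out_dim, num_gaussians=8, dim_wise=False):
        super().__init__()
        self.out_dim, self.G, self.dim_wise = out_dim, num_gaussians, dim_wise
        pi_dim = num_gaussians * (out_dim if dim_wise else 1)
        self.pi = nn.Linear(hidden_dim, pi_dim)
        self.mu = nn.Linear(hidden_dim, num_gaussians * out_dim)
        self.log_sigma = nn.Linear(hidden_dim, num_gaussians * out_dim)

    def forward(self, h):  # h: (batch, frames, hidden_dim)
        B, T, _ = h.shape
        if self.dim_wise:
            log_pi = torch.log_softmax(
                self.pi(h).view(B, T, self.G, self.out_dim), dim=2)
        else:
            log_pi = torch.log_softmax(self.pi(h), dim=-1)
        mu = self.mu(h).view(B, T, self.G, self.out_dim)
        sigma = self.log_sigma(h).view(B, T, self.G, self.out_dim).exp()
        return log_pi, mu, sigma

head = MDNHead(hidden_dim=32, out_dim=4, num_gaussians=8, dim_wise=True)
log_pi, mu, sigma = head(torch.randn(2, 10, 32))
```

Predicting `log_sigma` and exponentiating keeps the scales strictly positive without clamping.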

RMDN

class nnsvs.model.RMDN(in_dim, hidden_dim, out_dim, num_layers=1, bidirectional=True, dropout=0.0, num_gaussians=8, dim_wise=False, init_type='none')[source]

RNN-based mixture density networks (MDN)

Parameters:
  • in_dim (int) – the dimension of the input

  • hidden_dim (int) – the dimension of the hidden state

  • out_dim (int) – the dimension of the output

  • num_layers (int) – the number of layers

  • bidirectional (bool) – whether to use bidirectional LSTM

  • dropout (float) – dropout rate

  • num_gaussians (int) – the number of gaussians

  • dim_wise (bool) – whether to use dimension-wise or not

  • init_type (str) – the type of weight initialization

FFConvLSTM

class nnsvs.model.FFConvLSTM(in_dim, ff_hidden_dim=2048, conv_hidden_dim=1024, lstm_hidden_dim=256, out_dim=67, dropout=0.0, num_lstm_layers=2, bidirectional=True, init_type='none', use_mdn=False, dim_wise=True, num_gaussians=4, in_ph_start_idx: int = 1, in_ph_end_idx: int = 50, embed_dim=None, enforce_sorted=True)[source]

FFN + Conv1d + LSTM

A model proposed in Hono et al. [HHO+21] without residual F0 prediction.

Parameters:
  • in_dim (int) – the dimension of the input

  • ff_hidden_dim (int) – the dimension of the hidden state of the FFN

  • conv_hidden_dim (int) – the dimension of the hidden state of the conv1d

  • lstm_hidden_dim (int) – the dimension of the hidden state of the LSTM

  • out_dim (int) – the dimension of the output

  • dropout (float) – dropout rate

  • num_lstm_layers (int) – the number of layers of the LSTM

  • bidirectional (bool) – whether to use bidirectional LSTM

  • init_type (str) – the type of weight initialization

  • use_mdn (bool) – whether to use MDN or not

  • dim_wise (bool) – whether to use dimension-wise or not

  • num_gaussians (int) – the number of gaussians

  • in_ph_start_idx (int) – the start index of phoneme identity in a hed file

  • in_ph_end_idx (int) – the end index of phoneme identity in a hed file

  • embed_dim (int) – the dimension of the phoneme embedding

LSTMEncoder

class nnsvs.model.LSTMEncoder(in_dim: int, hidden_dim: int, out_dim: int, num_layers: int = 1, bidirectional: bool = True, dropout: float = 0.0, init_type: str = 'none', in_ph_start_idx: int = 1, in_ph_end_idx: int = 50, embed_dim=None, enforce_sorted=True)[source]

LSTM encoder

A simple LSTM-based encoder

Parameters:
  • in_dim (int) – the input dimension

  • hidden_dim (int) – the hidden dimension

  • out_dim (int) – the output dimension

  • num_layers (int) – the number of layers

  • bidirectional (bool) – whether to use bidirectional or not

  • dropout (float) – the dropout rate

  • init_type (str) – the initialization type

  • in_ph_start_idx (int) – the start index of phonetic context in a hed file

  • in_ph_end_idx (int) – the end index of phonetic context in a hed file

  • embed_dim (int) – the embedding dimension
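The `in_ph_start_idx` / `in_ph_end_idx` / `embed_dim` parameters concern the phoneme-identity block of the linguistic feature vector: when `embed_dim` is set, the one-hot phoneme columns can be replaced by a learned embedding. A hypothetical sketch of that preprocessing (all index values and dimensions here are invented for illustration):

```python
import torch
from torch import nn

# Illustrative: swap the one-hot phoneme-identity slice of a linguistic
# feature vector for a learned embedding. Indices/dims are example values.
in_ph_start_idx, in_ph_end_idx, embed_dim = 1, 50, 32
num_ph = in_ph_end_idx - in_ph_start_idx  # width of the one-hot block
embed = nn.Embedding(num_ph, embed_dim)

x = torch.zeros(2, 10, 60)          # (batch, frames, linguistic features)
x[..., in_ph_start_idx + 3] = 1.0   # pretend every frame is phoneme id 3

ph_onehot = x[..., in_ph_start_idx:in_ph_end_idx]  # (2, 10, 49)
ph_ids = ph_onehot.argmax(dim=-1)                  # (2, 10)
ph_emb = embed(ph_ids)                             # (2, 10, 32)
rest = torch.cat([x[..., :in_ph_start_idx], x[..., in_ph_end_idx:]], dim=-1)
x_new = torch.cat([rest, ph_emb], dim=-1)          # non-phoneme features + embedding
```

A dense embedding typically generalizes better than a wide one-hot block when the phoneme inventory is large.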

VariancePredictor

class nnsvs.model.VariancePredictor(in_dim, out_dim, num_layers=5, hidden_dim=256, kernel_size=5, dropout=0.5, init_type='none', use_mdn=False, num_gaussians=1, dim_wise=False, in_ph_start_idx: int = 1, in_ph_end_idx: int = 50, embed_dim=None, mask_indices=None)[source]

Variance predictor in Ren et al. [RHQ+21].

The model is composed of stacks of Conv1d + ReLU + LayerNorm layers. The model can be used for duration or pitch prediction.

Parameters:
  • in_dim (int) – the input dimension

  • out_dim (int) – the output dimension

  • num_layers (int) – the number of layers

  • hidden_dim (int) – the hidden dimension

  • kernel_size (int) – the kernel size

  • dropout (float) – the dropout rate

  • init_type (str) – the initialization type

  • use_mdn (bool) – whether to use MDN or not

  • num_gaussians (int) – the number of gaussians

  • dim_wise (bool) – whether to use dimension-wise MDN or not

  • in_ph_start_idx (int) – the start index of phoneme identity in a hed file

  • in_ph_end_idx (int) – the end index of phoneme identity in a hed file

  • embed_dim (int) – the dimension of the phoneme embedding

  • mask_indices (list) – the input feature indices to be masked. e.g., specify pitch_idx to mask pitch features.
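One layer of the Conv1d + ReLU + LayerNorm stack described above can be sketched like this (an illustrative sketch, not the nnsvs source; the class name is invented for this example):

```python
import torch
from torch import nn

# One VariancePredictor-style layer: Conv1d over time, ReLU, LayerNorm
# over the channel dimension, then Dropout.
class VarPredLayer(nn.Module):
    def __init__(self, in_dim, hidden_dim=256, kernel_size=5, dropout=0.5):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, hidden_dim, kernel_size,
                              padding=(kernel_size - 1) // 2)
        self.norm = nn.LayerNorm(hidden_dim)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):  # x: (batch, frames, in_dim)
        # Conv1d expects (batch, channels, frames), so transpose around it
        h = torch.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)
        return self.drop(self.norm(h))

layer = VarPredLayer(in_dim=331)
h = layer(torch.randn(2, 50, 331))
```

Note that LayerNorm is applied over the feature (channel) axis, which is why the tensor is transposed back to `(batch, frames, channels)` before normalization.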

TransformerEncoder

class nnsvs.model.TransformerEncoder(in_dim, out_dim, hidden_dim, attention_dim, num_heads=2, num_layers=2, kernel_size=3, dropout=0.1, reduction_factor=1, init_type='none', downsample_by_conv=False, in_ph_start_idx: int = 1, in_ph_end_idx: int = 50, embed_dim=None)[source]

Transformer encoder

Warning

So far this is not well tested and may be removed in the future.

Parameters:
  • in_dim (int) – the input dimension

  • out_dim (int) – the output dimension

  • hidden_dim (int) – the hidden dimension

  • attention_dim (int) – the attention dimension

  • num_heads (int) – the number of heads

  • num_layers (int) – the number of layers

  • kernel_size (int) – the kernel size

  • dropout (float) – the dropout rate

  • reduction_factor (int) – the reduction factor

  • init_type (str) – the initialization type

  • downsample_by_conv (bool) – whether to use convolutional downsampling or not

  • in_ph_start_idx (int) – the start index of phonetic context in a hed file

  • in_ph_end_idx (int) – the end index of phonetic context in a hed file

  • embed_dim (int) – the embedding dimension