nnsvs.model
Generic models that can be used as time-lag, duration, or acoustic models.
FFN
- class nnsvs.model.FFN(in_dim, hidden_dim, out_dim, num_layers=2, dropout=0.0, init_type='none', last_sigmoid=False)[source]
Feed-forward network
- Parameters:
in_dim (int) – the dimension of the input
hidden_dim (int) – the dimension of the hidden state
out_dim (int) – the dimension of the output
num_layers (int) – the number of layers
dropout (float) – dropout rate
init_type (str) – the type of weight initialization
last_sigmoid (bool) – whether to apply a sigmoid to the output
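The parameters above describe a plain layer stack. As a rough illustration only (pure Python, not the nnsvs implementation; the exact way nnsvs counts layers may differ), a forward pass through such a network applies linear + ReLU on hidden layers, a final linear projection, and an optional sigmoid when last_sigmoid is set:

```python
import math
import random

def ffn_forward(x, weights, last_sigmoid=False):
    """Sketch of a feed-forward pass: each hidden layer is a linear map
    followed by ReLU; the final projection is linear, optionally squashed
    by a sigmoid (the assumed meaning of last_sigmoid)."""
    h = x
    for i, (W, b) in enumerate(weights):
        # matrix-vector product: h_out[j] = sum_k W[j][k] * h[k] + b[j]
        h = [sum(W[j][k] * h[k] for k in range(len(h))) + b[j]
             for j in range(len(b))]
        if i < len(weights) - 1:          # ReLU on hidden layers only
            h = [max(0.0, v) for v in h]
    if last_sigmoid:
        h = [1.0 / (1.0 + math.exp(-v)) for v in h]
    return h

random.seed(0)
in_dim, hidden_dim, out_dim = 4, 8, 2

def rand_layer(n_out, n_in):
    # toy random weights; a real model would learn these
    return ([[random.uniform(-0.5, 0.5) for _ in range(n_in)]
             for _ in range(n_out)],
            [0.0] * n_out)

weights = [rand_layer(hidden_dim, in_dim),      # input -> hidden
           rand_layer(hidden_dim, hidden_dim),  # hidden -> hidden
           rand_layer(out_dim, hidden_dim)]     # hidden -> output
y = ffn_forward([0.1, -0.2, 0.3, 0.4], weights, last_sigmoid=True)
```

With last_sigmoid=True every output lands strictly between 0 and 1, which is why the flag is useful for probability-like targets.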
LSTMRNN
- class nnsvs.model.LSTMRNN(in_dim, hidden_dim, out_dim, num_layers=1, bidirectional=True, dropout=0.0, init_type='none')[source]
LSTM-based recurrent neural network
- Parameters:
in_dim (int) – the dimension of the input
hidden_dim (int) – the dimension of the hidden state
out_dim (int) – the dimension of the output
num_layers (int) – the number of layers
bidirectional (bool) – whether to use bidirectional LSTM
dropout (float) – dropout rate
init_type (str) – the type of weight initialization
LSTMRNNSAR
- class nnsvs.model.LSTMRNNSAR(in_dim, hidden_dim, out_dim, num_layers=1, bidirectional=True, dropout=0.0, stream_sizes=None, ar_orders=None, init_type='none')[source]
LSTM-RNN with shallow AR structure
- Parameters:
in_dim (int) – the dimension of the input
hidden_dim (int) – the dimension of the hidden state
out_dim (int) – the dimension of the output
num_layers (int) – the number of layers
bidirectional (bool) – whether to use bidirectional LSTM
dropout (float) – dropout rate
stream_sizes (list) – Stream sizes
ar_orders (list) – AR orders (filter orders) for each stream
init_type (str) – the type of weight initialization
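The shallow AR idea can be illustrated with a toy scalar filter. This is a pure-Python sketch of autoregressive output filtering, not the nnsvs implementation; coeffs stands in for one stream's learned filter, whose length corresponds to that stream's entry in ar_orders:

```python
def shallow_ar_filter(x, coeffs):
    """One stream: y[t] = x[t] + sum_k coeffs[k] * y[t-1-k],
    an AR filter of order len(coeffs) over the model's own past outputs."""
    y = []
    for t in range(len(x)):
        acc = x[t]
        for k, a in enumerate(coeffs):
            if t - 1 - k >= 0:
                acc += a * y[t - 1 - k]
        y.append(acc)
    return y

# an impulse input shows how past outputs leak into future frames
x_stream = [1.0, 0.0, 0.0, 0.0]
y_stream = shallow_ar_filter(x_stream, [0.5, 0.25])  # AR order 2
```

In the real model each stream in stream_sizes gets its own filter, so smoothly varying features (e.g., F0) can use a different AR order than noisier ones.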
Conv1dResnet
- class nnsvs.model.Conv1dResnet(in_dim, hidden_dim, out_dim, num_layers=4, init_type='none', use_mdn=False, num_gaussians=8, dim_wise=False, in_ph_start_idx: int = 1, in_ph_end_idx: int = 50, embed_dim=None, **kwargs)[source]
Conv1d + Resnet
The model is inspired by the MelGAN’s model architecture (Kumar et al. [KKdB+19]). An MDN layer is added if use_mdn is True.
- Parameters:
in_dim (int) – the dimension of the input
hidden_dim (int) – the dimension of the hidden state
out_dim (int) – the dimension of the output
num_layers (int) – the number of layers
init_type (str) – the type of weight initialization
use_mdn (bool) – whether to use MDN or not
num_gaussians (int) – the number of Gaussians in the MDN
dim_wise (bool) – whether to use a dimension-wise MDN or not
in_ph_start_idx (int) – the start index of phoneme identity in a hed file
in_ph_end_idx (int) – the end index of phoneme identity in a hed file
embed_dim (int) – the dimension of the phoneme embedding
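The in_ph_start_idx / in_ph_end_idx / embed_dim parameters suggest that the one-hot phoneme-identity block of the input features is swapped for a learned embedding. A minimal sketch of that idea (plain Python; table is a hypothetical embedding lookup, not part of nnsvs):

```python
def embed_phoneme_block(feat, start, end, table):
    """Replace the one-hot phoneme block feat[start:end] with an
    embedding looked up by its argmax, keeping the rest of the
    feature vector intact."""
    onehot = feat[start:end]
    ph_id = max(range(len(onehot)), key=lambda i: onehot[i])
    return feat[:start] + table[ph_id] + feat[end:]

embed_dim = 3
start, end = 1, 6                       # a 5-dim one-hot phoneme identity
# toy embedding table: one embed_dim-vector per phoneme id
table = [[0.1 * p, 0.2 * p, 0.3 * p] for p in range(5)]
feat = [9.0, 0, 0, 1, 0, 0, 7.0, 8.0]   # phoneme id = 2
out = embed_phoneme_block(feat, start, end, table)
```

The effective input dimension then becomes in_dim - (in_ph_end_idx - in_ph_start_idx) + embed_dim, which is usually much smaller than the raw one-hot block.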
Conv1dResnetMDN
Conv1dResnetSAR
- class nnsvs.model.Conv1dResnetSAR(in_dim, hidden_dim, out_dim, num_layers=4, stream_sizes=None, ar_orders=None, init_type='none', **kwargs)[source]
Conv1dResnet with shallow AR structure
- Parameters:
in_dim (int) – the dimension of the input
hidden_dim (int) – the dimension of the hidden state
out_dim (int) – the dimension of the output
num_layers (int) – the number of layers
stream_sizes (list) – Stream sizes
ar_orders (list) – AR orders (filter orders) for each stream
init_type (str) – the type of weight initialization
MDN
- class nnsvs.model.MDN(in_dim, hidden_dim, out_dim, num_layers=1, num_gaussians=8, dim_wise=False, init_type='none', **kwargs)[source]
Mixture density networks (MDN) with FFN
Warning
It is recommended to use MDNv2 instead, unless you want to fine-tune from an old checkpoint of MDN.
- Parameters:
in_dim (int) – the dimension of the input
hidden_dim (int) – the dimension of the hidden state
out_dim (int) – the dimension of the output
num_layers (int) – the number of layers
num_gaussians (int) – the number of gaussians
dim_wise (bool) – whether to use dimension-wise or not
init_type (str) – the type of weight initialization
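Whatever the network body, an MDN head parameterizes a Gaussian mixture over the targets and is trained by maximizing its log-likelihood. The following is a self-contained sketch of the mixture log-density for a scalar target (standard MDN math, not code from nnsvs):

```python
import math

def mdn_log_prob(y, pi, mu, sigma):
    """Log-likelihood of scalar y under a Gaussian mixture:
    log sum_g pi[g] * N(y; mu[g], sigma[g]^2)."""
    comps = [math.log(p) - math.log(s) - 0.5 * math.log(2 * math.pi)
             - 0.5 * ((y - m) / s) ** 2
             for p, m, s in zip(pi, mu, sigma)]
    mx = max(comps)  # log-sum-exp for numerical stability
    return mx + math.log(sum(math.exp(c - mx) for c in comps))

# one standard Gaussian evaluated at its mean: log(1 / sqrt(2*pi))
lp = mdn_log_prob(0.0, pi=[1.0], mu=[0.0], sigma=[1.0])
```

Splitting the same Gaussian into two identical components with weights 0.5 each leaves the density unchanged, a quick sanity check on the mixture math.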
MDNv2
- class nnsvs.model.MDNv2(in_dim, hidden_dim, out_dim, num_layers=1, dropout=0.5, num_gaussians=8, dim_wise=False, init_type='none')[source]
Mixture density networks (MDN) with FFN
MDN (v1) + Dropout
- Parameters:
in_dim (int) – the dimension of the input
hidden_dim (int) – the dimension of the hidden state
out_dim (int) – the dimension of the output
num_layers (int) – the number of layers
dropout (float) – dropout rate
num_gaussians (int) – the number of gaussians
dim_wise (bool) – whether to use dimension-wise or not
init_type (str) – the type of weight initialization
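One plausible way to see what dim_wise changes is the size of the MDN output layer. The arithmetic below is an assumption about the parameterization (shared mixture weights across output dimensions vs. an independent mixture per output dimension), not taken from the nnsvs source:

```python
def mdn_output_size(out_dim, num_gaussians, dim_wise):
    """Assumed output-layer width of an MDN head."""
    if dim_wise:
        # per output dimension: G weights + G means + G scales
        return 3 * num_gaussians * out_dim
    # shared mixture weights: G weights + G*out_dim means + G*out_dim scales
    return num_gaussians * (2 * out_dim + 1)

shared = mdn_output_size(out_dim=4, num_gaussians=8, dim_wise=False)
per_dim = mdn_output_size(out_dim=4, num_gaussians=8, dim_wise=True)
```

Under this reading, dim_wise trades a modest increase in parameters for the ability to model each output dimension with its own mixture weights.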
RMDN
- class nnsvs.model.RMDN(in_dim, hidden_dim, out_dim, num_layers=1, bidirectional=True, dropout=0.0, num_gaussians=8, dim_wise=False, init_type='none')[source]
RNN-based mixture density networks (MDN)
- Parameters:
in_dim (int) – the dimension of the input
hidden_dim (int) – the dimension of the hidden state
out_dim (int) – the dimension of the output
num_layers (int) – the number of layers
bidirectional (bool) – whether to use bidirectional LSTM
dropout (float) – dropout rate
num_gaussians (int) – the number of gaussians
dim_wise (bool) – whether to use dimension-wise or not
init_type (str) – the type of weight initialization
FFConvLSTM
- class nnsvs.model.FFConvLSTM(in_dim, ff_hidden_dim=2048, conv_hidden_dim=1024, lstm_hidden_dim=256, out_dim=67, dropout=0.0, num_lstm_layers=2, bidirectional=True, init_type='none', use_mdn=False, dim_wise=True, num_gaussians=4, in_ph_start_idx: int = 1, in_ph_end_idx: int = 50, embed_dim=None, enforce_sorted=True)[source]
FFN + Conv1d + LSTM
A model proposed in Hono et al. [HHO+21] without residual F0 prediction.
- Parameters:
in_dim (int) – the dimension of the input
ff_hidden_dim (int) – the dimension of the hidden state of the FFN
conv_hidden_dim (int) – the dimension of the hidden state of the conv1d
lstm_hidden_dim (int) – the dimension of the hidden state of the LSTM
out_dim (int) – the dimension of the output
dropout (float) – dropout rate
num_lstm_layers (int) – the number of layers of the LSTM
bidirectional (bool) – whether to use bidirectional LSTM
init_type (str) – the type of weight initialization
use_mdn (bool) – whether to use MDN or not
dim_wise (bool) – whether to use dimension-wise or not
num_gaussians (int) – the number of gaussians
in_ph_start_idx (int) – the start index of phoneme identity in a hed file
in_ph_end_idx (int) – the end index of phoneme identity in a hed file
embed_dim (int) – the dimension of the phoneme embedding
LSTMEncoder
- class nnsvs.model.LSTMEncoder(in_dim: int, hidden_dim: int, out_dim: int, num_layers: int = 1, bidirectional: bool = True, dropout: float = 0.0, init_type: str = 'none', in_ph_start_idx: int = 1, in_ph_end_idx: int = 50, embed_dim=None, enforce_sorted=True)[source]
LSTM encoder
A simple LSTM-based encoder
- Parameters:
in_dim (int) – the input dimension
hidden_dim (int) – the hidden dimension
out_dim (int) – the output dimension
num_layers (int) – the number of layers
bidirectional (bool) – whether to use bidirectional or not
dropout (float) – the dropout rate
init_type (str) – the initialization type
in_ph_start_idx (int) – the start index of phonetic context in a hed file
in_ph_end_idx (int) – the end index of phonetic context in a hed file
embed_dim (int) – the embedding dimension
VariancePredictor
- class nnsvs.model.VariancePredictor(in_dim, out_dim, num_layers=5, hidden_dim=256, kernel_size=5, dropout=0.5, init_type='none', use_mdn=False, num_gaussians=1, dim_wise=False, in_ph_start_idx: int = 1, in_ph_end_idx: int = 50, embed_dim=None, mask_indices=None)[source]
Variance predictor in Ren et al. [RHQ+21].
The model is composed of stacks of Conv1d + ReLU + LayerNorm layers. The model can be used for duration or pitch prediction.
- Parameters:
in_dim (int) – the input dimension
out_dim (int) – the output dimension
num_layers (int) – the number of layers
hidden_dim (int) – the hidden dimension
kernel_size (int) – the kernel size
dropout (float) – the dropout rate
init_type (str) – the initialization type
use_mdn (bool) – whether to use MDN or not
num_gaussians (int) – the number of Gaussians in the MDN
dim_wise (bool) – whether to use a dimension-wise MDN or not
in_ph_start_idx (int) – the start index of phoneme identity in a hed file
in_ph_end_idx (int) – the end index of phoneme identity in a hed file
embed_dim (int) – the dimension of the phoneme embedding
mask_indices (list) – the input feature indices to be masked; e.g., specify pitch_idx to mask pitch features
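A plausible reading of mask_indices is that the listed input features are zeroed out before prediction (e.g., hiding pitch features from a pitch predictor). A toy sketch of that assumed behavior:

```python
def mask_features(feat, mask_indices):
    """Zero out the masked input features before prediction,
    the assumed behavior of mask_indices."""
    masked = list(feat)
    for idx in mask_indices:
        masked[idx] = 0.0
    return masked

pitch_idx = [2, 3]  # hypothetical positions of pitch features in the input
out = mask_features([1.0, 2.0, 3.0, 4.0, 5.0], pitch_idx)
```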
TransformerEncoder
- class nnsvs.model.TransformerEncoder(in_dim, out_dim, hidden_dim, attention_dim, num_heads=2, num_layers=2, kernel_size=3, dropout=0.1, reduction_factor=1, init_type='none', downsample_by_conv=False, in_ph_start_idx: int = 1, in_ph_end_idx: int = 50, embed_dim=None)[source]
Transformer encoder
Warning
This is not yet well tested and may be removed in the future.
- Parameters:
in_dim (int) – the input dimension
out_dim (int) – the output dimension
hidden_dim (int) – the hidden dimension
attention_dim (int) – the attention dimension
num_heads (int) – the number of heads
num_layers (int) – the number of layers
kernel_size (int) – the kernel size
dropout (float) – the dropout rate
reduction_factor (int) – the reduction factor
init_type (str) – the initialization type
downsample_by_conv (bool) – whether to use convolutional downsampling or not
in_ph_start_idx (int) – the start index of phonetic context in a hed file
in_ph_end_idx (int) – the end index of phonetic context in a hed file
embed_dim (int) – the embedding dimension
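reduction_factor presumably shortens the time axis by that factor, with downsample_by_conv choosing a convolutional downsampler over a simpler one. A toy sketch of the simple variant (an assumption, not the nnsvs implementation):

```python
def downsample(frames, reduction_factor):
    """Shorten the time axis by keeping every r-th frame, a simple
    stand-in for the conv-based variant (downsample_by_conv=True)."""
    return frames[::reduction_factor]

frames = [[float(t)] for t in range(8)]   # 8 frames, 1 feature each
out = downsample(frames, 2)               # 4 frames remain
```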