Getting started with recipes
This page describes how to use a recipe to create a singing voice synthesis (SVS) system.
What is a recipe?
A recipe is a set of scripts and configurations to create SVS systems. A recipe describes all the necessary steps, including data preprocessing, training, and synthesis.
Recipes have been adopted for reproducibility in several research projects such as Kaldi and ESPnet. NNSVS follows a similar approach [1].
Structure of a recipe
All the recipes are stored in the recipes directory. Recipes are usually organized per database; e.g., recipes/nit-song070 contains recipes for the nit-song070 database.
There are three important parts for every recipe:
run.sh
: The entry point of the recipe.
config.yaml
: A YAML-based config file for recipe-specific configurations.
conf
: A directory that contains detailed configurations for each model. YAML-based configuration files for the time-lag/duration/acoustic models are stored in this directory.
An example of the conf directory is shown below. You can find the model-specific configurations there.
conf
├── train
│ ├── duration
│ │ ├── data
│ │ │ └── myconfig.yaml
│ │ ├── model
│ │ │ └── duration_mdn.yaml
│ │ └── train
│ │ └── myconfig.yaml
│ └── timelag
│ ├── data
│ │ └── myconfig.yaml
│ ├── model
│ │ └── timelag_mdn.yaml
│ └── train
│ └── myconfig.yaml
└── train_acoustic
└── acoustic
├── data
│ └── myconfig.yaml
├── model
│ └── acoustic_resf0convlstm.yaml
└── train
└── myconfig.yaml
Note
The rest of this page is based on recipes/conf/spsvs/run_common_steps_stable.sh.
How to run a recipe
The basic workflow is to run run.sh from the command line as follows:
CUDA_VISIBLE_DEVICES=0 ./run.sh --stage 0 --stop-stage 99
CUDA_VISIBLE_DEVICES=0 specifies that the script uses the 0-th GPU. If you have only one GPU, you can omit CUDA_VISIBLE_DEVICES.
The --stage 0 --stop-stage 99 arguments tell the script to run from stage 0 to stage 99.
Stage -1 is reserved for an optional data download step. Some databases require you to sign a contract in advance; in that case, you need to download the database manually.
CUDA_VISIBLE_DEVICES=0 ./run.sh --stage -1 --stop-stage -1
To understand what’s going on when running recipes, it is strongly recommended to run the recipe step-by-step.
Note
If you are new to recipes, please start with the test recipe nnsvs/recipes/nit-song070/dev-test.
Note that every item in config.yaml
can be customized by the command-line interface.
For example, the following command
CUDA_VISIBLE_DEVICES=0 ./run.sh --stage 0 --stop-stage 99 \
--timelag_model timelag_mdn
is equivalent to manually changing the config.yaml
for the nnsvs/recipes/nit-song070/dev-test
recipe as follows:
-timelag_model: timelag_test
+timelag_model: timelag_mdn
Recipes can be configured arbitrarily depending on your purpose, but the following are common steps shared by most recipes.
Stage 0: Data preparation
For most recipes, stage 0 does the following three things:
Convert MusicXML or UST files to HTS-style full-context labels (a sketch of this conversion is shown below).
Segment singing data into small segments.
Split the data into train/dev/test sets.
The second step is optional but helps avoid GPU out-of-memory errors.
CUDA_VISIBLE_DEVICES=0 ./run.sh --stage 0 --stop-stage 0
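Under the hood, the first step (converting MusicXML to HTS-style full-context labels) can be sketched in Python as below. This is not part of run.sh; it assumes pysinsy is used for MusicXML inputs (UST inputs go through a different converter), and the input path is a placeholder.
import pysinsy
from nnmnkwii.io import hts

# Convert a MusicXML file to HTS-style full-context labels with pysinsy.
# "path/to/song.xml" is a placeholder.
contexts = pysinsy.extract_fullcontext("path/to/song.xml")
labels = hts.HTSLabelFile.create_from_contexts(contexts)
print(len(labels), "full-context labels")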
Stage 1: Feature generation
This step performs all the feature extraction steps needed to train time-lag/duration/acoustic models. HTS-style full-context label files and wav files are processed together to prepare inputs/outputs for neural networks.
Note that errors will occur if your wav files and label files are not aligned correctly.
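If you suspect misalignment, a quick sanity check is to compare the wav duration with the end time of the corresponding label file. The sketch below is not part of the recipe; the paths are placeholders, and it assumes HTS label times in units of 100 ns.
import soundfile as sf
from nnmnkwii.io import hts

wav_path = "path/to/utt.wav"  # placeholder
lab_path = "path/to/utt.lab"  # placeholder

wav_duration = sf.info(wav_path).duration     # in seconds
labels = hts.load(lab_path)
label_duration = labels.end_times[-1] * 1e-7  # HTS label times are in 100-ns units

print(f"wav: {wav_duration:.2f} s, label: {label_duration:.2f} s")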
CUDA_VISIBLE_DEVICES=0 ./run.sh --stage 1 --stop-stage 1
After running this step, you can find the extracted features in the dump directory.
$ tree -L 4 dump/
dump/
└── yoko
├── norm
│ ├── dev
│ │ ├── in_acoustic
│ │ ├── in_duration
│ │ ├── in_timelag
│ │ ├── out_acoustic
│ │ ├── out_duration
│ │ └── out_timelag
│ ├── eval
│ │ ├── in_acoustic
│ │ ├── in_duration
│ │ ├── in_timelag
│ │ ├── out_acoustic
│ │ ├── out_duration
│ │ └── out_timelag
│ ├── in_acoustic_scaler.joblib
│ ├── in_duration_scaler.joblib
│ ├── in_timelag_scaler.joblib
│ ├── out_acoustic_scaler.joblib
│ ├── out_duration_scaler.joblib
│ ├── out_timelag_scaler.joblib
│ └── train_no_dev
│ ├── in_acoustic
│ ├── in_duration
│ ├── in_timelag
│ ├── out_acoustic
│ ├── out_duration
│ └── out_timelag
└── org
...
Some notes:
- The norm and org directories contain normalized and unnormalized features, respectively. Normalized features are used for training the neural networks.
- The *_scaler.joblib files are used to normalize/de-normalize features and contain statistics of the training data (e.g., mean and variance). The file format follows joblib (see the sketch below).
- The in_* and out_* directories contain input and output features, respectively.
All the features are saved in NumPy format. You can inspect them with a simple Python script like:
import numpy as np
feats = np.load("path/to/your/features.npy")
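The *_scaler.joblib files can be used in a similar way. Here is a minimal sketch, assuming the scalers are scikit-learn objects saved with joblib; the utterance file name is a placeholder.
import joblib
import numpy as np

# Load a scaler and de-normalize a normalized feature file.
scaler = joblib.load("dump/yoko/norm/out_acoustic_scaler.joblib")
feats = np.load("dump/yoko/norm/train_no_dev/out_acoustic/utt001-feats.npy")  # placeholder file name
denorm_feats = scaler.inverse_transform(feats)
print(feats.shape, denorm_feats.shape)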
Stage 2: Train time-lag model
Once the feature generation is completed, you are ready to train neural networks.
You can train a time-lag model by:
CUDA_VISIBLE_DEVICES=0 ./run.sh --stage 2 --stop-stage 2
Or, you may want to explicitly specify a model by:
CUDA_VISIBLE_DEVICES=0 ./run.sh --stage 2 --stop-stage 2 \
--timelag-model timelag_test
You can find the available model configs in the conf/train/timelag/model directory, or you can create your own model config.
After training is finished, you can find the model checkpoints in the exp directory.
exp/yoko/timelag_test/
├── best_loss.pth
├── config.yaml
├── epoch0002.pth
├── latest.pth
└── model.yaml
Some notes:
- *.pth files are the model checkpoints where the parameters of the neural networks are stored.
- *.yaml files are configuration files.
- model.yaml is a model-specific config. This file can be used to instantiate a model with Hydra (see the sketch below).
- config.yaml contains all the training details.
- best_loss.pth is the checkpoint with the best development loss.
- latest.pth is the latest checkpoint.
- epoch*.pth are intermediate checkpoints at specific epochs.
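For example, a trained model can be re-instantiated from model.yaml and a checkpoint roughly as in the following sketch. The keys netG and state_dict are assumptions that may differ between NNSVS versions; treat this as an illustration, not a guaranteed API.
import torch
from hydra.utils import instantiate
from omegaconf import OmegaConf

# Instantiate the network from the model-specific config with Hydra.
model_config = OmegaConf.load("exp/yoko/timelag_test/model.yaml")
model = instantiate(model_config.netG)  # assumes the network definition lives under "netG"

# Load trained parameters from a checkpoint.
checkpoint = torch.load("exp/yoko/timelag_test/best_loss.pth", map_location="cpu")
model.load_state_dict(checkpoint["state_dict"])  # assumes parameters are stored under "state_dict"
model.eval()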
Stage 3: Train duration model
Similarly, you can train a duration model by:
CUDA_VISIBLE_DEVICES=0 ./run.sh --stage 3 --stop-stage 3
You can explicitly specify a model type by:
CUDA_VISIBLE_DEVICES=0 ./run.sh --stage 3 --stop-stage 3 \
--duration-model duration_test
You can find the available model configs in the conf/train/duration/model directory.
After training is finished, you can find the model checkpoints in the exp directory.
exp/yoko/duration_test/
├── best_loss.pth
├── config.yaml
├── epoch0002.pth
├── latest.pth
└── model.yaml
Stage 4: Train acoustic model
The acoustic model is the most important part of the SVS system. You are likely to run this step multiple times until you get a good model. You can train an acoustic model by:
CUDA_VISIBLE_DEVICES=0 ./run.sh --stage 4 --stop-stage 4
You can explicitly specify a model type by:
CUDA_VISIBLE_DEVICES=0 ./run.sh --stage 4 --stop-stage 4 \
--acoustic-model acoustic_test
You can find the available model configs in the conf/train_acoustic/acoustic/model directory, or you can create your own model config.
Note
Training acoustic models takes several hours to a full day depending on the training configuration. During training, it is useful to monitor progress with TensorBoard. See Tips for more details.
After training is finished, you can find the model checkpoints in the exp directory.
exp/yoko/acoustic_test/
├── best_loss.pth
├── config.yaml
├── epoch0002.pth
├── latest.pth
└── model.yaml
Stage 5: Generate features
Once you have trained all the models, you can generate features with them. You can skip this step if you just want to listen to audio samples rather than inspect intermediate features.
If you used custom model types in the training steps, you must specify the same models in this step.
CUDA_VISIBLE_DEVICES=0 ./run.sh --stage 5 --stop-stage 5 \
--timelag-model timelag_test \
--duration-model duration_test \
--acoustic-model acoustic_test
Stage 6: Synthesize waveforms
Stage 6 generates waveforms using the trained models. If you trained custom models, make sure to specify the same model types in this step as well.
CUDA_VISIBLE_DEVICES=0 ./run.sh --stage 6 --stop-stage 6 \
--timelag-model timelag_test \
--duration-model duration_test \
--acoustic-model acoustic_test
You can find the generated wav files in the exp/${speaker name}/synthesis_* directory.
Packing models
As explained in the Overview of NNSVS’s SVS, NNSVS’s SVS system is composed of multiple modules. NNSVS provides functionality to pack the multiple models into a single directory, which can then be shared/used easily.
Recipes reserve a special stage 99 for packaging models.
CUDA_VISIBLE_DEVICES=0 ./run.sh --stage 99 --stop-stage 99
Note that you must specify the model types if you use custom models, e.g.:
CUDA_VISIBLE_DEVICES=0 ./run.sh --stage 99 --stop-stage 99 \
--timelag-model timelag_test \
--duration-model duration_test \
--acoustic-model acoustic_test
After running the command above, you can find a packed model in the packed_model
directory.
A packed model directory will have the following files. Note that *postfilter_*
and *vocoder_*
files are optional.
$ ls -1
acoustic_model.pth
acoustic_model.yaml
config.yaml
duration_model.pth
duration_model.yaml
in_acoustic_scaler_min.npy
in_acoustic_scaler_scale.npy
in_duration_scaler_min.npy
in_duration_scaler_scale.npy
in_timelag_scaler_min.npy
in_timelag_scaler_scale.npy
in_vocoder_scaler_mean.npy
in_vocoder_scaler_scale.npy
in_vocoder_scaler_var.npy
out_acoustic_scaler_mean.npy
out_acoustic_scaler_scale.npy
out_acoustic_scaler_var.npy
out_duration_scaler_mean.npy
out_duration_scaler_scale.npy
out_duration_scaler_var.npy
out_postfilter_scaler_mean.npy
out_postfilter_scaler_scale.npy
out_postfilter_scaler_var.npy
out_timelag_scaler_mean.npy
out_timelag_scaler_scale.npy
out_timelag_scaler_var.npy
postfilter_model.pth
postfilter_model.yaml
qst.hed
timelag_model.pth
timelag_model.yaml
vocoder_model.pth
vocoder_model.yaml
Some notes:
- *.pth files contain the parameters of the neural networks.
- *_model.yaml files contain the definitions of the neural networks, such as the name of the PyTorch model (e.g., nnsvs.model.MDN), the number of layers, the number of hidden units, etc.
- *.npy files contain the parameters of scikit-learn scalers that are used to normalize/de-normalize features (see the sketch below).
- qst.hed is the HED file used for training the models.
- config.yaml is the global config file. It specifies, for example, the sampling rate.
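Judging from the file names, the in_* scalers follow scikit-learn's MinMaxScaler convention (min/scale) and the out_* scalers follow StandardScaler (mean/scale/var). The sketch below shows how such arrays could be applied by hand; this is not an official NNSVS API, so verify the convention against your NNSVS version before relying on it.
import numpy as np

# Normalize dummy input features with the MinMaxScaler-style arrays.
in_scale = np.load("packed_model/in_timelag_scaler_scale.npy")
in_min = np.load("packed_model/in_timelag_scaler_min.npy")
x = np.random.rand(10, in_scale.shape[-1])  # dummy features with a matching dimension
x_norm = x * in_scale + in_min              # MinMaxScaler.transform convention

# De-normalize dummy network outputs with the StandardScaler-style arrays.
out_mean = np.load("packed_model/out_timelag_scaler_mean.npy")
out_scale = np.load("packed_model/out_timelag_scaler_scale.npy")
y_norm = np.random.randn(10, out_scale.shape[-1])
y = y_norm * out_scale + out_mean           # StandardScaler.inverse_transform convention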
Once the packaging step is done, you can use the packed model with the nnsvs.svs module. An example of using packed models can be found in the NNSVS demos.
With a packed model, you can easily generate singing voices from MusicXML or UST files.
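A minimal sketch of that workflow is shown below. It follows the API used in the NNSVS demos (SPSVS, with pysinsy for MusicXML parsing); method names and return values may differ between NNSVS versions, and the MusicXML path is a placeholder.
import pysinsy
import soundfile as sf
from nnmnkwii.io import hts
from nnsvs.svs import SPSVS

# Load the packed model directory created by stage 99.
engine = SPSVS("packed_model")

# Convert a MusicXML file to HTS full-context labels, then synthesize.
contexts = pysinsy.extract_fullcontext("path/to/song.xml")  # placeholder path
labels = hts.HTSLabelFile.create_from_contexts(contexts)
wav, sr = engine.svs(labels)

sf.write("out.wav", wav, sr)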
Customizing recipes
Besides running existing recipes, you may want to make your own, e.g., to add custom models, customize steps, or use your own data.
If you want to make your own recipe, the easiest way is to copy an existing recipe and modify it accordingly. Please check one of the recipes in the NNSVS repository and start by modifying parts of it.