How to train post-filters
Please check Getting started with recipes and Overview of NNSVS’s SVS first.
NNSVS v0.0.3 and later support an optional trainable post-filter to enhance the acoustic model’s predictions. This page summarizes how to train post-filters.
Warning
As of 2022/10/15, I concluded that the GV post-filter works better than trainable post-filters in most cases. Please consider using the GV post-filter instead.
Note
The contents of this page are based on recipes/conf/spsvs/run_common_steps_dev.sh.
Also, before making your custom recipes, it is recommended to start with the test recipe recipes/nit-song070/dev-test.
Pre-requisites
Input/output of a post-filter
Input and output of a post-filter are as follows:
Input: acoustic features predicted by an acoustic model
Output: enhanced acoustic features
Note that post-filters do not use delta and delta-delta features. If your acoustic model’s output contains delta and delta-delta features, the parameter generation algorithm (a.k.a. MLPG) is performed to prepare input/output features for post-filters.
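As a toy illustration (the dimensions here are hypothetical, and NNSVS actually recovers static features with MLPG rather than plain slicing), the relation between the stacked feature streams and the static-sized features a post-filter sees looks like this:

```python
import numpy as np

# Hypothetical acoustic feature matrix: 100 frames with a 60-dim static
# part, plus delta and delta-delta appended (180 dims in total).
T, static_dim = 100, 60
features = np.random.randn(T, 3 * static_dim)  # static | delta | delta-delta

# The post-filter only ever sees static-sized features; the dynamic
# streams are consumed by MLPG and never reach the post-filter.
static_part = features[:, :static_dim]
```

In the real pipeline, MLPG combines all three streams into a smooth static trajectory; the slice above only shows the stream layout.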
You must train an acoustic model first
You must train an acoustic model first, since the input of a post-filter depends on the output of an acoustic model. Furthermore, be aware that you need to re-train a post-filter whenever you re-train your acoustic model. Therefore, it is highly recommended to train a good acoustic model before training a post-filter.
Train a good acoustic model first
The post-filter is trained on the features predicted by the acoustic model. If the acoustic model’s predictions are not accurate enough, the post-filter is likely to perform poorly.
In addition to the steps described in Getting started with recipes, the following are the steps related to post-filters.
Stage 7: Prepare features for post-filter
Once your acoustic model is ready, you can run the stage 7 to prepare input and output features for training post-filters.
CUDA_VISIBLE_DEVICES=0 ./run.sh --stage 7 --stop-stage 7 \
--acoustic-model acoustic_test
After running the above command, you can find the input features for post-filters in the acoustic model’s checkpoint directory:
$ tree -L 3 exp/yoko/acoustic_test/
exp/yoko/acoustic_test/
├── best_loss.pth
├── config.yaml
├── epoch0002.pth
├── latest.pth
├── model.yaml
├── norm
│ ├── dev
│ │ └── in_postfilter
│ ├── eval
│ │ └── in_postfilter
│ ├── in_postfilter_scaler.joblib
│ └── train_no_dev
│ └── in_postfilter
└── predicted
└── eval
└── latest
Some notes:
- The norm/*/in_postfilter directories contain the input features for post-filters.
- norm/in_postfilter_scaler.joblib contains the scaler used to normalize/de-normalize the input features for post-filters.
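If you want to inspect these features yourself, the .joblib files can be loaded with joblib. A minimal sketch with a stand-in scaler, assuming the scaler is a scikit-learn scaler (the real file would be exp/yoko/acoustic_test/norm/in_postfilter_scaler.joblib; the local file name here is just for illustration):

```python
import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Build and save a stand-in scaler in place of the real
# in_postfilter_scaler.joblib produced by stage 7.
scaler = StandardScaler().fit(np.random.randn(200, 60))
joblib.dump(scaler, "in_postfilter_scaler.joblib")

# At synthesis time, features are normalized before the post-filter and
# de-normalized afterwards with the same scaler.
scaler = joblib.load("in_postfilter_scaler.joblib")
feats = np.random.randn(10, 60)
normalized = scaler.transform(feats)
restored = scaler.inverse_transform(normalized)
```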
As for the output features, you can find them in the dump directory.
$ tree -L 4 dump/
dump/
└── yoko
├── norm
│ ├── dev
│ │ ├── in_acoustic
│ │ ├── in_duration
│ │ ├── in_timelag
│ │ ├── in_vocoder
│ │ ├── out_acoustic
│ │ ├── out_duration
│ │ ├── out_postfilter
│ │ └── out_timelag
│ ├── eval
│ │ ├── in_acoustic
│ │ ├── in_duration
│ │ ├── in_timelag
│ │ ├── in_vocoder
│ │ ├── out_acoustic
│ │ ├── out_duration
│ │ ├── out_postfilter
│ │ └── out_timelag
│ ├── in_acoustic_scaler.joblib
│ ├── in_duration_scaler.joblib
│ ├── in_timelag_scaler.joblib
│ ├── in_vocoder_scaler_mean.npy
│ ├── in_vocoder_scaler_scale.npy
│ ├── in_vocoder_scaler_var.npy
│ ├── out_acoustic_scaler.joblib
│ ├── out_duration_scaler.joblib
│ ├── out_postfilter_scaler.joblib
│ ├── out_timelag_scaler.joblib
│ └── train_no_dev
│ ├── in_acoustic
│ ├── in_duration
│ ├── in_timelag
│ ├── in_vocoder
│ ├── out_acoustic
│ ├── out_duration
│ ├── out_postfilter
│ └── out_timelag
└── org
Some notes:
- The dump/*/norm/*/out_postfilter directories contain the output features for post-filters. Again, remember that these features don’t contain delta and delta-delta features.
- dump/*/norm/out_postfilter_scaler.joblib contains the scaler used to normalize/de-normalize the output features for post-filters.
Stage 8: Train post-filters
Once you have generated the input/output features, you are ready to train post-filters. The current NNSVS post-filter is based on generative adversarial networks (GANs), so you need to train a generator and a discriminator together.
There are a number of different ways to train post-filters with NNSVS. However, the following is the recommended way to get the best performance (based on r9y9’s experience):
1. Train a post-filter only for mgc
2. Train a post-filter only for bap
3. Merge the two post-filters into one post-filter
Pre-tuned config files are stored in recipes/_common/jp_dev_latest/conf/train_postfilter.
Train post-filter for mgc
To train a post-filter for mgc, you can run the following command:
CUDA_VISIBLE_DEVICES=0 ./run.sh --stage 8 --stop-stage 8 \
--acoustic-model acoustic_test \
--postfilter-model postfilter_mgc_test \
--postfilter-train mgc
Note that you must specify --postfilter-train mgc. This tells the training script to use only the mgc feature stream. Other streams such as lf0 and bap are ignored.
Warning
Training a post-filter for mgc currently requires a larger amount of GPU VRAM than normal acoustic model training. If you run out of memory, try using a smaller batch size.
Once the training is finished, you can find model checkpoints in the exp
directory:
$ tree exp/yoko/postfilter_mgc_test
exp/yoko/postfilter_mgc_test
├── best_loss.pth
├── best_loss_D.pth
├── config.yaml
├── epoch0002.pth
├── epoch0002_D.pth
├── latest.pth
├── latest_D.pth
└── model.yaml
Some notes:
- *_D.pth is the model checkpoint for the discriminator (D stands for discriminator).
- model.yaml includes configs for both the generator and the discriminator.
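If you want to inspect a checkpoint programmatically, it can be loaded with torch.load. A minimal sketch using a stand-in file (the layout of the stored dict here is an assumption for illustration; the real checkpoints live under exp/yoko/postfilter_mgc_test/):

```python
import torch

# Create a stand-in checkpoint: a dict holding a state_dict, a common
# layout for PyTorch training checkpoints. The real file would be
# exp/yoko/postfilter_mgc_test/latest.pth.
dummy = {"state_dict": {"weight": torch.zeros(3, 3)}}
torch.save(dummy, "latest.pth")

# Load on CPU and list the stored top-level keys.
checkpoint = torch.load("latest.pth", map_location="cpu")
print(sorted(checkpoint.keys()))
```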
Train post-filter for bap
CUDA_VISIBLE_DEVICES=0 ./run.sh --stage 8 --stop-stage 8 \
--acoustic-model acoustic_test \
    --postfilter-model postfilter_bap_test \
--postfilter-train bap
Note that you must specify --postfilter-train bap. This tells the training script to use only the bap feature stream.
Merge the two post-filters
This step is not included in the recipe, so you need to run the following command manually to merge the two post-filters:
python ../../../utils/merge_postfilters.py exp/yoko/postfilter_mgc_test/latest.pth \
exp/yoko/postfilter_bap_test/latest.pth \
exp/yoko/postfilter_merged
Then, you can see the merged post-filter in the exp/yoko/postfilter_merged directory.
$ tree exp/yoko/postfilter_merged/
exp/yoko/postfilter_merged/
├── latest.pth
└── model.yaml
Packing models with post-filter
As in Getting started with recipes, you can pack the models into a single directory by running stage 99. Please make sure to specify the merged post-filter:
CUDA_VISIBLE_DEVICES=0 ./run.sh --stage 99 --stop-stage 99 \
--timelag-model timelag_test \
--duration-model duration_test \
--acoustic-model acoustic_test \
--postfilter-model postfilter_merged
The above command should make a packed model directory with your trained post-filter.
How to use the packed model with the trained post-filter?
Please specify post_filter_type="nnsvs" when using the nnsvs.svs module. An example:
import pysinsy
from nnmnkwii.io import hts
from nnsvs.svs import SPSVS
from nnsvs.util import example_xml_file
model_dir = "/path/to/your/packed/model_dir"
engine = SPSVS(model_dir)
contexts = pysinsy.extract_fullcontext(example_xml_file(key="get_over"))
labels = hts.HTSLabelFile.create_from_contexts(contexts)
wav, sr = engine.svs(labels, post_filter_type="nnsvs")
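To listen to the result, write the returned waveform to a file. A minimal sketch using scipy, with stand-in wav/sr values in place of the engine.svs output above:

```python
import numpy as np
from scipy.io import wavfile

# Stand-ins for the (wav, sr) returned by engine.svs(...) above:
# one second of a 440 Hz tone as 16-bit PCM.
sr = 48000
wav = (0.3 * 32767 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr)).astype(np.int16)

# Write the synthesized waveform to a WAV file.
wavfile.write("synth.wav", sr, wav)
```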
Tips for training post-filters
If you look into the post-filter configs, you will find many parameters. Here are some tips if you want to tune them yourself:
Train configs
- fm_weight: The weight of the feature matching loss. By increasing the weight, you may get more stable results with a possible loss of naturalness. If you set fm_weight to zero, training will become unstable.
- adv_weight: The weight of the adversarial loss. By increasing the weight, you may get better naturalness.
- mse_weight: The weight of the MSE loss. If you set a non-zero value, you will get smoother output features.
Model configs
- smoothed_width: The width of the smoothing window. If you set a non-zero value, you will get smoother outputs, which is useful to reduce audible artifacts. Only used for inference.
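As a rough intuition for what smoothed_width does, here is a toy moving-average smoother (an illustration only, not NNSVS’s actual implementation):

```python
import numpy as np

def smooth(x, width):
    """Moving-average each feature dimension of x (frames x dims) over time."""
    if width <= 1:
        return x
    kernel = np.ones(width) / width
    # Convolve each column (one feature dimension) along the time axis.
    return np.apply_along_axis(
        lambda col: np.convolve(col, kernel, mode="same"), 0, x
    )

# Noisy toy features: a wider window reduces frame-to-frame jitter,
# analogous to how smoothed_width reduces audible artifacts.
features = np.random.randn(100, 4)
smoothed = smooth(features, 5)
```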
Details of post-filter implementation
You don’t need to understand the details if you just want to try post-filters, but please look into Kaneko et al. [KTKY17b] and Kaneko et al. [KTKY17a] if you are interested.