How to train post-filters
NNSVS v0.0.3 and later supports an optional trainable post-filter to enhance the acoustic model’s prediction. This page summarizes how to train post-filters.
As of 2022/10/15, I concluded that GV-post filter works better than trainalbe post-filters in most cases. Please consider using GV-post filter insteaad.
The contents in this page is based on
Also, before you make your custom recipes, it is recommenced to start with a test recipe
Input/output of a post-filter
Input and output of a post-filter are as follows:
Input: acoustic features predicted by an acoustic model
Output: enhanced acoustic features
Note that post-filters do not use delta and delta-delta features. If your acoustic model’s output contains delta and delta-delta features, the parameter generation algorithm (a.k.a. MLPG) is performed to prepare input/output features for post-filters.
You must train an acoustic model first
You must train an acoustic model first since the input of a post-filter depends the output of an acoustic model. Furthermore, please be aware that you need to re-train a post-filter whenever you re-train your acoustic model. Therefore, it is highly recommended to train a good acoustic model before training a post-filter.
Train a good acoustic model first
It is better to train a good acoustic model. This is because the post-filter is trained on the features predicted by the acoustic model. If the acoustic model’s prediction is not accurate enough, the post-filter is likely to have a bad performance.
In addition to the steps described in Getting started with recipes, the following are the steps related to post-filters.
Stage 7: Prepare features for post-filter
Once your acoustic model is ready, you can run the stage 7 to prepare input and output features for training post-filters.
CUDA_VISIBLE_DEVICES=0 ./run.sh --stage 7 --stop-stage 7 \ --acoustic-model acoustic_test
After running the above command, you can find the input features for post-filters in the acoustic model’s checkpoint directory:
$ tree -L 3 exp/yoko/acoustic_test/ exp/yoko/acoustic_test/ ├── best_loss.pth ├── config.yaml ├── epoch0002.pth ├── latest.pth ├── model.yaml ├── norm │ ├── dev │ │ └── in_postfilter │ ├── eval │ │ └── in_postfilter │ ├── in_postfilter_scaler.joblib │ └── train_no_dev │ └── in_postfilter └── predicted └── eval └── latest
norm/*/in_postfilterdirectory contains the input features for post-filters.
norm/in_postfilter_scaler.joblibcontains the scaler used to normalize/de-normalize the input features for post-filters.
As for the output features, you can find them in the
$ tree -L 4 dump/ dump/ └── yoko ├── norm │ ├── dev │ │ ├── in_acoustic │ │ ├── in_duration │ │ ├── in_timelag │ │ ├── in_vocoder │ │ ├── out_acoustic │ │ ├── out_duration │ │ ├── out_postfilter │ │ └── out_timelag │ ├── eval │ │ ├── in_acoustic │ │ ├── in_duration │ │ ├── in_timelag │ │ ├── in_vocoder │ │ ├── out_acoustic │ │ ├── out_duration │ │ ├── out_postfilter │ │ └── out_timelag │ ├── in_acoustic_scaler.joblib │ ├── in_duration_scaler.joblib │ ├── in_timelag_scaler.joblib │ ├── in_vocoder_scaler_mean.npy │ ├── in_vocoder_scaler_scale.npy │ ├── in_vocoder_scaler_var.npy │ ├── out_acoustic_scaler.joblib │ ├── out_duration_scaler.joblib │ ├── out_postfilter_scaler.joblib │ ├── out_timelag_scaler.joblib │ └── train_no_dev │ ├── in_acoustic │ ├── in_duration │ ├── in_timelag │ ├── in_vocoder │ ├── out_acoustic │ ├── out_duration │ ├── out_postfilter │ └── out_timelag └── org
dump/*/norm/*/out_postfilterdirectory contains the output features for post-filters. Again, remember that these features don’t contain delta and delta-delta features.
dump/*/norm/out_postfilter_scaler.joblibcontains the scaler used to normalize/de-normalize the output features for post-filters.
Stage 8: Train post-filters
Once you generated input/output features, you are ready to train post-filters. The current NNSVS’s post-filter is based on generative adversarial networks (GANs). So you need to train generator and discrimiantor together.
There are number of different ways to train post-filters by NNSVS. However, the following is the recommended way to get the best performance (based on r9y9’s experience):
Train a post-filter only for
Train a post-filter only for
Merge the two post-filters into one post-filter
Pre-tuned config files are stored in
Train post-filter for
To train a post-filter for
mgc, you can run the following command:
CUDA_VISIBLE_DEVICES=0 ./run.sh --stage 8 --stop-stage 8 \ --acoustic-model acoustic_test \ --postfilter-model postfilter_mgc_test \ --postfilter-train mgc
Note that you must specify
--postfilter-train mgc. This tells the training script to only use the
mgc feature stream. Other streams such as
bap are ignored.
Training a post-filter for
mgc requires larger amount of GPU VRAM than the normal acoustic model training at the moment. Try using a smaller batch size.
Once the training is finished, you can find model checkpoints in the
$ tree exp/yoko/postfilter_mgc_test exp/yoko/postfilter_mgc_test ├── best_loss.pth ├── best_loss_D.pth ├── config.yaml ├── epoch0002.pth ├── epoch0002_D.pth ├── latest.pth ├── latest_D.pth └── model.yaml
*_D.pthis the model checkpoint for the discriminator. D stands for discriminators.
model.yamlincludes configs for both generator and discrimiantor.
Train post-filter for
CUDA_VISIBLE_DEVICES=0 ./run.sh --stage 8 --stop-stage 8 \ --acoustic-model acoustic_test \ --postfilter-model postfilter_bap \ --postfilter-train bap
Note that you must specify
--postfilter-train bap. This tells the training script to only use the
bap feature stream.
Merge the two post-filters
This step is not included in the recipe. So you need to manually run the following command to merge the two post-filters:
python ../../../utils/merge_postfilters.py exp/yoko/postfilter_mgc_test/latest.pth \ exp/yoko/postfilter_bap_test/latest.pth \ exp/yoko/postfilter_merged
Then, you can see the merged post-filter in the
$ tree exp/yoko/postfilter_merged/ exp/yoko/postfilter_merged/ ├── latest.pth └── model.yaml
Packing models with post-filter
As the same as in Getting started with recipes, you can pack the models into a single directory by running stage 99. Please make sure to specify the merged post-filter like:
CUDA_VISIBLE_DEVICES=0 ./run.sh --stage 99 --stop-stage 99 \ --timelag-model timelag_test \ --duration-model duration_test \ --acoustic-model acoustic_test \ --postfilter-model postfilter_merged
The above command should make a packed model directory with your trained post-filter.
How to use the packed model with the trained post-filter?
post_filter_type="nnsvs" with the nnsvs.svs module. An example:
import numpy as np import pysinsy from nnmnkwii.io import hts from nnsvs.pretrained import retrieve_pretrained_model from nnsvs.svs import SPSVS from nnsvs.util import example_xml_file model_dir = "/path/to/your/packed/model_dir" engine = SPSVS(model_dir) contexts = pysinsy.extract_fullcontext(example_xml_file(key="get_over")) labels = hts.HTSLabelFile.create_from_contexts(contexts) wav, sr = engine.svs(labels, post_filter_type="nnsvs")
Tips for training post-filters
If you look into the post-filter configs, you will find many parameters. Here are the tips if you want to turn by yourself:
fm_weight: The weight of the feature matching loss. By increasing the weight, you may get more stable results with a possible loss of naturalness. By setting
fm_weightto zero, training will get unstable.
adv_weight: The weight of the adversarial loss. By increasing the weight, you may get better naturalness.
mse_weight: The weight of the MSE loss. If you set non-zero value, you will get smoother output features.
smoothed_width: The width of the smoothing window. If you set non-zero value, you will get smoother outputs. This is useful to reduce audible artifacts. Only used for inference.