How to train post-filters
Please check Getting started with recipes and Overview of NNSVS’s SVS first.
NNSVS v0.0.3 and later support an optional trainable post-filter to enhance the acoustic model’s predictions. This page summarizes how to train post-filters.
Warning
As of 2022/10/15, I concluded that the GV post-filter works better than trainable post-filters in most cases. Please consider using the GV post-filter instead.
Note
The contents of this page are based on recipes/conf/spsvs/run_common_steps_dev.sh.
Also, before making your custom recipes, it is recommended to start with the test recipe recipes/nit-song070/dev-test.
Pre-requisites
Input/output of a post-filter
Input and output of a post-filter are as follows:
Input: acoustic features predicted by an acoustic model
Output: enhanced acoustic features
Note that post-filters do not use delta and delta-delta features. If your acoustic model’s output contains delta and delta-delta features, the parameter generation algorithm (a.k.a. MLPG) is performed to prepare input/output features for post-filters.
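As a toy illustration (the dimensions here are hypothetical, and NNSVS actually recovers static features with MLPG rather than plain slicing), the relation between the stacked feature streams and the static-sized features a post-filter sees looks like this:

```python
import numpy as np

# Hypothetical acoustic feature matrix: 100 frames with a 60-dim static
# part, plus delta and delta-delta appended (180 dims in total).
T, static_dim = 100, 60
features = np.random.randn(T, 3 * static_dim)  # static | delta | delta-delta

# The post-filter only ever sees static-sized features; the dynamic
# streams are consumed by MLPG and never reach the post-filter.
static_part = features[:, :static_dim]
```

In the real pipeline, MLPG combines all three streams into a smooth static trajectory; the slice above only shows the stream layout.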
You must train an acoustic model first
You must train an acoustic model first, since the input of a post-filter depends on the output of an acoustic model. Furthermore, be aware that you need to re-train a post-filter whenever you re-train your acoustic model. Therefore, it is highly recommended to train a good acoustic model before training a post-filter.
Train a good acoustic model first
The post-filter is trained on the features predicted by the acoustic model. If the acoustic model’s predictions are not accurate enough, the post-filter is likely to perform poorly.
In addition to the steps described in Getting started with recipes, the following are the steps related to post-filters.
Stage 7: Prepare features for post-filter
Once your acoustic model is ready, you can run the stage 7 to prepare input and output features for training post-filters.
CUDA_VISIBLE_DEVICES=0 ./run.sh --stage 7 --stop-stage 7 \
--acoustic-model acoustic_test
After running the above command, you can find the input features for post-filters in the acoustic model’s checkpoint directory:
$ tree -L 3 exp/yoko/acoustic_test/
exp/yoko/acoustic_test/
├── best_loss.pth
├── config.yaml
├── epoch0002.pth
├── latest.pth
├── model.yaml
├── norm
│ ├── dev
│ │ └── in_postfilter
│ ├── eval
│ │ └── in_postfilter
│ ├── in_postfilter_scaler.joblib
│ └── train_no_dev
│ └── in_postfilter
└── predicted
└── eval
└── latest
Some notes:
- The norm/*/in_postfilter directories contain the input features for post-filters.
- norm/in_postfilter_scaler.joblib contains the scaler used to normalize/de-normalize the input features for post-filters.
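If you want to inspect these features yourself, the .joblib files can be loaded with joblib. A minimal sketch with a stand-in scaler, assuming the scaler is a scikit-learn scaler (the real file would be exp/yoko/acoustic_test/norm/in_postfilter_scaler.joblib; the local file name here is just for illustration):

```python
import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Build and save a stand-in scaler in place of the real
# in_postfilter_scaler.joblib produced by stage 7.
scaler = StandardScaler().fit(np.random.randn(200, 60))
joblib.dump(scaler, "in_postfilter_scaler.joblib")

# At synthesis time, features are normalized before the post-filter and
# de-normalized afterwards with the same scaler.
scaler = joblib.load("in_postfilter_scaler.joblib")
feats = np.random.randn(10, 60)
normalized = scaler.transform(feats)
restored = scaler.inverse_transform(normalized)
```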
As for the output features, you can find them in the dump directory.
$ tree -L 4 dump/
dump/
└── yoko
├── norm
│ ├── dev
│ │ ├── in_acoustic
│ │ ├── in_duration
│ │ ├── in_timelag
│ │ ├── in_vocoder
│ │ ├── out_acoustic
│ │ ├── out_duration
│ │ ├── out_postfilter
│ │ └── out_timelag
│ ├── eval
│ │ ├── in_acoustic
│ │ ├── in_duration
│ │ ├── in_timelag
│ │ ├── in_vocoder
│ │ ├── out_acoustic
│ │ ├── out_duration
│ │ ├── out_postfilter
│ │ └── out_timelag
│ ├── in_acoustic_scaler.joblib
│ ├── in_duration_scaler.joblib
│ ├── in_timelag_scaler.joblib
│ ├── in_vocoder_scaler_mean.npy
│ ├── in_vocoder_scaler_scale.npy
│ ├── in_vocoder_scaler_var.npy
│ ├── out_acoustic_scaler.joblib
│ ├── out_duration_scaler.joblib
│ ├── out_postfilter_scaler.joblib
│ ├── out_timelag_scaler.joblib
│ └── train_no_dev
│ ├── in_acoustic
│ ├── in_duration
│ ├── in_timelag
│ ├── in_vocoder
│ ├── out_acoustic
│ ├── out_duration
│ ├── out_postfilter
│ └── out_timelag
└── org
Some notes:
- The dump/*/norm/*/out_postfilter directories contain the output features for post-filters. Again, remember that these features don’t contain delta and delta-delta features.
- dump/*/norm/out_postfilter_scaler.joblib contains the scaler used to normalize/de-normalize the output features for post-filters.
Stage 8: Train post-filters
Once you have generated the input/output features, you are ready to train post-filters. The current NNSVS post-filter is based on generative adversarial networks (GANs), so you need to train a generator and a discriminator together.
There are a number of different ways to train post-filters with NNSVS. However, the following is the recommended way to get the best performance (based on r9y9’s experience):
1. Train a post-filter only for mgc
2. Train a post-filter only for bap
3. Merge the two post-filters into one post-filter
Pre-tuned config files are stored in recipes/_common/jp_dev_latest/conf/train_postfilter.
Train post-filter for mgc
To train a post-filter for mgc, you can run the following command:
CUDA_VISIBLE_DEVICES=0 ./run.sh --stage 8 --stop-stage 8 \
--acoustic-model acoustic_test \
--postfilter-model postfilter_mgc_test \
--postfilter-train mgc
Note that you must specify --postfilter-train mgc. This tells the training script to use only the mgc feature stream. Other streams such as lf0 and bap are ignored.
Warning
Training a post-filter for mgc currently requires a larger amount of GPU VRAM than normal acoustic model training. If you run out of memory, try using a smaller batch size.
Once the training is finished, you can find model checkpoints in the exp
directory:
$ tree exp/yoko/postfilter_mgc_test
exp/yoko/postfilter_mgc_test
├── best_loss.pth
├── best_loss_D.pth
├── config.yaml
├── epoch0002.pth
├── epoch0002_D.pth
├── latest.pth
├── latest_D.pth
└── model.yaml
Some notes:
- *_D.pth is the model checkpoint for the discriminator (D stands for discriminator).
- model.yaml includes configs for both the generator and the discriminator.
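If you want to inspect a checkpoint programmatically, it can be loaded with torch.load. A minimal sketch using a stand-in file (the layout of the stored dict here is an assumption for illustration; the real checkpoints live under exp/yoko/postfilter_mgc_test/):

```python
import torch

# Create a stand-in checkpoint: a dict holding a state_dict, a common
# layout for PyTorch training checkpoints. The real file would be
# exp/yoko/postfilter_mgc_test/latest.pth.
dummy = {"state_dict": {"weight": torch.zeros(3, 3)}}
torch.save(dummy, "latest.pth")

# Load on CPU and list the stored top-level keys.
checkpoint = torch.load("latest.pth", map_location="cpu")
print(sorted(checkpoint.keys()))
```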
Train post-filter for bap
CUDA_VISIBLE_DEVICES=0 ./run.sh --stage 8 --stop-stage 8 \
--acoustic-model acoustic_test \
    --postfilter-model postfilter_bap_test \
--postfilter-train bap
Note that you must specify --postfilter-train bap. This tells the training script to use only the bap feature stream.
Merge the two post-filters
This step is not included in the recipe, so you need to run the following command manually to merge the two post-filters:
python ../../../utils/merge_postfilters.py exp/yoko/postfilter_mgc_test/latest.pth \
exp/yoko/postfilter_bap_test/latest.pth \
exp/yoko/postfilter_merged
Then, you can see the merged post-filter in the exp/yoko/postfilter_merged directory.
$ tree exp/yoko/postfilter_merged/
exp/yoko/postfilter_merged/
├── latest.pth
└── model.yaml
Packing models with post-filter
As in Getting started with recipes, you can pack the models into a single directory by running stage 99. Please make sure to specify the merged post-filter:
CUDA_VISIBLE_DEVICES=0 ./run.sh --stage 99 --stop-stage 99 \
--timelag-model timelag_test \
--duration-model duration_test \
--acoustic-model acoustic_test \
--postfilter-model postfilter_merged
The above command should make a packed model directory with your trained post-filter.
How to use the packed model with the trained post-filter?
Please specify post_filter_type="nnsvs" when using the nnsvs.svs module. An example:
import pysinsy
from nnmnkwii.io import hts
from nnsvs.svs import SPSVS
from nnsvs.util import example_xml_file
model_dir = "/path/to/your/packed/model_dir"
engine = SPSVS(model_dir)
contexts = pysinsy.extract_fullcontext(example_xml_file(key="get_over"))
labels = hts.HTSLabelFile.create_from_contexts(contexts)
wav, sr = engine.svs(labels, post_filter_type="nnsvs")
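To listen to the result, write the returned waveform to a file. A minimal sketch using scipy, with stand-in wav/sr values in place of the engine.svs output above:

```python
import numpy as np
from scipy.io import wavfile

# Stand-ins for the (wav, sr) returned by engine.svs(...) above:
# one second of a 440 Hz tone as 16-bit PCM.
sr = 48000
wav = (0.3 * 32767 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr)).astype(np.int16)

# Write the synthesized waveform to a WAV file.
wavfile.write("synth.wav", sr, wav)
```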
Tips for training post-filters
If you look into the post-filter configs, you will find many parameters. Here are some tips if you want to tune them yourself:
Train configs
- fm_weight: The weight of the feature matching loss. By increasing the weight, you may get more stable results with a possible loss of naturalness. If you set fm_weight to zero, training will become unstable.
- adv_weight: The weight of the adversarial loss. By increasing the weight, you may get better naturalness.
- mse_weight: The weight of the MSE loss. If you set a non-zero value, you will get smoother output features.
Model configs
- smoothed_width: The width of the smoothing window. If you set a non-zero value, you will get smoother outputs, which is useful to reduce audible artifacts. Only used for inference.
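As a rough intuition for what smoothed_width does, here is a toy moving-average smoother (an illustration only, not NNSVS’s actual implementation):

```python
import numpy as np

def smooth(x, width):
    """Moving-average each feature dimension of x (frames x dims) over time."""
    if width <= 1:
        return x
    kernel = np.ones(width) / width
    # Convolve each column (one feature dimension) along the time axis.
    return np.apply_along_axis(
        lambda col: np.convolve(col, kernel, mode="same"), 0, x
    )

# Noisy toy features: a wider window reduces frame-to-frame jitter,
# analogous to how smoothed_width reduces audible artifacts.
features = np.random.randn(100, 4)
smoothed = smooth(features, 5)
```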
Details of post-filter implementation
You don’t need to understand the details if you just want to try post-filters, but please look into Kaneko et al. [KTKY17b] and Kaneko et al. [KTKY17a] if you are interested.