How to train post-filters
=========================

Please check :doc:`recipes` and :doc:`overview` first.

NNSVS v0.0.3 and later supports an optional trainable post-filter to enhance the acoustic model's prediction. This page summarizes how to train post-filters.

.. warning::

    As of 2022/10/15, I concluded that the GV post-filter works better than trainable post-filters in most cases. Please consider using the GV post-filter instead.

.. note::

    The contents of this page are based on ``recipes/conf/spsvs/run_common_steps_dev.sh``. Also, before you make your custom recipes, it is recommended to start with the test recipe ``recipes/nit-song070/dev-test``.

Pre-requisites
--------------

Input/output of a post-filter
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The input and output of a post-filter are as follows:

- Input: acoustic features predicted by an acoustic model
- Output: enhanced acoustic features

Note that post-filters do not use delta and delta-delta features. If your acoustic model's output contains delta and delta-delta features, the parameter generation algorithm (a.k.a. MLPG) is performed to prepare input/output features for post-filters.

You must train an acoustic model first
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You must train an acoustic model first since the input of a post-filter depends on the output of an acoustic model. Furthermore, be aware that you need to re-train a post-filter whenever you re-train your acoustic model.

Train a good acoustic model first
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It is highly recommended to train a good acoustic model before training a post-filter, because the post-filter is trained on the features predicted by the acoustic model. If the acoustic model's predictions are not accurate enough, the post-filter is likely to perform poorly.

In addition to the steps described in :doc:`recipes`, the following are the steps related to post-filters.

Stage 7: Prepare features for post-filter
-----------------------------------------

Once your acoustic model is ready, you can run stage 7 to prepare input and output features for training post-filters.

.. code-block:: bash

    CUDA_VISIBLE_DEVICES=0 ./run.sh --stage 7 --stop-stage 7 \
        --acoustic-model acoustic_test

After running the above command, you can find the input features for post-filters in the acoustic model's checkpoint directory:

.. code-block::

    $ tree -L 3 exp/yoko/acoustic_test/
    exp/yoko/acoustic_test/
    ├── best_loss.pth
    ├── config.yaml
    ├── epoch0002.pth
    ├── latest.pth
    ├── model.yaml
    ├── norm
    │   ├── dev
    │   │   └── in_postfilter
    │   ├── eval
    │   │   └── in_postfilter
    │   ├── in_postfilter_scaler.joblib
    │   └── train_no_dev
    │       └── in_postfilter
    └── predicted
        └── eval
            └── latest

Some notes:

- The ``norm/*/in_postfilter`` directories contain the input features for post-filters.
- ``norm/in_postfilter_scaler.joblib`` contains the scaler used to normalize/de-normalize the input features for post-filters.
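If you want to inspect these features offline, note that the scaler is saved with ``joblib`` and is assumed here to follow the scikit-learn scaler API (``transform``/``inverse_transform``). The following is a minimal sketch under that assumption; the feature file name is hypothetical, so check your ``in_postfilter`` directory for the actual naming:

.. code-block:: python

    import joblib
    import numpy as np

    # Load the scaler used for the post-filter input features
    scaler = joblib.load("exp/yoko/acoustic_test/norm/in_postfilter_scaler.joblib")

    # Hypothetical file name: check norm/dev/in_postfilter/ for the actual naming
    feats = np.load("exp/yoko/acoustic_test/norm/dev/in_postfilter/utt001-feats.npy")

    # Assuming a scikit-learn-style scaler, inverse_transform() maps the
    # normalized features back to the original feature scale
    denorm = scaler.inverse_transform(feats)
    print(feats.shape, denorm.shape)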
As for the output features, you can find them in the ``dump`` directory:

.. code-block::

    $ tree -L 4 dump/
    dump/
    └── yoko
        ├── norm
        │   ├── dev
        │   │   ├── in_acoustic
        │   │   ├── in_duration
        │   │   ├── in_timelag
        │   │   ├── in_vocoder
        │   │   ├── out_acoustic
        │   │   ├── out_duration
        │   │   ├── out_postfilter
        │   │   └── out_timelag
        │   ├── eval
        │   │   ├── in_acoustic
        │   │   ├── in_duration
        │   │   ├── in_timelag
        │   │   ├── in_vocoder
        │   │   ├── out_acoustic
        │   │   ├── out_duration
        │   │   ├── out_postfilter
        │   │   └── out_timelag
        │   ├── in_acoustic_scaler.joblib
        │   ├── in_duration_scaler.joblib
        │   ├── in_timelag_scaler.joblib
        │   ├── in_vocoder_scaler_mean.npy
        │   ├── in_vocoder_scaler_scale.npy
        │   ├── in_vocoder_scaler_var.npy
        │   ├── out_acoustic_scaler.joblib
        │   ├── out_duration_scaler.joblib
        │   ├── out_postfilter_scaler.joblib
        │   ├── out_timelag_scaler.joblib
        │   └── train_no_dev
        │       ├── in_acoustic
        │       ├── in_duration
        │       ├── in_timelag
        │       ├── in_vocoder
        │       ├── out_acoustic
        │       ├── out_duration
        │       ├── out_postfilter
        │       └── out_timelag
        └── org

Some notes:

- The ``dump/*/norm/*/out_postfilter`` directories contain the output features for post-filters. Again, remember that these features don't contain delta and delta-delta features.
- ``dump/*/norm/out_postfilter_scaler.joblib`` contains the scaler used to normalize/de-normalize the output features for post-filters.

Stage 8: Train post-filters
---------------------------

Once you have generated the input/output features, you are ready to train post-filters. NNSVS's current post-filter is based on generative adversarial networks (GANs), so you need to train a generator and a discriminator together.

There are a number of different ways to train post-filters with NNSVS. However, the following is the recommended way to get the best performance (based on r9y9's experience):

1. Train a post-filter only for ``mgc``
2. Train a post-filter only for ``bap``
3. Merge the two post-filters into one post-filter

Pre-tuned config files are stored in ``recipes/_common/jp_dev_latest/conf/train_postfilter``.

Train post-filter for ``mgc``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To train a post-filter for ``mgc``, you can run the following command:

.. code-block:: bash

    CUDA_VISIBLE_DEVICES=0 ./run.sh --stage 8 --stop-stage 8 \
        --acoustic-model acoustic_test \
        --postfilter-model postfilter_mgc_test \
        --postfilter-train mgc

Note that you must specify ``--postfilter-train mgc``. This tells the training script to only use the ``mgc`` feature stream. Other streams such as ``lf0`` and ``bap`` are ignored.

.. warning::

    Training a post-filter for ``mgc`` currently requires more GPU VRAM than normal acoustic model training. Try using a smaller batch size if you run out of memory.

Once the training is finished, you can find model checkpoints in the ``exp`` directory:

.. code-block::

    $ tree exp/yoko/postfilter_mgc_test
    exp/yoko/postfilter_mgc_test
    ├── best_loss.pth
    ├── best_loss_D.pth
    ├── config.yaml
    ├── epoch0002.pth
    ├── epoch0002_D.pth
    ├── latest.pth
    ├── latest_D.pth
    └── model.yaml

Some notes:

- ``*_D.pth`` is the model checkpoint for the discriminator. D stands for discriminator.
- ``model.yaml`` includes configs for both the generator and the discriminator.

Train post-filter for ``bap``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

    CUDA_VISIBLE_DEVICES=0 ./run.sh --stage 8 --stop-stage 8 \
        --acoustic-model acoustic_test \
        --postfilter-model postfilter_bap_test \
        --postfilter-train bap

Note that you must specify ``--postfilter-train bap``. This tells the training script to only use the ``bap`` feature stream.
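Before moving on to the merge step below, it can be handy to sanity-check the trained checkpoints. The following is a minimal sketch using plain PyTorch; the internal layout of an NNSVS checkpoint (e.g., whether it has a ``state_dict`` key) is an assumption here, so list the top-level keys before relying on them:

.. code-block:: python

    import torch

    # Hypothetical path: replace with your own experiment directory
    ckpt = torch.load("exp/yoko/postfilter_mgc_test/latest.pth", map_location="cpu")

    # The key layout is an assumption; inspect the top-level keys first
    print(list(ckpt.keys()))

    # Fall back to the checkpoint itself if there is no "state_dict" key
    state = ckpt.get("state_dict", ckpt)
    for name, value in state.items():
        if hasattr(value, "shape"):
            print(name, tuple(value.shape))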
Merge the two post-filters
^^^^^^^^^^^^^^^^^^^^^^^^^^

This step is not included in the recipe, so you need to manually run the following command to merge the two post-filters:

.. code-block:: bash

    python ../../../utils/merge_postfilters.py exp/yoko/postfilter_mgc_test/latest.pth \
        exp/yoko/postfilter_bap_test/latest.pth \
        exp/yoko/postfilter_merged

Then, you can find the merged post-filter in the ``exp/yoko/postfilter_merged`` directory:

.. code-block::

    $ tree exp/yoko/postfilter_merged/
    exp/yoko/postfilter_merged/
    ├── latest.pth
    └── model.yaml

Packing models with post-filter
-------------------------------

In the same way as in :doc:`recipes`, you can pack the models into a single directory by running stage 99. Please make sure to specify the merged post-filter:

.. code-block:: bash

    CUDA_VISIBLE_DEVICES=0 ./run.sh --stage 99 --stop-stage 99 \
        --timelag-model timelag_test \
        --duration-model duration_test \
        --acoustic-model acoustic_test \
        --postfilter-model postfilter_merged

The above command should make a packed model directory that includes your trained post-filter.

How to use the packed model with the trained post-filter?
----------------------------------------------------------

Please specify ``post_filter_type="nnsvs"`` with the :doc:`modules/svs` module. An example:

.. code-block:: python

    import pysinsy
    from nnmnkwii.io import hts
    from nnsvs.svs import SPSVS
    from nnsvs.util import example_xml_file

    model_dir = "/path/to/your/packed/model_dir"
    engine = SPSVS(model_dir)

    # Extract full-context labels from an example musicXML file
    contexts = pysinsy.extract_fullcontext(example_xml_file(key="get_over"))
    labels = hts.HTSLabelFile.create_from_contexts(contexts)

    # Synthesize with the trained (NNSVS) post-filter enabled
    wav, sr = engine.svs(labels, post_filter_type="nnsvs")

Tips for training post-filters
------------------------------

If you look into the post-filter configs, you will find many parameters. Here are some tips if you want to tune them yourself:

Train configs
^^^^^^^^^^^^^

- ``fm_weight``: The weight of the feature matching loss. Increasing this weight may give more stable results, at a possible cost in naturalness. Setting ``fm_weight`` to zero makes training unstable.
- ``adv_weight``: The weight of the adversarial loss. Increasing this weight may improve naturalness.
- ``mse_weight``: The weight of the MSE loss. Setting a non-zero value gives smoother output features.

Model configs
^^^^^^^^^^^^^

- ``smoothed_width``: The width of the smoothing window. Setting a non-zero value gives smoother outputs, which is useful to reduce audible artifacts. Only used at inference time.

Details of post-filter implementation
-------------------------------------

You don't need to understand the details if you just want to try it out, but please look into :cite:t:`Kaneko2017Interspeech` and :cite:t:`kaneko2017generative` if you are interested.
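To build intuition for the ``smoothed_width`` parameter mentioned in the tips above, the sketch below applies a simple moving-average window to a feature trajectory. This is only an illustration of the general smoothing technique; it is not NNSVS's actual implementation.

.. code-block:: python

    import numpy as np

    def moving_average_smooth(x, width):
        """Smooth each feature dimension with a moving-average window.

        Purely illustrative; not NNSVS's actual ``smoothed_width`` code.
        """
        if width <= 1:
            return x
        kernel = np.ones(width) / width
        # Convolve along the time axis, one feature dimension at a time
        return np.stack(
            [np.convolve(x[:, d], kernel, mode="same") for d in range(x.shape[1])],
            axis=1,
        )

    # A noisy (frames x dims) trajectory: larger widths give smoother output
    traj = np.random.randn(100, 4)
    smoothed = moving_average_smooth(traj, width=5)
    print(traj.std(), smoothed.std())  # the smoothed trajectory has lower variance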