nnsvs.svs

BaseSVS

class nnsvs.svs.BaseSVS[source]

Base class for singing voice synthesis (SVS) inference

All SVS engines should inherit from this class.

The SVS engine takes HTS-style full-context labels as input. The output should be a tuple of the raw waveform and its sampling rate. To allow language-independent SVS, this base class does not define an interface for frontend functionality such as converting MusicXML/UST to HTS labels. Frontend processing should be done externally (e.g., using pysinsy or utaupy) or may be implemented as an optional method.

svs(labels, *args, **kwargs)[source]

Run SVS inference and return the synthesized waveform

Parameters:

labels (nnmnkwii.io.hts.HTSLabelFile) – HTS labels

Returns:

(waveform, sampling rate)

Return type:

tuple
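
For illustration, a minimal sketch of a custom engine implementing this interface (the class name and the silent placeholder synthesis are hypothetical):

import numpy as np

from nnsvs.svs import BaseSVS

class MySVS(BaseSVS):
    """Hypothetical engine implementing the BaseSVS interface."""

    def svs(self, labels, *args, **kwargs):
        sample_rate = 48000
        # A real engine would synthesize singing from the HTS labels here;
        # this stub returns one second of silence.
        wav = np.zeros(sample_rate, dtype=np.int16)
        return wav, sample_rate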

SPSVS

class nnsvs.svs.SPSVS(model_dir, device='cpu', verbose=0)[source]

Statistical parametric singing voice synthesis (SPSVS)

Use the svs method for the simplest inference, or use the separate methods (e.g., predict_acoustic and predict_waveform) to control each component of the parametric SVS system.

Parameters:
  • model_dir (str) – directory of the model

  • device (str) – cpu or cuda

  • verbose (int) – verbosity level

Examples:

Synthesize waveform from a MusicXML file

import numpy as np
import pysinsy
from nnmnkwii.io import hts
from nnsvs.pretrained import retrieve_pretrained_model
from nnsvs.svs import SPSVS
from nnsvs.util import example_xml_file
import matplotlib.pyplot as plt
import librosa.display

# Instantiate the SVS engine
model_dir = retrieve_pretrained_model("r9y9/yoko_latest")
engine = SPSVS(model_dir)

# Extract HTS labels from a MusicXML file
contexts = pysinsy.extract_fullcontext(example_xml_file(key="get_over"))
labels = hts.HTSLabelFile.create_from_contexts(contexts)

# Run inference
wav, sr = engine.svs(labels)

# Plot the result
fig, ax = plt.subplots(figsize=(8,2))
librosa.display.waveshow(wav.astype(np.float32), sr=sr, ax=ax)
[Figure: waveform plot of the synthesized audio]

With WORLD vocoder:

>>> wav, sr = engine.svs(labels, vocoder_type="world")

With a uSFGAN or SiFiGAN vocoder:

>>> wav, sr = engine.svs(labels, vocoder_type="usfgan")
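
The svs method wraps the full pipeline. To control each component separately, as mentioned above, the stages can be run one at a time. A sketch of the equivalent step-by-step pipeline, continuing the example above (argument defaults may differ slightly from svs):

# Predict duration-modified labels (time-lag + duration), then acoustic features
duration_modified_labels = engine.predict_timing(labels)
acoustic_features = engine.predict_acoustic(duration_modified_labels)

# Convert the single ndarray into multi-stream features and generate the waveform
multistream_features = engine.postprocess_acoustic(duration_modified_labels, acoustic_features)
wav = engine.predict_waveform(multistream_features, vocoder_type="world")
wav = engine.postprocess_waveform(wav)
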
set_device(device)[source]

Set device for the SVS model

Parameters:

device (str) – cpu or cuda.

predict_timelag(labels)[source]

Predict time-lag from HTS labels

Parameters:

labels (nnmnkwii.io.hts.HTSLabelFile) – HTS labels.

Returns:

Predicted time-lag.

Return type:

ndarray

predict_duration(labels)[source]

Predict durations from HTS labels

Parameters:

labels (nnmnkwii.io.hts.HTSLabelFile) – HTS labels.

Returns:

Predicted durations.

Return type:

ndarray

postprocess_duration(labels, pred_durations, lag)[source]

Post-process durations

Parameters:
  • labels (nnmnkwii.io.hts.HTSLabelFile) – HTS labels.

  • pred_durations (ndarray) – Predicted durations.

  • lag (ndarray) – Predicted time-lag.

Returns:

Duration-modified HTS labels.

Return type:

nnmnkwii.io.hts.HTSLabelFile
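
For example, the three methods above chain as follows (reusing the engine and labels from the example):

lag = engine.predict_timelag(labels)
durations = engine.predict_duration(labels)
duration_modified_labels = engine.postprocess_duration(labels, durations, lag)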

predict_timing(labels)[source]

Predict timing from HTS labels

Parameters:

labels (nnmnkwii.io.hts.HTSLabelFile) – HTS labels.

Returns:

Duration-modified HTS labels.

Return type:

nnmnkwii.io.hts.HTSLabelFile
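
This is a convenience method that appears to combine predict_timelag, predict_duration, and postprocess_duration into one call:

>>> duration_modified_labels = engine.predict_timing(labels)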

predict_acoustic(duration_modified_labels, f0_shift_in_cent=0)[source]

Predict acoustic features from HTS labels

Parameters:
  • duration_modified_labels (nnmnkwii.io.hts.HTSLabelFile) – HTS labels.

  • f0_shift_in_cent (float) – F0 shift in cents.

Returns:

Predicted acoustic features.

Return type:

ndarray
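
For example, to shift the predicted F0 up by one semitone (100 cents):

>>> acoustic_features = engine.predict_acoustic(duration_modified_labels, f0_shift_in_cent=100)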

postprocess_acoustic(duration_modified_labels, acoustic_features, post_filter_type='gv', trajectory_smoothing=True, trajectory_smoothing_cutoff=50, trajectory_smoothing_cutoff_f0=20, vuv_threshold=0.5, force_fix_vuv=False, fill_silence_to_rest=False, f0_shift_in_cent=0)[source]

Post-process acoustic features

The function converts acoustic features in a single ndarray into a tuple of multi-stream acoustic features, e.g., array -> (mgc, lf0, vuv, bap).

If post_filter_type="nnsvs" is specified, a learned post-filter is applied. However, gv is recommended in general.

Parameters:
  • duration_modified_labels (nnmnkwii.io.hts.HTSLabelFile) – HTS labels.

  • acoustic_features (ndarray) – Predicted acoustic features.

  • post_filter_type (str) – Post-filter type. One of gv, merlin, or nnsvs. gv is recommended for general use.

  • trajectory_smoothing (bool) – Whether to apply trajectory smoothing.

  • trajectory_smoothing_cutoff (float) – Cutoff frequency for trajectory smoothing of spectral features.

  • trajectory_smoothing_cutoff_f0 (float) – Cutoff frequency for trajectory smoothing of f0.

  • vuv_threshold (float) – V/UV threshold.

  • force_fix_vuv (bool) – Force fix V/UV.

  • fill_silence_to_rest (bool) – Whether to fill silence into rest frames.

  • f0_shift_in_cent (float) – F0 shift in cents.

Returns:

Post-processed multi-stream acoustic features.

Return type:

tuple
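
A sketch of unpacking the result; the exact streams depend on the model configuration, but a typical WORLD-based model yields the four streams noted above:

multistream_features = engine.postprocess_acoustic(duration_modified_labels, acoustic_features)
mgc, lf0, vuv, bap = multistream_features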

predict_waveform(multistream_features, vocoder_type='world', vuv_threshold=0.5)[source]

Predict waveform from acoustic features

Parameters:
  • multistream_features (tuple) – Multi-stream acoustic features.

  • vocoder_type (str) – Vocoder type. One of world, pwg, usfgan, or auto. If auto is specified, the vocoder is selected automatically.

  • vuv_threshold (float) – V/UV threshold.

Returns:

Predicted waveform.

Return type:

ndarray
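
For example, continuing from postprocess_acoustic above:

>>> wav = engine.predict_waveform(multistream_features, vocoder_type="world")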

postprocess_waveform(wav, dtype=<class 'numpy.int16'>, peak_norm=False, loudness_norm=False, target_loudness=-20)[source]

Post-process waveform

Parameters:
  • wav (ndarray) – Waveform.

  • dtype (dtype) – Data type of waveform.

  • peak_norm (bool) – Whether to apply peak normalization.

  • loudness_norm (bool) – Whether to apply loudness normalization.

  • target_loudness (float) – Target loudness in dB.

Returns:

Post-processed waveform.

Return type:

ndarray
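
For example, with loudness normalization to the default target of -20 dB:

>>> wav = engine.postprocess_waveform(wav, loudness_norm=True, target_loudness=-20)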

svs(labels, vocoder_type='world', post_filter_type='gv', trajectory_smoothing=True, trajectory_smoothing_cutoff=50, trajectory_smoothing_cutoff_f0=20, vuv_threshold=0.5, style_shift=0, force_fix_vuv=False, fill_silence_to_rest=False, dtype=<class 'numpy.int16'>, peak_norm=False, loudness_norm=False, target_loudness=-20, segmented_synthesis=False)[source]

Synthesize waveform from HTS labels.

Parameters:
  • labels (nnmnkwii.io.hts.HTSLabelFile) – HTS labels

  • vocoder_type (str) – Vocoder type. One of world, pwg, usfgan, or auto. If auto is specified, the vocoder is selected automatically.

  • post_filter_type (str) – Post-filter type. One of merlin, gv, or nnsvs.

  • trajectory_smoothing (bool) – Whether to smooth acoustic feature trajectory.

  • trajectory_smoothing_cutoff (int) – Cutoff frequency for trajectory smoothing.

  • trajectory_smoothing_cutoff_f0 (int) – Cutoff frequency for trajectory smoothing of f0.

  • vuv_threshold (float) – Threshold for VUV.

  • style_shift (int) – Style shift parameter.

  • force_fix_vuv (bool) – Whether to correct VUV.

  • fill_silence_to_rest (bool) – Whether to fill silence into rest frames.

  • dtype (np.dtype) – Data type of the output waveform.

  • peak_norm (bool) – Whether to normalize the waveform by peak value.

  • loudness_norm (bool) – Whether to normalize the waveform by loudness.

  • target_loudness (float) – Target loudness in dB.

  • segmented_synthesis (bool) – Whether to use segmented synthesis.
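
For example, one-call synthesis with loudness normalization and segmented synthesis (presumably useful for long scores):

>>> wav, sr = engine.svs(labels, loudness_norm=True, segmented_synthesis=True)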

NEUTRINO

class nnsvs.svs.NEUTRINO(model_dir, device='cpu', verbose=0)[source]

NEUTRINO-like interface for singing voice synthesis

Parameters:
  • model_dir (str) – model directory

  • device (str) – device name

  • verbose (int) – verbose level

classmethod musicxml2label(input_file)[source]

Convert MusicXML to full and mono HTS labels

Parameters:

input_file (str) – MusicXML file
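
A usage sketch, assuming the method returns the full and mono labels in that order (the return value is not documented above; the file path is a placeholder):

>>> full_labels, mono_labels = NEUTRINO.musicxml2label("song.musicxml")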

get_num_phrases(labels)[source]

Get number of phrases

Parameters:

labels (nnmnkwii.io.hts.HTSLabelFile) – HTS label

Returns:

number of phrases

Return type:

int

get_phraselist(full_labels, timing_labels)[source]

Get phraselist from full and timing HTS labels

Parameters:
  • full_labels (nnmnkwii.io.hts.HTSLabelFile) – full HTS label

  • timing_labels (nnmnkwii.io.hts.HTSLabelFile) – timing HTS label

Returns:

phraselist

Return type:

str

predict_acoustic(full_labels, timing_labels=None, style_shift=0, phrase_num=-1, trajectory_smoothing=True, trajectory_smoothing_cutoff=50, trajectory_smoothing_cutoff_f0=20, vuv_threshold=0.5, force_fix_vuv=False, fill_silence_to_rest=False)[source]

Run the main inference for timing and acoustic feature prediction

Parameters:
  • full_labels (nnmnkwii.io.hts.HTSLabelFile) – full HTS label

  • timing_labels (nnmnkwii.io.hts.HTSLabelFile) – timing HTS label

  • style_shift (int) – style shift parameter

  • phrase_num (int) – phrase number to use for inference

  • trajectory_smoothing (bool) – whether to apply trajectory smoothing

  • trajectory_smoothing_cutoff (float) – cutoff frequency for trajectory smoothing

  • trajectory_smoothing_cutoff_f0 (float) – cutoff frequency for trajectory smoothing of f0

  • vuv_threshold (float) – V/UV threshold

  • force_fix_vuv (bool) – whether to force fix V/UV

  • fill_silence_to_rest (bool) – Whether to fill silence into rest frames.

Returns:

(f0, mgc, bap)

Return type:

tuple
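
A usage sketch, assuming model_dir points to a model compatible with the NEUTRINO-like interface and full_labels comes from musicxml2label above:

>>> engine = NEUTRINO(model_dir)
>>> f0, mgc, bap = engine.predict_acoustic(full_labels)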

predict_waveform(f0, mgc, bap, vocoder_type='world', vuv_threshold=0.5, dtype=<class 'numpy.int16'>, peak_norm=False, loudness_norm=False, target_loudness=-20)[source]

Generate waveform from acoustic features

Parameters:
  • f0 (ndarray) – f0

  • mgc (ndarray) – mel-cepstrum

  • bap (ndarray) – band-aperiodicity

  • vocoder_type (str) – vocoder type

  • vuv_threshold (float) – V/UV threshold

  • dtype (np.dtype) – Data type of the output waveform.

  • peak_norm (bool) – Whether to normalize the waveform by peak value.

  • loudness_norm (bool) – Whether to normalize the waveform by loudness.

  • target_loudness (float) – Target loudness in dB.

Returns:

waveform

Return type:

ndarray
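
For example, synthesizing with the WORLD vocoder from the features predicted above:

>>> wav = engine.predict_waveform(f0, mgc, bap, vocoder_type="world")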

svs(labels)[source]

Synthesize waveform from HTS labels

Parameters:

labels (nnmnkwii.io.hts.HTSLabelFile) – HTS labels

Returns:

(waveform, sample_rate)

Return type:

tuple