Papers

Parametric SVS

Towards end-to-end SVS

F0 modeling

Vibrato modeling

  • Yamada et al. [+09]

  • Nakano et al. [NGH06]

Post-filters

TTS

Vocoder

Neural vocoders

Database

All bibliography

[BB17]

Merlijn Blaauw and Jordi Bonada. A neural parametric singing synthesizer modeling timbre and expression from natural songs. Applied Sciences, 7(12):1313, 2017.

[BB18]

Jordi Bonada and Merlijn Blaauw. Recent advances in our neural parametric singing synthesizer. 2018.

[CTL+20]

Jiawei Chen, Xu Tan, Jian Luan, Tao Qin, and Tie-Yan Liu. HiFiSinger: towards high-fidelity neural singing voice synthesis. arXiv preprint arXiv:2009.01776, 2020.

[GYR+21]

Yu Gu, Xiang Yin, Yonghui Rao, Yuan Wan, Benlai Tang, Yang Zhang, Jitong Chen, Yuxuan Wang, and Zejun Ma. ByteSing: a Chinese singing voice synthesis system using duration allocated encoder-decoder acoustic models and WaveRNN vocoders. In Proc. ISCSLP, 1–5. IEEE, 2021.

[HHO+21]

Yukiya Hono, Kei Hashimoto, Keiichiro Oura, Yoshihiko Nankaku, and Keiichi Tokuda. Sinsy: a deep neural network-based singing voice synthesis system. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:2803–2815, 2021.

[HMN+18]

Yukiya Hono, Shumma Murata, Kazuhiro Nakamura, Kei Hashimoto, Keiichiro Oura, Yoshihiko Nankaku, and Keiichi Tokuda. Recent development of the DNN-based singing voice synthesis system—Sinsy. In Proc. APSIPA, 1003–1009. IEEE, 2018.

[KTKY17a]

Takuhiro Kaneko, Shinji Takaki, Hirokazu Kameoka, and Junichi Yamagishi. Generative adversarial network-based postfilter for STFT spectrograms. In Proc. Interspeech, 3389–3393. 2017.

[KTKY17b]

Takuhiro Kaneko, Shinji Takaki, Hirokazu Kameoka, and Junichi Yamagishi. Generative adversarial network-based postfilter for STFT spectrograms. In Proc. Interspeech. August 2017.

[KKdB+19]

Kundan Kumar, Rithesh Kumar, Thibault de Boissiere, Lucas Gestin, Wei Zhen Teoh, Jose Sotelo, Alexandre de Brébisson, Yoshua Bengio, and Aaron C Courville. MelGAN: generative adversarial networks for conditional waveform synthesis. In Advances in Neural Information Processing Systems. 2019.

[LLR+21]

Jinglin Liu, Chengxi Li, Yi Ren, Feiyang Chen, Peng Liu, and Zhou Zhao. DiffSinger: singing voice synthesis via shallow diffusion mechanism. arXiv preprint arXiv:2105.02446, 2021.

[LWL+20]

Peiling Lu, Jie Wu, Jian Luan, Xu Tan, and Li Zhou. XiaoiceSing: a high-quality and integrated singing voice synthesis system. arXiv preprint arXiv:2006.06261, 2020.

[MMO17]

Masanori Morise, Genta Miyashita, and Kenji Ozawa. Low-dimensional representation of spectral envelope without deterioration for full-band speech analysis/synthesis system. In Proc. Interspeech, 409–413. 2017. doi:10.21437/Interspeech.2017-67.

[MYO16]

Masanori Morise, Fumiya Yokomori, and Kenji Ozawa. WORLD: a vocoder-based high-quality speech synthesis system for real-time applications. IEICE Transactions on Information and Systems, 99(7):1877–1884, 2016.

[NGH06]

Tomoyasu Nakano, Masataka Goto, and Yuzuru Hiraga. An automatic singing skill evaluation method for unknown melodies using pitch interval accuracy and vibrato features. In Proc. Interspeech, 1706–1709. 2006.

[OTSK19]

Takuma Okamoto, Tomoki Toda, Yoshinori Shiga, and Hisashi Kawai. Tacotron-based acoustic model using phoneme alignment for practical neural text-to-speech systems. In Proc. ASRU, 214–221. IEEE, 2019.

[OMY+10]

Keiichiro Oura, Ayami Mase, Tomohiko Yamada, Satoru Muto, Yoshihiko Nankaku, and Keiichi Tokuda. Recent development of the HMM-based singing voice synthesis system—Sinsy. In Proc. SSW. 2010.

[RHQ+21]

Yi Ren, Chenxu Hu, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. FastSpeech 2: Fast and high-quality end-to-end text-to-speech. In Proc. ICLR. 2021.

[SZN+06]

Keijiro Saino, Heiga Zen, Yoshihiko Nankaku, Akinobu Lee, and Keiichi Tokuda. An HMM-based singing voice synthesis system. In Proc. Interspeech (ICSLP). 2006.

[SPW+18]

Jonathan Shen, Ruoming Pang, Ron J Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, and others. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In Proc. ICASSP, 4779–4783. 2018.

[SilenHNG12]

Hanna Silén, Elina Helander, Jani Nurminen, and Moncef Gabbouj. Ways to implement global variance in statistical speech synthesis. In Proc. Interspeech. 2012.

[TKT+15]

Shinnosuke Takamichi, Kazuhiro Kobayashi, Kou Tanaka, Tomoki Toda, and Satoshi Nakamura. The NAIST text-to-speech system for the Blizzard Challenge 2015. In Proc. Blizzard Challenge workshop, volume 2. Berlin, Germany, 2015.

[TTB+16]

Shinnosuke Takamichi, Tomoki Toda, Alan W Black, Graham Neubig, Sakriani Sakti, and Satoshi Nakamura. Postfilters to modify the modulation spectrum for statistical parametric speech synthesis. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(4):755–767, 2016.

[WLTT+18]

Xin Wang, Jaime Lorenzo-Trueba, Shinji Takaki, Lauri Juvela, and Junichi Yamagishi. A comparison of recent waveform generation and acoustic modeling methods for neural-network-based speech synthesis. In Proc. ICASSP, 4804–4808. IEEE, 2018.

[WTY18]

Xin Wang, Shinji Takaki, and Junichi Yamagishi. Autoregressive neural F0 model for statistical parametric speech synthesis. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(8):1406–1419, 2018.

[WTY19]

Xin Wang, Shinji Takaki, and Junichi Yamagishi. Neural source-filter waveform models for statistical parametric speech synthesis. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28:402–415, 2019.

[WY19]

Xin Wang and Junichi Yamagishi. Neural harmonic-plus-noise waveform model with trainable maximum voice frequency for text-to-speech synthesis. arXiv preprint arXiv:1908.10256, 2019.

[WWZ+22]

Yu Wang, Xinsheng Wang, Pengcheng Zhu, Jie Wu, Hanzhao Li, Heyang Xue, Yongmao Zhang, Lei Xie, and Mengxiao Bi. Opencpop: a high-quality open source Chinese popular song corpus for singing voice synthesis. arXiv preprint arXiv:2201.07429, 2022.

[WWK16]

Zhizheng Wu, Oliver Watts, and Simon King. Merlin: an open source neural network speech synthesis system. In Proc. SSW, 202–207. 2016.

[ZS14]

Heiga Zen and Andrew Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, 3844–3848. IEEE, 2014.

[ZTB09]

Heiga Zen, Keiichi Tokuda, and Alan W Black. Statistical parametric speech synthesis. Speech Communication, 51(11):1039–1064, 2009.

[ZCX+22]

Yongmao Zhang, Jian Cong, Heyang Xue, Lei Xie, Pengcheng Zhu, and Mengxiao Bi. VISinger: variational inference with adversarial learning for end-to-end singing voice synthesis. In Proc. ICASSP, 7237–7241. IEEE, 2022.

[ZJC+21]

Xiaobin Zhuang, Tao Jiang, Szu-Yu Chou, Bin Wu, Peng Hu, and Simon Lui. LiteSing: towards fast, lightweight and expressive singing voice synthesis. In Proc. ICASSP, 7078–7082. 2021.

[+09]

Tomohiko Yamada, Satoru Muto, Yoshihiko Nankaku, Shinji Sako, Keiichi Tokuda, and others. Vibrato modeling for HMM-based singing voice synthesis. IPSJ SIG Technical Report on Music and Computer (MUS), 2009(5):1–6, 2009. (In Japanese.)