Multi Speaker Natural Speech Synthesis Using Generative Flows

Abstract

Modern speech synthesis systems generate natural-sounding speech of high quality. Models based on generative flows have shown especially impressive results, making it possible to produce varied pronunciations of a given text. However, such models focus on synthesizing the voice of a single speaker. Despite recently proposed techniques for accounting for several speakers during training, the quality of multi-speaker speech synthesis still leaves much to be desired. This paper proposes techniques for improving the quality of multi-speaker synthesis with acoustic models based on generative flows. The first technique is to obtain the time alignment between the speech audio signal and the text sequence from an external system. Such a forced alignment specifies which sound is uttered at each moment in time and is necessary for the parallel speech synthesis system considered here, since it resolves the mismatch between the lengths of the input and output sequences. An external alignment system is more accurate than the internal heuristics used during training, since it can learn from a larger amount of data and therefore generalizes better. The second technique is to use fixed-dimensional real-valued vectors containing information about the speaker, the speaker embeddings, obtained from an external system. This paper considers speaker embeddings produced by a system for speaker verification. Such representations have the property that embeddings obtained from speech fragments of the same speaker lie close together in the embedding space, while embeddings obtained from speech fragments of different speakers lie far apart. With such speaker representations, the synthesis system better reproduces the voices of different speakers.
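
To make the role of the forced alignment concrete, below is a minimal Python sketch of the length-regulation step used in parallel text-to-speech models. The function name length_regulate and the toy durations are illustrative assumptions, not taken from the paper; in the proposed approach the per-phoneme durations would come from an external forced-alignment system rather than from the model's own training-time heuristic.

```python
import numpy as np

def length_regulate(phoneme_encodings: np.ndarray,
                    durations: np.ndarray) -> np.ndarray:
    """Expand per-phoneme encoder outputs to per-frame outputs using
    externally supplied durations (spectrogram frames per phoneme)."""
    # Row i is repeated durations[i] times, so the expanded sequence has
    # exactly as many rows as the target spectrogram has frames; this
    # resolves the input/output length mismatch mentioned in the abstract.
    return np.repeat(phoneme_encodings, durations, axis=0)

# Toy usage: 3 phonemes with 4-dimensional encodings; the forced alignment
# says they last 2, 5 and 3 frames respectively.
enc = np.random.randn(3, 4)
dur = np.array([2, 5, 3])
frames = length_regulate(enc, dur)
print(frames.shape)  # (10, 4): one row per output spectrogram frame
```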
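
The speaker-embedding technique can be sketched in the same spirit. This is an illustrative assumption rather than the paper's implementation: cosine_similarity and condition_on_speaker are hypothetical names, the embedding values are made up, and a real system would obtain the embeddings from a trained speaker verification network. The sketch shows the property the abstract describes (same-speaker embeddings are close, different-speaker embeddings are far apart) and one simple way to condition the acoustic model on such a vector.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity, a common way to compare speaker embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def condition_on_speaker(frame_encodings: np.ndarray,
                         speaker_embedding: np.ndarray) -> np.ndarray:
    """Concatenate one fixed-dimensional speaker embedding to every frame
    encoding so the decoder is conditioned on speaker identity."""
    tiled = np.tile(speaker_embedding, (frame_encodings.shape[0], 1))
    return np.concatenate([frame_encodings, tiled], axis=1)

# Embeddings of two utterances by the same speaker score high; an
# embedding of a different speaker scores much lower (values are made up).
emb_a1 = np.array([0.9, 0.1, 0.4])
emb_a2 = np.array([0.8, 0.2, 0.5])
emb_b = np.array([-0.7, 0.6, -0.2])
print(cosine_similarity(emb_a1, emb_a2))  # ~0.99: same speaker
print(cosine_similarity(emb_a1, emb_b))   # ~-0.70: different speakers

conditioned = condition_on_speaker(np.random.randn(10, 4), emb_a1)
print(conditioned.shape)  # (10, 7): frame encoding + speaker embedding
```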

Author Biography

Dmitry Sergeevich Obukhov, Novosibirsk State Technical University; Dasha.AI

Postgraduate Student; Junior Researcher

Published
2021-12-20
How to Cite
OBUKHOV, Dmitry Sergeevich. Multi Speaker Natural Speech Synthesis Using Generative Flows. Modern Information Technologies and IT-Education, [S.l.], v. 17, n. 4, p. 896-905, Dec. 2021. ISSN 2411-1473. Available at: <http://sitito.cs.msu.ru/index.php/SITITO/article/view/807>. Date accessed: 3 July 2024. doi: https://doi.org/10.25559/SITITO.17.202104.896-905.