Paper
22 April 2022 D-MelGAN: speech synthesis with specific voiceprint features
Daigang Chen, Hua Jiang, Chengxi Pu, Shaowen Yao
Proceedings Volume 12163, International Conference on Statistics, Applied Mathematics, and Computing Science (CSAMCS 2021); 121631I (2022) https://doi.org/10.1117/12.2627502
Event: International Conference on Statistics, Applied Mathematics, and Computing Science (CSAMCS 2021), 2021, Nanjing, China
Abstract
In recent years, speech synthesis based on machine learning has become increasingly popular. Many neural network models can now generate synthetic audio that closely imitates the human voice, and the quality of the generated audio is usually evaluated by the mean opinion score (MOS). The voiceprint is an important characteristic for distinguishing a speaker's speech features, so generating speech with specific voiceprint features is of great significance for broadening the applications of speech synthesis. However, existing speech synthesis models seldom consider the preservation of specific voiceprint features. In this paper, we propose D-MelGAN, a speech synthesis model that targets high-quality speech carrying the voiceprint features of a specific speaker. The model is based on a non-autoregressive, feed-forward convolutional GAN. By embedding the d-vector technique used to identify specific voiceprints into the GAN, it generates raw audio waveforms with the voiceprint characteristics of a specific speaker. The experimental results show that the new model strengthens the voiceprint features of the generated audio while maintaining the quality of the synthesized speech, so the generated speech carries the specific style of a speaker and text-to-speech technology can be applied in more fields.
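As a rough illustration of the approach described above, the sketch below shows how a MelGAN-style, non-autoregressive feed-forward generator can be conditioned on a d-vector speaker embedding so that the synthesized waveform carries a specific speaker's voiceprint. The abstract does not give architectural details, so the layer sizes, the d-vector dimension, and the additive conditioning scheme used here are illustrative assumptions written in PyTorch, not the authors' exact design.

# Minimal sketch (assumption: layer sizes, d-vector dimension, and the
# conditioning-by-addition strategy are illustrative, not the paper's design).
import torch
import torch.nn as nn

class DVectorConditionedGenerator(nn.Module):
    """MelGAN-style non-autoregressive generator that upsamples a
    mel-spectrogram to a raw waveform, conditioned on a d-vector."""
    def __init__(self, n_mels=80, d_vector_dim=256, base_channels=512):
        super().__init__()
        # Project the d-vector so it can be broadcast over time and
        # fused with the mel-spectrogram channels.
        self.speaker_proj = nn.Linear(d_vector_dim, n_mels)
        layers = [nn.Conv1d(n_mels, base_channels, kernel_size=7, padding=3)]
        # Transposed convolutions upsample 256x (8 * 8 * 2 * 2), as in MelGAN.
        for rate in (8, 8, 2, 2):
            out_ch = max(base_channels // 2, 32)
            layers += [
                nn.LeakyReLU(0.2),
                nn.ConvTranspose1d(base_channels, out_ch,
                                   kernel_size=rate * 2, stride=rate,
                                   padding=rate // 2),
            ]
            base_channels = out_ch
        layers += [nn.LeakyReLU(0.2),
                   nn.Conv1d(base_channels, 1, kernel_size=7, padding=3),
                   nn.Tanh()]
        self.net = nn.Sequential(*layers)

    def forward(self, mel, d_vector):
        # mel: (batch, n_mels, frames); d_vector: (batch, d_vector_dim)
        cond = self.speaker_proj(d_vector).unsqueeze(-1)  # (batch, n_mels, 1)
        return self.net(mel + cond)                       # (batch, 1, samples)

# Usage: ~1 s of 22.05 kHz audio at a 256-sample hop -> 86 mel frames.
mel = torch.randn(1, 80, 86)
d_vec = torch.randn(1, 256)
wav = DVectorConditionedGenerator()(mel, d_vec)  # shape (1, 1, 86 * 256)

In this sketch the d-vector is projected to the mel-channel dimension and broadcast over time before upsampling; an actual implementation might instead concatenate the embedding with the mel channels or inject it into each residual block, since the abstract does not specify the fusion point.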
© (2022) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Daigang Chen, Hua Jiang, Chengxi Pu, and Shaowen Yao "D-MelGAN: speech synthesis with specific voiceprint features", Proc. SPIE 12163, International Conference on Statistics, Applied Mathematics, and Computing Science (CSAMCS 2021), 121631I (22 April 2022); https://doi.org/10.1117/12.2627502
RIGHTS & PERMISSIONS
Get copyright permission  Get copyright permission on Copyright Marketplace
KEYWORDS
Autoregressive models
Convolution
Speaker recognition
Neural networks
Systems modeling
Performance modeling