Abstract:
We provide an advanced deep learning text to voice converter project. Voice conversion (VC) refers to the process of altering the acoustic characteristics of one person’s voice (the source) to match another person’s voice (the target) while preserving linguistic content (Mohammadi and Kain 2017). It has a wide range of applications, such as creating personalized voices for people with speech disabilities, protecting vocal identity, and entertainment on short-video platforms. Intelligibility and speaker similarity are two important criteria for evaluating a VC model: the former measures the degree to which spoken words or utterances in the converted speech are clear and understandable to listeners, while the latter reflects the similarity between the converted speech and the target speech. A good VC model should achieve high intelligibility and, at the same time, produce speech characteristics similar to those of the target speaker. VC has advanced significantly in recent years (Li, Tu, and Xiao 2023; Qian et al. 2019; Chou and Lee 2019; Casanova et al. 2022) owing to the development of novel generative models such as VAEs, GANs, and flow models.
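The two evaluation criteria can be made concrete with a small sketch. The text does not say which metrics are used, so this is an assumption: intelligibility is approximated here by word error rate (WER) between a reference transcript and a transcript of the converted speech, and speaker similarity by cosine similarity between two speaker embedding vectors. The function names and the toy inputs are hypothetical, and the speech recognizer and speaker encoder that would produce the transcripts and embeddings are not shown.

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Assumed toy inputs: one substituted word out of three, and two made-up embeddings.
wer = word_error_rate("the cat sat", "the bat sat")
similarity = cosine_similarity([1.0, 0.0], [1.0, 0.0])
```

Under these assumptions, a lower WER indicates higher intelligibility, and a cosine similarity closer to 1 indicates that the converted speech sounds more like the target speaker.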
Voice conversion (VC) aims at altering a person’s voice to make it sound similar to the voice of another person while preserving linguistic content. Existing methods suffer from a dilemma between content intelligibility and speaker similarity; i.e., methods with higher intelligibility usually have a lower speaker similarity, while methods with higher speaker similarity usually require plenty of target speaker voice data to achieve high intelligibility. In this work, we propose a novel method, Phoneme Hallucinator, that achieves the best of both worlds. Phoneme Hallucinator is a one-shot VC model; it adopts a novel model to hallucinate diversified and high-fidelity target speaker phonemes based just on a short target speaker voice (e.g., 3 seconds). The hallucinated phonemes are then exploited to perform neighbor-based voice conversion. Our model is a text-free, any-to-any VC model that requires no text annotations and supports conversion to any unseen speaker. Objective and subjective evaluations show that Phoneme Hallucinator outperforms existing VC methods for both intelligibility and speaker similarity.
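The neighbor-based conversion step can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes speech is already represented as frame-level feature vectors, that the target pool contains both the features from the short target recording and the hallucinated ones, and that each source frame is replaced by the average of its k nearest target frames. The function name `knn_convert`, the Euclidean distance, and the value of `k` are all illustrative assumptions; the hallucination model and the vocoder that turns features back into audio are not shown.

```python
import math


def knn_convert(source_frames, target_pool, k=4):
    """Replace each source frame with the mean of its k nearest target frames."""
    if not target_pool:
        raise ValueError("target_pool must not be empty")
    k = min(k, len(target_pool))
    converted = []
    for frame in source_frames:
        # Rank target frames by Euclidean distance to this source frame
        # (the distance metric is an assumption of this sketch).
        nearest = sorted(target_pool, key=lambda t: math.dist(frame, t))[:k]
        # Average the selected neighbors dimension by dimension.
        converted.append([sum(dim) / k for dim in zip(*nearest)])
    return converted


# Toy 2-dimensional features standing in for real speech features.
source = [[0.0, 0.0], [10.0, 10.0]]
pool = [[0.0, 1.0], [1.0, 0.0], [9.0, 10.0], [10.0, 9.0]]
result = knn_convert(source, pool, k=2)
```

The sketch shows why a larger and more diverse target pool helps: with only a few seconds of target speech, many source frames have no close neighbor in the pool, which is the gap that hallucinated phonemes are meant to fill.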