Vall-E AI voice tool can mimic human speech in seconds

2023-01-09

The Vall-E AI tool only requires a few seconds of training to mimic human speech.

AI technology continues to evolve, automating services and altering the way we interact with one another. Some forms of AI can mimic human speech, though training an AI to learn and reproduce a human voice usually takes a while. The Vall-E AI voice tool, however, has astonished audiences with how quickly it can do so: it can mimic a person’s speech after just a few seconds of training.

As reported by PC Gamer, Vall-E was created by a group of researchers at Microsoft. Its advantage over similar AI voice tools is that it can replicate human speech from a very small sample: Vall-E needs only about three seconds of recorded audio to mimic a person’s speech. A paper published on arXiv, the preprint server hosted by Cornell University (which we learned of from Windows Central), breaks down the difference between Vall-E and other text-to-speech synthesizers.




According to the paper, large-scale data crawled from the Internet cannot meet the data-quality requirements of current text-to-speech (TTS) systems and always leads to performance degradation. Because the training data is relatively small, current TTS systems still suffer from poor generalization. Speaker similarity and speech naturalness decline dramatically for unseen speakers in the zero-shot scenario.

VALL-E significantly outperforms the state-of-the-art zero-shot TTS system [Casanova et al., 2022b] in terms of speech naturalness and speaker similarity, with improvements of +0.12 in comparative mean opinion score (CMOS) and +0.93 in similarity mean opinion score (SMOS) on LibriSpeech. VALL-E also beats the baseline on VCTK, with improvements of +0.11 SMOS and +0.23 CMOS.

Vall-E seems like the next major step forward for AI text-to-speech synthesizers.