Microsoft's new AI bot VALL-E can replicate anyone’s voice with just a 3-second audio sample

Jan 10, 2023 - 15:30

A team of researchers at Microsoft has developed a new text-to-speech AI model called VALL-E that can simulate a person’s voice almost perfectly once it has been trained. To train the model on a given voice, all they need is a three-second audio sample.

Moreover, the researchers claim that once the AI bot learns a specific voice, VALL-E can synthesize audio of that person saying anything, and do it in a way that attempts to preserve the speaker’s emotional tone.

The developers of VALL-E suggest it could be used for high-quality text-to-speech applications, for speech editing, in which a recording of a person is edited and changed via a text transcript, and for content creation in conjunction with other generative AI models like GPT-3.

Microsoft’s VALL-E builds on a technology called EnCodec, which Meta announced in October 2022. Unlike other text-to-speech methods that typically synthesize speech by manipulating waveforms, VALL-E generates discrete audio codec codes from text and acoustic prompts. Basically, VALL-E analyzes how a person sounds and breaks the voice down into tokens. It then uses its training data to match what it “knows” about how that voice would sound if it spoke other phrases.
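To make the idea of "breaking a voice down into tokens" concrete, here is a minimal sketch of the general principle behind discrete audio codes: each short slice of audio is replaced by the index of its nearest entry in a codebook. This is a toy illustration, not VALL-E's or EnCodec's actual code. The four-entry `CODEBOOK`, the two-dimensional "feature frames" and the function names are all made up for the example; a real neural codec learns its codebooks from data.

```python
import math

# Hypothetical toy codebook: each entry is a made-up 2-D "acoustic feature".
# A real neural audio codec learns much larger codebooks from data.
CODEBOOK = [
    (0.0, 0.0),
    (1.0, 0.0),
    (0.0, 1.0),
    (1.0, 1.0),
]

def quantize(frame):
    """Return the index of the codebook entry closest to one feature frame."""
    return min(range(len(CODEBOOK)), key=lambda i: math.dist(frame, CODEBOOK[i]))

def encode(frames):
    """Turn a sequence of feature frames into a sequence of discrete token ids."""
    return [quantize(frame) for frame in frames]

def decode(tokens):
    """Map token ids back to their codebook vectors (a lossy reconstruction)."""
    return [CODEBOOK[token] for token in tokens]

# Four invented frames standing in for a short stretch of audio.
frames = [(0.1, -0.1), (0.9, 0.2), (0.2, 0.8), (0.95, 1.1)]
tokens = encode(frames)
print(tokens)          # [0, 1, 2, 3]
print(decode(tokens))  # the nearest codebook vectors, not the exact input
```

Once audio is a sequence of integers like this, a model can treat speech generation much like text generation: predict the next token, then decode the tokens back into sound.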

Microsoft used LibriLight, an audio library put together by Meta, to train VALL-E’s voice synthesis skills. The majority of its 60,000 hours of English-language speech is taken from LibriVox public domain audiobooks and is spoken by more than 7,000 different people. For VALL-E to produce a satisfactory result, the voice in the three-second sample must closely resemble a voice in the training data.

In addition to preserving a speaker’s vocal timbre and emotional tone, VALL-E can also imitate the “acoustic environment” of the sample audio. If the sample came from a telephone call, for instance, the synthetic output will imitate the acoustic and frequency qualities of a telephone call, which is a fancy way of saying that it will sound like a telephone call as well. Additionally, Microsoft’s samples (included in the “Synthesis of Diversity” section) show how VALL-E can produce different voice tones by altering the random seed used during generation.
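The role of the random seed can be illustrated with a small sketch. It assumes a generator that picks each audio token at random from a probability distribution, which is how sampling-based generative models commonly work; the token ids, the weights and the `sample_tokens` helper are invented for the example and are not taken from VALL-E.

```python
import random

# Hypothetical next-token distribution over four made-up audio token ids.
TOKEN_IDS = [0, 1, 2, 3]
WEIGHTS = [0.4, 0.3, 0.2, 0.1]

def sample_tokens(seed, length=8):
    """Draw a token sequence; the seed alone determines which one we get."""
    rng = random.Random(seed)  # private generator, so runs are reproducible
    return rng.choices(TOKEN_IDS, weights=WEIGHTS, k=length)

take_a = sample_tokens(seed=1)
take_a_again = sample_tokens(seed=1)
take_b = sample_tokens(seed=2)

print(take_a == take_a_again)  # True: same seed, same sequence
print(take_a, take_b)          # a different seed typically gives a different take
```

The same text and the same voice prompt can therefore yield several distinct "takes", each reproducible by reusing its seed.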
