Google is showing off a new translation tool that goes a step beyond its Translate app. The tool, called Translatotron, translates speech directly into speech in another language and can play it back in an approximation of the user’s own voice.
Translatotron is a first-of-its-kind translation model that translates speech directly into speech in another language, without converting it to text along the way. Conventional speech translation is a cascade: automatic speech recognition converts speech into text, machine translation converts that text into the target language, and text-to-speech software reads the result aloud. Because the new model skips those intermediate steps, there are fewer places for errors to creep in and compound. Google hopes this end-to-end technique will open up future developments in speech-to-speech translation.
As Google explains, Translatotron uses a sequence-to-sequence network that takes a voice input, processes it as a spectrogram, and generates a new spectrogram in the target language. In principle, that makes translation faster and less prone to errors compounding between steps. Because the tool works with spectrograms, it can also make use of an optional speaker encoder that lets the translated speech resemble the source voice. That means Translatotron can not only translate speech from one language directly into another, it can also preserve the cadence, pitch, and other facets that make a speaker’s voice unique.
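For readers unfamiliar with the term, a spectrogram is a picture of how much energy a sound carries at each frequency over time. The Python sketch below is a minimal illustration of that idea only, not Google's code: it slices a signal into overlapping frames and runs a plain Fourier transform on each one. The function name, frame size, hop length, and test tone are all assumptions chosen for the example, and the features Translatotron actually consumes may differ.

```python
import cmath
import math


def spectrogram(samples, frame_size=64, hop=32):
    """Return one list of frequency-bin magnitudes per overlapping frame.

    Illustrative only: a naive discrete Fourier transform, far slower than
    the FFT-based routines a production system would use.
    """
    # Hann window tapers each frame's edges to reduce spectral leakage.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size)
              for n in range(frame_size)]
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_size)]
        magnitudes = []
        # Real-valued input: only bins 0..frame_size/2 carry unique information.
        for k in range(frame_size // 2 + 1):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                      for n in range(frame_size))
            magnitudes.append(abs(acc))
        frames.append(magnitudes)
    return frames


# Hypothetical test signal: a pure 1000 Hz tone sampled at 8000 Hz.
sample_rate = 8000
tone_hz = 1000
samples = [math.sin(2 * math.pi * tone_hz * t / sample_rate) for t in range(512)]

spec = spectrogram(samples)
# Each bin spans sample_rate / frame_size = 125 Hz, so the tone lands in bin 8.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print(len(spec), "frames,", len(spec[0]), "bins, peak at bin", peak_bin)
```

A model like the one Google describes would take a grid of this kind as input and learn to emit a different grid, which is then turned back into audio.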
As Engadget notes, the mimicry isn’t perfect. The resulting voice still sounds robotic, but it does retain some elements of the speaker’s voice. Audio samples are available on Google Research’s GitHub page. As with any innovation, it will take time to refine, but it’s a start.
“To the best of our knowledge, Translatotron is the first end-to-end model that can directly translate speech from one language into speech in another language,” Google writes in its blog post. “It is also able to retain the source speaker’s voice in the translated speech.”
To gauge translation quality, Google used BLEU scores, a standard metric that compares a system’s output against reference translations. Though the results lag behind a conventional cascade system, Google says it has demonstrated the feasibility of end-to-end direct speech-to-speech translation.
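To give a sense of what a BLEU score measures, here is a simplified, single-sentence version in Python. It is a sketch under stated assumptions, not the evaluation Google ran: the function names are hypothetical, it compares against one reference only, it counts word pairs at most (standard BLEU goes up to four-word sequences and is computed over a whole test set), and it applies no smoothing.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=2):
    """Simplified sentence-level BLEU against a single reference.

    Assumptions for illustration: whitespace tokenization, n-grams up to
    max_n, and a score of 0.0 whenever any n-gram order has no overlap.
    """
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clipped overlap: a candidate n-gram is credited at most as many
        # times as it appears in the reference.
        overlap = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty discourages very short candidates.
    if len(cand) >= len(ref):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return brevity * math.exp(sum(log_precisions) / max_n)


score = simple_bleu("the cat sat on the mat", "the cat is on the mat")
print(round(score, 4))
```

A score of 1.0 means the output matches the reference exactly, and lower scores mean less overlap. A speech-to-speech system's audio output has to be turned into text before a text metric like this can be applied.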
Details about Translatotron are laid out in a just-published paper called “Direct speech-to-speech translation with a sequence-to-sequence model.” The tool comes a month after Google introduced SpecAugment, a data augmentation technique that treats spectrograms as images and applies computer vision-style transformations to them to improve speech recognition.