Speaking another language may be getting easier. Google is showing off Translatotron, a first-of-its-kind translation model that can directly convert speech from one language into another while maintaining a speaker’s voice and cadence.
The tool forgoes the usual steps of converting speech to text, translating that text, and then turning it back into speech, a chain in which errors can creep in at each stage. Instead, the end-to-end technique directly translates a speaker’s voice into another language. Google hopes the work will open the door to further research built on the direct translation model.
According to Google, Translatotron uses a sequence-to-sequence network model that takes a voice input, processes it as a spectrogram (a visual representation of the frequencies in a sound over time) and generates a new spectrogram in the target language. Google says the result is faster translation with less chance of something getting lost along the way.
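For readers curious what a spectrogram looks like as data, here is a minimal sketch in plain Python. It slices a signal into short overlapping frames and measures how much energy each frame holds at each frequency. It only illustrates the general idea and is not Google’s code: the function name `spectrogram`, the frame length, the hop size, and the test tone are all assumptions chosen for illustration, and a real system would use an optimized signal-processing library.

```python
import cmath
import math

def spectrogram(signal, frame_len=64, hop=32):
    """Return a magnitude spectrogram as a list of rows (one row per time frame).

    Illustrative only: frame_len and hop are assumed values, not taken from Translatotron.
    """
    # Hann window tapers each frame's edges to reduce spectral leakage.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    n_bins = frame_len // 2 + 1  # non-negative frequencies of a real signal
    rows = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        # Naive discrete Fourier transform of the frame, one frequency bin at a time.
        row = [
            abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                    for n in range(frame_len)))
            for k in range(n_bins)
        ]
        rows.append(row)
    return rows

# Hypothetical input: 0.1 seconds of a 1,000 Hz tone sampled at 8,000 Hz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * i / sample_rate) for i in range(800)]

spec = spectrogram(tone)

# Average each frequency bin over time and find the strongest one.
n_bins = len(spec[0])
mean_energy = [sum(row[k] for row in spec) / len(spec) for k in range(n_bins)]
peak_bin = mean_energy.index(max(mean_energy))
peak_hz = peak_bin * sample_rate / 64  # each bin spans sample_rate / frame_len Hz
```

Each row of `spec` is one slice of time and each column is one frequency band, which is why a spectrogram can be drawn as an image. In this sketch the strongest band lines up with the frequency of the test tone. A model like the one described above would take a grid of this kind as input and produce another one as output.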
The tool also includes an optional speaker encoder component, which is designed to preserve the speaker’s voice in the translated output. The translated speech is still synthesized and sounds a bit robotic, but it retains some characteristics of the original voice. You can listen to samples of Translatotron’s attempts to maintain a speaker’s voice as it completes translations on Google Research’s GitHub page. Some are certainly better than others, but it’s a start.