From Mickey Mouse to Bugs Bunny to Homer Simpson: when animation is done well, it is easy to forget that the character is voiced by a real actor, whose performance is then studied, drawn and animated from scratch.
Yet this long animation tradition is being quietly revolutionised by UEA computing scientists, who are looking to turn the labour-intensive drawing of thousands of frames into a matter of seconds' work.
Famous actors often spend hours in the booth recording their voices: Walt Disney himself voiced the helium-pitched Mickey Mouse. But have you ever wondered how the animated faces are matched so precisely to the words spoken by those distinctive voices?
In animated films, once an audio recording is made, digital artists position the characters’ faces, matching the facial motion to the spoken word. Usually, the animator will use their own face in the mirror as a guide, and can spend hours animating just a few seconds of content to get the lip movements exactly right, conveying both the intended meaning and its emotion.
UEA’s Dr Sarah Taylor is interested in whether this process could be done both digitally and automatically, potentially turning hours of painstaking work creating speech animation into a matter of seconds.
Animated movies either use banks of digital artists to pose characters’ faces by hand, frame by frame, or they use motion capture - tracking sensors on an actor’s face and transferring the facial movement to a digital character. Although it is a more technological approach, motion capture for high-fidelity animation can be expensive and requires the actor’s performance to be recorded for every part of the dialogue.
Instead, Taylor’s research focuses on automatically generating facial animation directly from the audio speech. This is achieved by first filming an actor speaking text that covers the essential speech sounds. Computers can then analyse short audio sections and transform them into digital maps of the actor’s lip patterns. These motions can be transferred onto CG characters, giving animators a digital model with which they can make an animated face "say" whatever they wish, in a believable way.
Working in collaboration with innovative entertainment technology leaders, Taylor hopes that the project’s results will allow the building of tools for animation and computer games studios that could significantly improve the quality and production speed of animated speech content and introduce automation into the process (see the Automatic Speech Animation project: la.disneyresearch.com/publication/deep-learning-speech-animation/).
UEA’s research has focused on learning individual lip movements and how they stitch together visually to make speech. Taylor, together with researchers at Disney Research, recorded an actor saying 2,500 sentences to create and ‘train’ a computer model to predict facial shapes and lip motion from phonemes – the fundamental units of sound in spoken language.
From the recorded speech, phonemes were annotated and algorithms developed to track the contours of the lips and jaw. Data scientists could then automatically extract features describing both the shape and appearance of the face in each video frame, even considering transient effects such as shadows.
The team are also investigating using raw audio speech from any speaker as input and mapping the generated animation onto any number of characters’ faces.
As well as providing animators with sophisticated tools to automate English speech animation, the research has implications for animating other languages and speaking styles, such as shouting, whispering and emotional speech.
In the Speech Lab at UEA, researchers are also working on non-verbal cues in conversation and their effect on how meaning is inferred, something that could improve the realism and believability of animated characters.
Non-verbal communication makes up a large part of spoken interaction – visual cues such as eye contact, facial expression or gesture can significantly affect how spoken words are perceived. UEA researchers are therefore interested in what this means for the automatic production of animated speech, for example by analysing how the same section of speech is interpreted when it is animated with and without head motion.
Taylor first became interested in this area during a third-year undergraduate project that focused on animating expressive faces - her work included developing an algorithm to animate a 3D facial mesh to perform expressions in a realistic way.
Taylor progressed to a PhD looking at animating speech from audio signals and subsequently spent several years at Disney Research Pittsburgh, joining in 2013 as a Postdoctoral Research Associate with the Computer Vision group.
Next steps
Taylor is now Senior Research Associate at UEA and in 2015, together with supervisor Dr Ben Milner and Dr Barry-John Theobald, was awarded an EPSRC grant of £343,515 to look at speech animation using visemes.
The project is investigating new methods for automatically producing speech animation from raw audio speech of any speaker in real time.
They have developed a sliding window neural network model that learns a mapping from a segment of audio speech to a segment of lip motion. For new speech, the predicted lip motions are averaged to generate continuous, smoothly varying speech animation.
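The sliding-window idea described above can be illustrated with a minimal sketch. This is not the team's model: the trained neural network is replaced by a hypothetical stand-in function (toy_predictor), and the window length, the one-frame step and the feature sizes are assumptions chosen only to keep the example small. It also assumes that each audio window maps to a lip-motion window of the same length. What it shows is how overlapping per-window predictions might be averaged into a single value per frame.

```python
import numpy as np

def sliding_windows(features, win):
    """Return every length-`win` window of frames, stepping one frame at a time."""
    n = features.shape[0]
    return np.stack([features[i:i + win] for i in range(n - win + 1)])

def overlap_average(pred_windows, n_frames):
    """Average the overlapping per-window predictions into one value per frame."""
    win, dim = pred_windows.shape[1], pred_windows.shape[2]
    total = np.zeros((n_frames, dim))
    count = np.zeros((n_frames, 1))
    for start, window in enumerate(pred_windows):
        total[start:start + win] += window   # accumulate this window's predictions
        count[start:start + win] += 1        # track how many windows cover each frame
    return total / count

def toy_predictor(audio_windows):
    """Hypothetical stand-in for a trained network (an assumption, not the real model).

    Maps each audio window to a lip-motion window of the same length,
    keeping two made-up "lip parameters" per frame.
    """
    return audio_windows[:, :, :2] * 0.5

# 8 frames of made-up audio features, 3 values per frame.
audio = np.arange(24, dtype=float).reshape(8, 3)
windows = sliding_windows(audio, win=4)                    # shape (5, 4, 3)
lip = overlap_average(toy_predictor(windows), len(audio))  # shape (8, 2)
```

Frames near the middle of the sequence are covered by several windows, so their final values blend several predictions, which is what would smooth the output when a real network's predictions for the same frame disagree slightly between windows.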
They ultimately aim to build tools that commercial animation studios can apply to their own character models, leaving artists free to focus on the overall performance of the character.