Euros Final 2022: Lip-reading, swear words, and robots
Issue 10: News Section | author: Georgia Clothier
Georgia Clothier, a third-year Linguistics student at the University of Cambridge, writes about advances in lip-reading technology in light of a memorable moment at the 2022 Euros final.
Content warning: contains swearing
Minutes before England’s winning goal in the Euros final, I turned to my family and said “I know exactly what she just said”. Because I, like millions of others, had just seen England midfielder Jill Scott form the words “FUCK OFF YOU FUCKING PRICK” live on the BBC. Then they played it again. In slow motion. “Apologies to any lip-readers” immediately followed from the commentary box, and the next day in The Sun: “BBC apologise after England star Jill Scott’s X-rated blast…”[1].
When I’d finished giggling, I thought about how I could tell exactly what she said without any sound at all, as a normal-hearing person who rarely performs that task. What does it take to be a good lip-reader? Can computers do it?
Estimates suggest that, without any audio, humans miss at least a third of words, filling some in from context[2]. Performance varies, and some people are very good lip-readers, but this is much worse than audio-only recognition. The main thing that makes lip-reading so hard for both humans and machines is visual ambiguity. Ambiguities arise because the set of mouth positions, or visemes (the contrastive “minimal units” of video), is smaller than the set of phonemes (the contrastive “minimal units” of audio). Some phonemes are produced with mouth movements that look very similar or identical, like voiced/voiceless pairs such as /p/ and /b/, where the distinguishing gesture happens at the larynx and is invisible to the viewer. Many phonemes (e.g. /k/ and /g/) can also look very different in different phonetic environments. So mapping from visemes to phonemes or words is far from straightforward.
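To make the many-to-one problem concrete, here is a toy sketch in Python. The viseme groupings are purely illustrative, not a standard inventory: the point is just that several phonemes collapse onto the same mouth shape, so different words can look identical on silent video.

```python
# Toy sketch (not a standard viseme inventory): several phonemes collapse
# onto one mouth shape, so distinct words can share a viseme string.

PHONEME_TO_VISEME = {
    "p": "LIPS", "b": "LIPS", "m": "LIPS",   # bilabials look alike; voicing happens at the larynx
    "f": "TEETH", "v": "TEETH",              # labiodentals: /f/ vs /v/ differ only in voicing
    "ae": "OPEN",                            # open vowel
    "t": "TONGUE", "d": "TONGUE", "n": "TONGUE",
}

def viseme_string(phonemes):
    """Collapse a phoneme sequence into the viseme sequence a silent viewer sees."""
    return tuple(PHONEME_TO_VISEME[p] for p in phonemes)

lexicon = {
    "pat": ["p", "ae", "t"],
    "bat": ["b", "ae", "t"],
    "mad": ["m", "ae", "d"],
}

# All three words map to ('LIPS', 'OPEN', 'TONGUE'): the video alone
# cannot tell them apart, which is exactly the lip-reader's problem.
for word, phones in lexicon.items():
    print(word, viseme_string(phones))
```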
If human performance is iffy at best, we might ask whether machines can do any better. The most successful approaches use neural networks: machine learning models consisting of layers of computing units which each take input values, perform some computation on them, and give an output value. The learning part involves using lots of corpus data to figure out what computations to apply to the input to produce the best approximation of the “correct answers” observed in the corpus. In our case, this means learning what has to be done to the information in the video frames (or whatever visual input we use) so that the model “guesses” the matching phoneme, word, or sentence as output. After training, we just feed in a test video and see what the model’s best guess is with the settings it has learned.
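For the curious, here is a minimal sketch of that training loop, assuming PyTorch. The feature sizes, the 50-word vocabulary and the “corpus” are all invented for illustration and bear no relation to a real lip-reading model.

```python
# Minimal sketch of the training idea described above, assuming PyTorch.
# Shapes, vocabulary size and data are invented for illustration only.
import torch
import torch.nn as nn

NUM_WORDS = 50          # size of a hypothetical output vocabulary
FRAME_FEATURES = 128    # pretend each video clip is summarised as 128 numbers

model = nn.Sequential(              # layers of computing units:
    nn.Linear(FRAME_FEATURES, 64),  # each takes inputs, computes, outputs
    nn.ReLU(),
    nn.Linear(64, NUM_WORDS),       # final layer: one score per candidate word
)
loss_fn = nn.CrossEntropyLoss()     # how far the guess is from the "correct answer"
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Fake "corpus": random clip features paired with random correct word labels.
clips = torch.randn(32, FRAME_FEATURES)
labels = torch.randint(0, NUM_WORDS, (32,))

for epoch in range(10):                 # repeatedly adjust the computations...
    scores = model(clips)               # ...so the model's guesses...
    loss = loss_fn(scores, labels)      # ...get closer to the corpus answers.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# After training, feed in a new clip and take the model's best guess.
best_guess = model(torch.randn(1, FRAME_FEATURES)).argmax(dim=1)
```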
Neural networks are good at using context, especially “recurrent” models like Long Short-Term Memory networks (LSTMs). These form a kind of memory by letting computing units use information from previous computations. LSTMs perform even better when they read the input both forwards and backwards, since the model is then informed by both the preceding and the following mouth positions, although this requires having the complete utterance to start with (i.e. not “live” input) and is computationally expensive[3].
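Here is a small sketch of what reading the frames “forwards and backwards” looks like in practice, again assuming PyTorch and made-up sizes:

```python
# Sketch of a bidirectional LSTM over a sequence of video frames,
# assuming PyTorch; all sizes are made up for illustration.
import torch
import torch.nn as nn

FRAME_FEATURES = 128   # hypothetical features extracted from each video frame
HIDDEN = 64

# bidirectional=True runs one LSTM forwards and one backwards over the frames,
# so every time step "sees" both the preceding and the following mouth positions.
lstm = nn.LSTM(input_size=FRAME_FEATURES, hidden_size=HIDDEN,
               batch_first=True, bidirectional=True)

frames = torch.randn(1, 75, FRAME_FEATURES)   # one clip of 75 frames
outputs, _ = lstm(frames)
print(outputs.shape)  # (1, 75, 128): forward and backward states concatenated

# The catch: the backward pass needs the last frame before it can start,
# so the full utterance must already be available (no live subtitling),
# and running two LSTMs roughly doubles the computation.
```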
Models with this sort of architecture made headlines in 2016: MIT Technology Review announced “AI Has Beaten Humans at Lip-Reading”[4], referring to a paper by Assael et al.[5], who report 93.4% word recognition accuracy when training and testing on the GRID corpus[6]. Hearing-impaired subjects achieved 52.3% on the same task. Indeed, neural networks can exceed human performance in alphabet-level and word/sentence-level tasks[3].
But this must be taken with a rather large pinch of salt. Lip-reading in a realistic setting is still a huge challenge.
Neural networks need lots of training data, which is not so readily available for lip-reading. Even when corpora include lots of utterances, they should ideally also include a range of utterance types, or the model will only make good predictions about input that resembles the limited training data. Take the GRID corpus. It includes 34,000 utterances (enough to train a neural network), but all of a constrained type: the format command + colour + preposition + letter + digit + adverb, e.g. “Set blue at A 1 again”. The vocabulary is only 51 words. Since network training involves learning settings that fit the training data, Assael et al.’s model performed excellently when tested on this type of GRID phrase. But it would clearly not be much good up against, say, a furious Jill Scott, whose words were (well) outside the defined vocabulary it chooses from. Unlike the GRID speakers, Scott was not sitting still, face-on and uniformly lit, filmed close-up with a very high-resolution camera, or speaking in a relaxed, neutral way. There are computational methods to reduce ‘overfitting’ to the training data, but until we can access the as-yet unpublished RAJE (Really Angry Jill at Euros) corpus, our subtitling task will be tough.
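To see just how narrow that training distribution is, here is a toy generator for GRID-style sentences. The word lists follow the corpus description in [6]; the code itself is only an illustration.

```python
# Toy generator for GRID-style sentences (word lists follow the corpus
# description in [6]; the sketch itself is just for illustration).
import random

COMMANDS     = ["bin", "lay", "place", "set"]
COLOURS      = ["blue", "green", "red", "white"]
PREPOSITIONS = ["at", "by", "in", "with"]
LETTERS      = list("abcdefghijklmnopqrstuvxyz")   # 25 letters, no 'w'
DIGITS       = [str(d) for d in range(10)]
ADVERBS      = ["again", "now", "please", "soon"]

def grid_sentence():
    """Every GRID utterance has exactly this six-slot shape."""
    return " ".join(random.choice(slot) for slot in
                    [COMMANDS, COLOURS, PREPOSITIONS, LETTERS, DIGITS, ADVERBS])

print(grid_sentence())   # e.g. "set blue at a 1 again"

# The whole vocabulary: 4 + 4 + 4 + 25 + 10 + 4 = 51 words.
vocab = set(COMMANDS + COLOURS + PREPOSITIONS + LETTERS + DIGITS + ADVERBS)
print(len(vocab))        # 51

# A model trained only on this distribution has no way to output
# anything a furious footballer might actually say.
```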
Conversely, humans might have more trouble lip-reading “Set blue at A 1 again” than “FUCK OFF YOU FUCKING PRICK”, because little in the first sentence helps you guess the next word. That is, our knowledge of English tells us that “OFF” is very likely to follow “FUCK”, “ON” is unlikely to follow it, and the whole thing is likely to come from a footballer who has just been fouled in extra time of the Euros final. We can use language knowledge, discourse context, speaker emotion, and body language to inform our lip-reading, which helps in realistic settings.
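That intuition is essentially a language model. A toy bigram sketch, with completely invented counts, shows how knowing what tends to follow what could break ties that the video alone leaves open:

```python
# Toy bigram language model with invented counts, just to illustrate how
# knowledge of English can break ties that the video alone leaves open.
from collections import Counter

# Hypothetical counts of what follows "fuck" in some imaginary corpus of
# touchline speech. The numbers are made up for illustration.
following = Counter({"off": 90, "you": 25, "this": 8, "on": 1})

total = sum(following.values())
for word, count in following.most_common():
    print(f"P({word!r} | 'fuck') ~ {count / total:.2f}")

# A lip-reader (human or machine) who cannot visually separate two candidate
# continuations can still pick the one the language model says is far more
# probable, which is roughly what we do when we "fill in" words from context.
```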
Up against so many challenges, it seems reasonable to ask why we even want a lip-reading machine, save pure nosiness. Applications include improving speech recognition and phone dictation in noisy environments, processing silent films, separating the speech of multiple simultaneous speakers, and biometric identification[2][5]. And since humans use both visual and auditory cues for speech recognition (see the well-known McGurk experiments[7]), combining the two channels is a productive research area.
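As a (very) simplified illustration of what combining the channels might look like, here is a sketch of the crudest kind of fusion, concatenating audio and visual feature vectors before classification; it assumes PyTorch and invented sizes, and real audio-visual systems are far more sophisticated.

```python
# Sketch of the simplest audio-visual fusion: concatenate the two feature
# streams and classify. Assumes PyTorch; all sizes are invented.
import torch
import torch.nn as nn

AUDIO_FEATURES, VISUAL_FEATURES, NUM_WORDS = 40, 128, 500

classifier = nn.Linear(AUDIO_FEATURES + VISUAL_FEATURES, NUM_WORDS)

audio = torch.randn(1, AUDIO_FEATURES)    # e.g. spectral features from the microphone
visual = torch.randn(1, VISUAL_FEATURES)  # e.g. features from the mouth region

# In a noisy stadium the audio stream degrades but the visual stream does not,
# so combining them should beat either channel alone.
scores = classifier(torch.cat([audio, visual], dim=1))
best_word = scores.argmax(dim=1)
```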
The main message is that, given the viseme/phoneme mismatch problem, context seems key for lip-reading. Machines’ success here may depend on how much context they can use. It’s also clear that humans don’t always do particularly well, with many factors playing into performance (known/unknown conversation topic, speaker familiarity, language knowledge, etc.). But currently the strengths of humans and machines seem to lie in different tasks, with humans geared towards live, open-vocabulary, in-context situations. Maybe it was unfortunate for Jill Scott, then, that it was millions of humans watching the Euros.
References:
[1] Hughes, K. (2022, August 1). BBC apologise after England star Jill Scott’s X-rated blast as Lioness is fouled by German during Euro 2022 final. The Sun. https://www.thesun.co.uk/sport/football/19383140/bbc-apologise-england-jill-scott-euro-2022/
[2] Fernandez-Lopez, A., Martinez, O., & Sukno, F. (2017). Towards Estimating the Upper Bound of Visual-Speech Recognition: The Visual Lip-Reading Feasibility Database. arXiv:1704.08028
[3] Fernandez-Lopez, A., & Sukno, F. M. (2018). Survey on automatic lip-reading in the era of deep learning. Image and Vision Computing, 78, 53–72. https://doi.org/10.1016/j.imavis.2018.07.002
[4] Condliffe, J. (2016, November 21). AI Has Beaten Humans at Lip-reading. MIT Technology Review. https://www.technologyreview.com/2016/11/21/69566/ai-has-beaten-humans-at-lip-reading/
[5] Assael, Y., Shillingford, B., Whiteson, S., & De Freitas, N. (2016). LipNet: End-to-End Sentence-level Lipreading. arXiv:1611.01599
[6] Cooke, M., Barker, J., Cunningham, S., & Shao, X. (2006). An audio-visual corpus for speech perception and automatic speech recognition. The Journal of the Acoustical Society of America, 120(5), 2421–2424. https://doi.org/10.1121/1.2229005
[7] McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264, 746–748.
Want to get in touch? Want to contribute with your own articles, puzzles, or linguistics memes? https://u-lingua.carrd.co