Soccer aficionados will never forget the headbutt by French soccer great Zinedine Zidane during the 2006 World Cup final. Caught on camera, Zidane’s attack on Italian player Marco Materazzi after a verbal exchange earned him a red card. He left the field, making it easier for Italy to become world champions. The world found out later about Materazzi’s abusive words about Zidane’s female relatives.
“If we had good lip-reading technology, Zidane’s reaction could have been explained or they would’ve both gotten sent out,” says Helen Bear, a computer scientist at the University of East Anglia in Norwich, UK. “Maybe the match outcome would be different.”
Bear and her colleague Richard Harvey have come up with a new lip-reading algorithm that improves a computer’s ability to differentiate between sounds—such as ‘p’, ‘b’, and ‘m’—that all look similar on the lips. The researchers presented their work at the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) in Shanghai.
A machine that reliably reads lips would have uses beyond sport rulings, of course. It could be used to solve crimes or analyze car and airplane accidents based on recorded footage, Bear says. It could help people who go deaf later in life, for whom lip-reading doesn’t come as easily as it does to those born deaf. It could also be used for better movie dubbing.
Lip-reading, or visual speech recognition, involves identifying the shapes that the mouth makes and then mapping those to words. It is more challenging than the audio speech recognition that is common today. That’s because the mouth assumes only between 10 and 14 shapes, called visemes, while speech has 50 different sounds, called phonemes. So the same viseme can correspond to multiple phonemes.
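The many-to-one ambiguity described above can be pictured as a simple lookup table. This is only an illustrative sketch: the viseme names and phoneme groupings below are hypothetical examples, not the actual inventory used by Bear and Harvey.

```python
# Hypothetical viseme-to-phoneme table: one mouth shape can stand for
# several sounds, which is what makes lip-reading ambiguous.
viseme_to_phonemes = {
    "bilabial": ["p", "b", "m"],   # lips pressed together
    "labiodental": ["f", "v"],     # lower lip against upper teeth
    "rounded": ["w", "r"],         # lips rounded
}

def candidate_phonemes(viseme):
    """Return every phoneme a given viseme could represent."""
    return viseme_to_phonemes.get(viseme, [])

print(candidate_phonemes("bilabial"))  # ['p', 'b', 'm']
```

Seeing a “bilabial” shape alone cannot tell the system whether the speaker said ‘p’, ‘b’, or ‘m’; extra context or training is needed to resolve it.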
Bear and Harvey have developed a new machine-learning algorithm that more precisely maps a viseme to one particular phoneme. The algorithm involves two training steps. In the first, the computer learns to map a viseme to the multiple phonemes it can represent. In the second, the viseme is duplicated (three times, say, if it could be ‘p’, ‘b’, or ‘m’), and each copy trains on just one of those sounds.
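The duplication step above can be sketched in a few lines. This is a minimal illustration of the idea only, not the researchers’ implementation; the data structures and names here are assumptions made for clarity.

```python
# Step 1 (assumed representation): a coarse model that maps each viseme
# to the set of phonemes it might represent.
coarse_model = {"V_bilabial": {"p", "b", "m"}}

def split_viseme_units(coarse):
    """Step 2: duplicate each viseme unit, one copy per phoneme, so each
    copy can then be trained on examples of just that one sound."""
    fine_units = {}
    for viseme, phonemes in coarse.items():
        for ph in sorted(phonemes):
            # e.g. the "V_bilabial/p" unit trains only on 'p' examples
            fine_units[f"{viseme}/{ph}"] = ph
    return fine_units

units = split_viseme_units(coarse_model)
print(units)
# {'V_bilabial/b': 'b', 'V_bilabial/m': 'm', 'V_bilabial/p': 'p'}
```

After the split, each fine-grained unit is responsible for exactly one phoneme, which is how the ambiguity of the shared mouth shape gets resolved during the second training pass.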
The data to train the algorithm came from audio and video recordings of 12 speakers (7 men and 5 women) speaking 200 sentences. Bear used known computer-vision algorithms to extract the shapes of the speakers’ mouths. She then labeled the extracted visual data with the appropriate visemes, labeled the audio data with phonemes, and fed both to her training algorithm.
The algorithm identifies sounds correctly 25 percent of the time, an improvement over past methods, Bear says. And word recognition improved by 5 percent on average across all speakers, which, Bear says, is a significant increase given the low accuracy of the lip-reading systems developed so far.