With hundreds of satellite, cable, and terrestrial TV stations on the air, and Internet-based entertainment companies also competing for viewers’ attention, finding something to watch is, strangely, a growing challenge. To help simplify the task, researchers at Japan’s public TV and radio broadcaster Nippon Hoso Kyokai, better known as NHK, plan to begin testing technology to automatically assess in real time a viewer’s interest in a TV program or video and then suggest other programs to watch based on the results.
To gauge what a viewer is interested in, NHK’s system uses a Microsoft Kinect motion- and depth-sensing input device. The Kinect, incorporating Microsoft’s face-tracking software development kit (SDK), feeds images of the viewer to several software modules running on a PC.
The first indication that a viewer is interested is that he is actually in front of the TV. So one module, used to sense whether a viewer is present, extracts “keypoint” trajectories—measured points that track a person’s movements—from a sequence of video frames. About 200 such trajectories can be extracted from a single frame. Features from the trajectories are converted into code words and used to train a machine-learning program to identify the presence of a viewer.
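The presence-detection step described above follows a familiar bag-of-words pattern: trajectory features are vector-quantized into code words, and a histogram of code words becomes the input to a trained classifier. NHK has not published its exact features or codebook, so the sketch below is a minimal illustration of the pattern, with invented displacement features and an illustrative codebook.

```python
# Illustrative sketch of a bag-of-words presence detector over keypoint
# trajectories. The feature choice (frame-to-frame displacements) and the
# codebook are assumptions, not NHK's published design.

def trajectory_features(trajectory):
    """Displacement features for one keypoint trajectory,
    given as a list of (x, y) positions across consecutive frames."""
    return [(x2 - x1, y2 - y1)
            for (x1, y1), (x2, y2) in zip(trajectory, trajectory[1:])]

def nearest_codeword(feature, codebook):
    """Index of the closest codebook entry (vector quantization)."""
    dx, dy = feature
    return min(range(len(codebook)),
               key=lambda i: (codebook[i][0] - dx) ** 2 +
                             (codebook[i][1] - dy) ** 2)

def codeword_histogram(trajectories, codebook):
    """Normalized histogram of code words for a video clip; this vector
    would feed the trained presence classifier."""
    hist = [0] * len(codebook)
    for traj in trajectories:
        for feat in trajectory_features(traj):
            hist[nearest_codeword(feat, codebook)] += 1
    total = sum(hist) or 1
    return [h / total for h in hist]
```

In a real system the codebook would be learned (e.g., by clustering features from training video), and the histogram would be scored by the trained classifier rather than inspected directly.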
Two other modules work in parallel: one estimates the viewer’s 2-D head pose from the Kinect’s color images, the other the 3-D head pose from its depth images. The results of all three modules are then combined to estimate whether or not the viewer is gazing at the screen.
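The fusion step above can be sketched as a simple decision rule: no presence means no gaze, and otherwise the head-pose angles must fall within a cone facing the screen. The thresholds and the averaging of the two pose estimates below are illustrative assumptions; NHK has not published its combination logic.

```python
# Illustrative fusion of the three module outputs into a gaze estimate.
# Angle thresholds and the 2-D/3-D averaging are assumptions.

def gazing_at_screen(present, yaw_2d, yaw_3d, pitch_3d,
                     yaw_limit=20.0, pitch_limit=15.0):
    """True if the viewer is judged to be looking at the screen.
    Angles are in degrees, with 0 meaning directly facing the screen."""
    if not present:          # presence module says nobody is there
        return False
    yaw = (yaw_2d + yaw_3d) / 2.0   # reconcile the two pose estimates
    return abs(yaw) <= yaw_limit and abs(pitch_3d) <= pitch_limit
```

A production system would likely smooth these decisions over time rather than judge each frame independently, since momentary glances away do not mean lost interest.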
“A viewer’s gaze at the screen is important for rating a program’s content,” says Masaki Takahashi, principal research engineer in NHK’s Integrated Broadcast-Broadband Systems Research Division. “Other face-detection-based person detectors tend to fail when a person turns away from the camera. Keypoint trajectory technology is more suitable because it contains a temporally long-term history of a keypoint.”
Two further modules have recently been added to recognize facial expressions. One estimates the intensity of each of six primary expressions—smiling and surprise being the most effective indicators. The other judges the presence or absence of facial expressions by comparing parameters of skin movements against an image database of known facial expressions.
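The comparison described above can be sketched as a template-matching step: measured skin-movement parameters are scored against stored templates for each primary expression, and any sufficiently strong match counts as a detected expression. The similarity measure, parameter vectors, and threshold below are assumptions for illustration; the real system’s parameterization is not public.

```python
import math

# Illustrative sketch of the expression modules: intensity per primary
# expression via template similarity, plus a presence/absence decision.
# Templates, vectors, and the 0.5 threshold are invented for illustration.

PRIMARY = ("happiness", "surprise", "anger", "sadness", "fear", "disgust")

def similarity(a, b):
    """Cosine similarity between two skin-movement parameter vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def expression_intensities(params, templates):
    """Map each primary expression to an intensity score in [0, 1]."""
    return {name: max(0.0, similarity(params, templates[name]))
            for name in PRIMARY if name in templates}

def has_expression(intensities, threshold=0.5):
    """Presence/absence decision: any expression above the threshold."""
    return any(v >= threshold for v in intensities.values())
```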
Based on the viewer’s estimated level of interest while viewing a program, keywords are extracted from the program’s closed caption text and listed on a tablet computer together with emoticons representing any facial expressions detected at the time. (About 70 percent of general NHK programs provide closed captioning.)
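The caption-to-keyword step can be sketched as follows. A real Japanese pipeline would run a morphological analyzer such as MeCab over the caption text; to keep the sketch self-contained, the captions here are assumed to be already tokenized with part-of-speech tags, and the tag names are illustrative.

```python
from collections import Counter

# Illustrative keyword extraction from closed-caption text. Assumes the
# captions have already passed through a morphological analyzer that
# yields (token, part-of-speech) pairs; "PROPN" is an assumed tag name.

def extract_keywords(tagged_tokens, top_n=5):
    """Keep proper nouns (good candidates for viewer interest, per the
    article) and rank them by how often they appear."""
    proper_nouns = [tok for tok, pos in tagged_tokens if pos == "PROPN"]
    return [word for word, _ in Counter(proper_nouns).most_common(top_n)]
```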
Keywords are extracted from the captioning using what’s called morphological analysis. “Proper nouns like people’s names and place names are good candidates for the viewer’s interest,” says NHK senior research engineer Simon Clippingdale. “We link them to the Wikipedia database and to the program’s home page for follow-up searching. We’re also developing a TV program navigation system based on the viewer’s interest.” The system uses an automatically generated TV program map that links a large volume of TV shows and Japanese vocabulary using several kinds of semantic relationships. This would enable a viewer interested in the listed keyword “tempura”, for example, to view several related words on the tablet (such as ingredients, restaurants, or regional specialties), which in turn could link to the programs they are semantically associated with.
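The program map just described amounts to a graph in which keywords link to related vocabulary via typed semantic relations, and each related word links to programs. The sketch below illustrates the tempura example; the relation names, related words, and program titles are all invented placeholders, since NHK’s actual map is not public.

```python
# Illustrative sketch of the TV program navigation map. All relation
# names, words, and program titles below are invented example data.

PROGRAM_MAP = {
    "tempura": {
        "ingredient": ["shrimp", "lotus root"],
        "restaurant": ["tempura restaurants"],
        "regional specialty": ["Edomae tempura"],
    },
}

WORD_TO_PROGRAMS = {
    "shrimp": ["Seafood of Japan"],          # invented program title
    "Edomae tempura": ["Tastes of Old Tokyo"],  # invented program title
}

def related_words(keyword):
    """All words reachable from a keyword via any semantic relation;
    these are what the tablet would display."""
    relations = PROGRAM_MAP.get(keyword, {})
    return [w for words in relations.values() for w in words]

def programs_for(keyword):
    """Programs semantically associated with each of a keyword's
    related words; empty lists where no program is linked."""
    return {w: WORD_TO_PROGRAMS.get(w, []) for w in related_words(keyword)}
```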
This autumn the group will begin testing viewers’ interest in the first stage of the system: providing keywords with links to Wikipedia and the program’s home page. “We want to see how well the system works in ordinary homes and what kind of interest users have,” says Takahashi. “Then we hope to begin testing the program navigation system.”
Several major technical challenges remain before our TVs will know what we want to watch better than we do. For one thing, the system NHK has developed so far works for a single user only and so needs to be extended to handle a whole family. A related issue is how to distinguish whether a viewer is, for instance, laughing at the program or at a companion’s joke.
But Takahashi is confident these challenges can all be overcome. “The technology should be ready for use in two to three years,” he says.