We live in a world of sounds, full of beautiful music, birds chirping, and the voices of our friends. It's a rich cacophony, with blaring beeps, accented alarms, and knock-knock jokes. The sound of a door opening can alert us to a friend's arrival, and a door slamming can alert us to an impending argument.
HEARBO (HEAR-ing roBOt) is a robot developed at Honda Research Institute–Japan (HRI-JP), and its job is to understand this world of sound, in a field called Computational Auditory Scene Analysis.
At this year's IEEE International Conference on Intelligent Robots and Systems (IROS) and RO-MAN, several papers describing HEARBO's latest capabilities were presented.
Spurred by the dream of a futuristic robotic butler, researchers are trying to make robots understand our voice commands, a bit like Apple's Siri but from 2 meters away. Typical approaches use a method called beamforming to "focus" on a sound, such as a person speaking. The system takes that sound, performs noise reduction, and then tries to understand what the person is saying using automatic speech recognition.
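To make the "focus" idea concrete, here is a minimal sketch of delay-and-sum beamforming, the simplest variant of the technique. The array geometry, sample rate, and function names are illustrative assumptions, not HEARBO's actual implementation:

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, in air at room temperature
SAMPLE_RATE = 16000     # Hz, assumed for this sketch

def delay_and_sum(signals, mic_positions, angle_deg):
    """Steer a linear microphone array toward angle_deg by delaying
    each channel so the target direction lines up, then averaging.
    signals: 2-D array, one row per microphone.
    mic_positions: 1-D array of mic positions along a line, in meters."""
    angle = np.deg2rad(angle_deg)
    # Far-field assumption: each mic's delay is proportional to the
    # projection of its position onto the arrival direction.
    delays = mic_positions * np.sin(angle) / SPEED_OF_SOUND
    delays -= delays.min()                       # keep delays non-negative
    sample_delays = np.round(delays * SAMPLE_RATE).astype(int)
    n = signals.shape[1] - sample_delays.max()
    # Align the channels and average: the target direction adds up
    # coherently, while sounds from other directions partially cancel.
    aligned = [sig[d:d + n] for sig, d in zip(signals, sample_delays)]
    return np.mean(aligned, axis=0)
```

A sound arriving from the steered direction is reinforced; off-axis sounds are attenuated, which is exactly the "focus" effect described above.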
The beamforming approach is widely used, but HEARBO takes it a step further. What about when the TV is on, the kids are playing on one side of the room, and the doorbell rings? Can our robot butler handle that? HEARBO's researchers say it can, using their own three-step paradigm: localization, separation, and recognition. This system, called HARK, recovers the original sounds from a mixture based on where each sound is coming from. Their reasoning is that "noise" shouldn't just be suppressed; it should be separated out and then analyzed, since what counts as noise depends heavily on the situation. A crying baby, for example, may be noise, or it may convey very important information.
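The three-step paradigm can be sketched as a simple pipeline. This shows only the shape of the localize-separate-recognize idea with placeholder stage functions; it is not HARK's real API, and all names here are invented for illustration:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class SeparatedSource:
    direction_deg: float   # where the sound came from
    audio: List[float]     # the recovered signal for that source

def scene_analysis(mixture,
                   localize: Callable,
                   separate: Callable,
                   recognize: Callable) -> List[Tuple[float, str]]:
    """Localization -> separation -> recognition: every source,
    'noise' included, is kept, labeled, and available for analysis
    instead of being suppressed."""
    directions = localize(mixture)            # step 1: where is each sound?
    sources = separate(mixture, directions)   # step 2: pull each one out
    return [(s.direction_deg, recognize(s.audio)) for s in sources]
```

The key design point is that the recognizer runs on every separated stream, so the doorbell, the TV, and the kids each get their own label rather than being lumped together as background noise.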
At IROS 2012, Keisuke Nakamura of HRI-JP presented his new super-resolution sound source localization algorithm, which allows sounds to be localized to within 1 degree of accuracy. For example, it could precisely detect the location of a human calling for help in a disaster situation.
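Nakamura's super-resolution algorithm itself is beyond a short snippet, but the underlying idea of locating a sound by comparing microphone signals can be illustrated with a basic two-microphone time-difference-of-arrival estimate. This is a generic textbook approach, not the paper's method, and the names and parameters are assumptions:

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s
SAMPLE_RATE = 16000     # Hz, assumed

def tdoa_direction(sig_left, sig_right, mic_distance):
    """Estimate a source direction from the time-difference-of-arrival
    between two microphones, found via cross-correlation.
    Positive angles point toward the right microphone."""
    corr = np.correlate(sig_left, sig_right, mode="full")
    lag = np.argmax(corr) - (len(sig_right) - 1)   # in samples
    tdoa = lag / SAMPLE_RATE                       # in seconds
    sin_theta = np.clip(tdoa * SPEED_OF_SOUND / mic_distance, -1.0, 1.0)
    return np.degrees(np.arcsin(sin_theta))
```

With only two mics, resolution is limited by the sample rate and mic spacing; super-resolution methods get well below that limit by exploiting the signal structure across many microphones.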
Using the methods developed by Kazuhiro Nakadai's team at HRI-JP, up to four simultaneous sounds or voices can be detected and recognized in practice. In theory, with eight microphones, up to seven different sound sources can be separated and recognized at once, something that humans, with only two ears, cannot do.
Understanding the mixture
Nakamura has also taught HEARBO about the concepts of music, human voice, and environmental sounds. For instance, HEARBO was trained on many different songs to learn the general characteristics of music. Upon hearing a song it's never heard before, it can say, "I hear music!" This means that HEARBO can tell the difference between a human giving it commands and a singer on the radio.
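As an illustration of the general idea (not the actual features or training method used for HEARBO), here is a toy sound-type classifier: it averages a couple of simple acoustic features over labeled training clips, then labels new audio by the nearest class average. All names and feature choices are assumptions for the sketch:

```python
import numpy as np

def sound_features(clip, sr=16000):
    """Two toy features: spectral centroid ('brightness') and
    zero-crossing rate. Real systems use much richer features."""
    spectrum = np.abs(np.fft.rfft(clip))
    freqs = np.fft.rfftfreq(len(clip), 1.0 / sr)
    centroid = np.sum(freqs * spectrum) / (np.sum(spectrum) + 1e-12)
    zcr = np.mean(np.abs(np.diff(np.sign(clip)))) / 2.0
    return np.array([centroid / sr, zcr])  # scaled to comparable ranges

class NearestCentroidSoundClassifier:
    """Average the feature vectors of labeled training clips, then
    assign new clips to the closest class centroid."""
    def fit(self, labeled_clips):
        # labeled_clips: {"music": [clip, ...], "environmental": [...], ...}
        self.centroids = {
            label: np.mean([sound_features(c) for c in clips], axis=0)
            for label, clips in labeled_clips.items()
        }
        return self

    def predict(self, clip):
        f = sound_features(clip)
        return min(self.centroids,
                   key=lambda lbl: np.linalg.norm(f - self.centroids[lbl]))
```

The point of training on many examples per class is exactly what the paragraph describes: the classifier generalizes to songs it has never heard, because it has learned what music sounds like on average.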
In this video, three sound sources are positioned around HEARBO: to the robot's right is a beeping alarm clock; in front of it, a speaker plays music; to its left, a person speaks. The first thing the robot does is capture all the sounds, recognize them, and determine where each is coming from. Then HEARBO focuses its attention on each source one by one. Watch:
Robots also have some special robot-y hurdles to overcome. For example, when robots move, their motors whir and grind, distorting the sound they're trying to hear. Just like how our human hearing system filters out the sound of our own heartbeat, robots need to remove this self-generated noise. This is called ego-noise suppression and is the research focus of Gökhan Ince. How do they do it? They embed microphones in HEARBO's body, and subtract that internal noise in real-time from what the robot hears through its head mics. This robustness to noise is essential for helping robots decipher their world.
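The exact algorithms are in the papers, but the core subtraction step can be sketched with simple spectral subtraction, using a body-internal mic as the noise reference. The frame size, function name, and single-reference setup are illustrative assumptions, not HEARBO's actual implementation:

```python
import numpy as np

def suppress_ego_noise(head_mic, body_mic, frame=256):
    """Toy spectral subtraction: subtract the magnitude spectrum of a
    body-internal reference mic (which hears mostly motor noise) from
    the head mic, frame by frame, keeping the head mic's phase."""
    out = np.zeros_like(head_mic, dtype=float)
    for start in range(0, len(head_mic) - frame + 1, frame):
        h = np.fft.rfft(head_mic[start:start + frame])
        b = np.fft.rfft(body_mic[start:start + frame])
        # Subtract the noise magnitude, flooring at zero so we never
        # invert the sign of a frequency bin.
        mag = np.maximum(np.abs(h) - np.abs(b), 0.0)
        out[start:start + frame] = np.fft.irfft(
            mag * np.exp(1j * np.angle(h)), n=frame)
    return out
```

Because the body mic sits next to the motors, its spectrum approximates the self-generated noise, and subtracting it leaves the external sounds the robot actually wants to hear.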
In this new dancing-robot video, developed in collaboration with INESC Porto and LIACC-FEUP, even while HEARBO is listening to the music and grooving its motors to the beat, it can still understand its human's commands. This robot is not just a great dancer; it shows off the capabilities of a hearing robot adeptly handling a plethora of sounds, just as in real life.
"Real-time Super-resolution Sound Source Localization for Robots" by Keisuke Nakamura, Kazuhiro Nakadai, and Gökhan Ince from Honda Research Institute, and "Live Assessment of Beat Tracking for Robot Audition" by João Lobato Oliveira, Gökhan Ince, Keisuke Nakamura, Kazuhiro Nakadai, Hiroshi G. Okuno, Luis Paulo Reis, and Fabien Gouyon from INESC Porto, LIACC-FEUP, HRI-JP, and Kyoto University were presented at IROS 2012 in Vilamoura, Portugal.