Shimi Will Now Sing to You in an Adorable Robot Voice

Deep learning helps one of Georgia Tech's musical robots to understand humans and sing to them


Evan Ackerman is IEEE Spectrum’s robotics editor.

Shimi robot
Photo: Georgia Tech

Human-robot interaction is easy to do badly, and very difficult to do well. One approach that has worked well for robots from R2-D2 to Kuri is to avoid the problem of language—rather than use real words to communicate with humans, you can do pretty well (on an emotional level, at least) with a variety of bleeps and bloops. But as anyone who’s watched Star Wars knows, R2-D2 really has a lot going on with the noises that it makes, and those noises were carefully designed to be both expressive and responsive.

Most actual robots don’t have the luxury of a professional sound team (and as much post-production editing as you need), so the question becomes how to teach a robot to make the right noises at the right times. At Georgia Tech’s Center for Music Technology (GTCMT), Gil Weinberg and his students have a lot of experience with robots that make noise of various sorts, and they’ve used a new deep learning-based technique to teach their musical robot Shimi a basic understanding of human emotions, and how to communicate back to those humans in just the right way, using music.

Shimi has been around since 2012, and initially, it was programmed to play music that it would analyze in order to dance along with its few (but effective) degrees of freedom. Getting Shimi to recognize and respond to human emotions has been a much more difficult task. The fact that Shimi doesn’t use words makes it more difficult for the robot to say the wrong thing, but it makes saying the right thing more difficult as well.

Teaching Shimi to improvise effectively in voice, tone, and motion has involved training it on lots and lots (and lots) of data. To develop Shimi’s behaviors, a deep neural network ingested the following:

  • 10,000 files from 15 improvisational musicians playing responses to different emotional cues
  • 300,000 samples of musical instruments playing different musical notes, to add musical expressivity to the spoken word
  • One of the rarest languages in existence—a nearly extinct Australian aboriginal vernacular made up of 28 phonemes

That last thing jumps out a little bit, doesn’t it? But it’s key to Shimi’s vocalizations, because it’s how the robot makes sure that nobody (or, almost nobody) can ascribe any direct meaning at all to the noises that it’s making. For more on this, we spoke with Gil Weinberg via email.

IEEE Spectrum: Can you tell us more about using “a nearly extinct Australian aboriginal vernacular made up of 28 phonemes”?

Gil Weinberg: To generate vocalizations that focus on emotions devoid of all semantic meaning, we chose to construct a new vocabulary based on the Australian aboriginal language Yuwaalaraay, a dialect of the Gamilaraay language. Richard Savery, one of the students who worked on this project along with Ryan Rose, researched languages from his native country of Australia, and found that of the 250 aboriginal languages that existed before the European invasion, today only 13 are not extinct or close to extinction. The Yuwaalaraay language is a great fit for our project since it has no written tradition and is mostly based on sound and prosody. It also allows us to evaluate our approach with Western subjects without biasing them with any semantic connotations.

Why not have Shimi speak in real words? How does that decision fit in with the rest of Shimi’s design?

There are a couple of issues we see with robots that attempt to look and sound like humans. If they become similar to humans, but not quite, they fall into the “Uncanny Valley,” where they are perceived as eerie and revolting. If they are able to pass the Uncanny Valley and become indistinguishable from humans, it can lead to ethical issues. Humans, in general, would like to know if they are talking to a robot or a human (see, for example, Google Duplex, a software-based robot that can make phone calls to restaurants, fooling its human counterparts into thinking they are talking with a human).

We therefore try to make our robots distinguishable from humans in the way they communicate. Since music is such an emotive and expressive medium, at our Robotic Musicianship Lab at GTCMT, we are particularly excited about exploring new music-driven approaches to communicating emotions and moods, both in voice and gestures. Shimi, just like our other robots, doesn’t try to look and sound like a human. This allows us to explore interesting research questions, such as how to use only a few degrees of freedom and a non-linguistic voice to communicate expressively with humans.

How does Shimi analyze the emotions of humans that it interacts with, and how good is it at doing so?

For emotion detection we use the two-dimensional emotion classification model of valence and arousal. We determine valence by semantically analyzing spoken language, looking for words that represent positive and negative feelings. For arousal, we analyze the human voice for loudness, intensity, pitch contour, and signal energy. The classification is pretty reliable, although currently it only classifies four basic emotions on the valence-arousal grid: happy, sad, angry and calm.
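To make the valence-arousal grid concrete, here is a minimal Python sketch of how two such scores could map onto the four emotions Weinberg lists. It is an illustration under stated assumptions, not the Georgia Tech team's code: the word lists, the RMS-energy stand-in for arousal, the 0.1 threshold, and all function names are hypothetical, and a real system would also use pitch contour and a far richer language model.

```python
import math

# Hypothetical toy lexicons (assumption); a real system would use a far
# larger resource or a learned sentiment model.
POSITIVE_WORDS = {"great", "love", "happy", "wonderful", "good"}
NEGATIVE_WORDS = {"hate", "terrible", "sad", "awful", "bad"}


def valence_from_text(text):
    """Score text in [-1, 1] from counts of positive and negative words."""
    words = [w.strip(".,!?;:").lower() for w in text.split()]
    pos = sum(w in POSITIVE_WORDS for w in words)
    neg = sum(w in NEGATIVE_WORDS for w in words)
    total = pos + neg
    # No emotional words at all is treated as neutral (0.0).
    return 0.0 if total == 0 else (pos - neg) / total


def arousal_from_samples(samples, threshold=0.1):
    """Return RMS signal energy minus a threshold (assumed value).

    Only energy is used here; loudness, intensity, and pitch contour
    from the article are left out to keep the sketch short.
    """
    if not samples:
        return -threshold
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms - threshold


def classify_emotion(valence, arousal):
    """Map a (valence, arousal) pair to one quadrant of the grid."""
    # Ties at exactly 0 fall into the positive / high-arousal side.
    if valence >= 0:
        return "happy" if arousal >= 0 else "calm"
    return "angry" if arousal >= 0 else "sad"


if __name__ == "__main__":
    loud = [0.5, -0.5, 0.5, -0.5]  # synthetic samples in [-1, 1]
    print(classify_emotion(valence_from_text("I love this!"),
                           arousal_from_samples(loud)))
```

The quadrant layout (positive and loud is happy, positive and quiet is calm, negative and loud is angry, negative and quiet is sad) follows the usual reading of the valence-arousal model; where exactly the boundaries sit is a design choice this sketch simply fixes at zero.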

Do you think the kind of musical expressiveness that Shimi has learned would be appealing across ages and cultures, or would you need different training data for the robot to be more effective?

This is a great question, which we hope to explore in the very near future. We just submitted a research proposal that includes the creation of a large musical dataset, from a wide variety of cultures around the globe, with emotionally tagged utterances that could help us learn about the correlation between music, prosody, and emotions across different cultures. I promise to get back to you on that once we have results.

What’s the long-term goal of this kind of research?

In a world where we are surrounded by robots that need to communicate their state of mind, mood, and emotion, prosody and gestures seem like a great, subtle back channel for doing so, since humans cannot really process multiple linguistic channels effectively. So our long-term goal for this research would be to scale our system to large groups of robots.

[ GA Tech ]
