Steven Cherry: Hi, this is Steven Cherry for IEEE Spectrum’s “Techwise Conversations.”
Check this out—listen carefully, because it’s short:
That person sounded sad, right? Let’s try another one.
Six hundred five!
Definitely not sadness. Could you tell? It was pride. Listen again.
Six hundred five!
How can we hear emotions? It’s not what people are saying—here it’s just numbers, a date in one case—it’s how they’re saying it. And we can all do this. In fact, not being able to do this is a psychological disability. Well, if we can do it reliably, the next step—this is the Information Age, after all—is teaching a computer to do it too.
A big step in that direction was announced at the IEEE Workshop on Spoken Language Technology, held in Miami earlier this month [December]. A paper there with the imposing title “Speech-Based Emotion Classification Using Multiclass SVM With Hybrid Kernel and Thresholding Fusion,” by six coauthors, described software that can detect six different emotional states with 81 percent accuracy.
Some of the research was done during a summer internship at Microsoft Research, by lead author Na Yang, a fourth-year grad student at the University of Rochester. Her Ph.D. supervisor, Wendi Heinzelman, a professor of electrical and computer engineering, was one of the paper’s coauthors and is my guest today. She joins us by phone.
Wendi, welcome to the podcast.
Wendi Heinzelman: Thank you for having me.
Steven Cherry: Wendi, the strategy here involved chopping up speech into intervals of 60 milliseconds, capturing two variables, loudness and pitch, and then measuring them in a bunch of different ways. You looked at their absolute values and how much they changed from one interval to the next, and then at something the paper calls “formants.” What are formants?
Wendi Heinzelman: So, formants are basically the frequency content. So we’re looking at, as you said, pitch and energy, but we’re also looking at kind of the frequency and bandwidth information in the voice signal as well.
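As a rough illustration of the 60-millisecond intervals mentioned above, the sketch below splits a signal into frames and computes one loudness value (RMS energy) per frame. The function name `frame_energies`, the non-overlapping frames, the 8 kHz sample rate, and the synthetic tone are all assumptions made for this example; the interview does not say how the paper frames the signal or measures energy.

```python
import math

def frame_energies(samples, sample_rate, frame_ms=60):
    """Split samples into non-overlapping frames and return each frame's RMS energy."""
    frame_len = int(sample_rate * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame is shorter than one sample")
    energies = []
    # Step through the signal one whole frame at a time; a trailing partial frame is dropped.
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return energies

# Hypothetical input: one second of a 200 Hz tone at an assumed 8 kHz sample rate.
rate = 8000
tone = [0.5 * math.sin(2 * math.pi * 200 * n / rate) for n in range(rate)]
energies = frame_energies(tone, rate)
```

A real recording would produce a different energy value in each frame, and it is that frame-to-frame variation that the later statistics are computed on.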
Steven Cherry: Emotion seems so complex, yet it comes down to just two things: Loudness is just the volume in decibels, I guess, and pitch is where a sound is on a scale. So you’re looking at a sound’s quality and quantity in the most basic ways. According to the paper, other related research has been more complicated. Were you surprised at how much mileage you could get from the simpler approach?
Wendi Heinzelman: We were. Actually, the features, it’s not just the energy and pitch, but it’s really looking at the changes over time. So if you just look at a segment of pitch and a segment of energy, that really wouldn’t tell you as much as how they change over time: what the mean and the standard deviation are, and, as you said, the range, which we’re looking at as well. All of this information combined is what gave us a model. We can basically train our model to learn what those features look like in angry speech and in happy speech, etc.
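The statistics described here (mean, standard deviation, range, and change from one interval to the next) can be sketched as a small summary function. This is only an illustration under stated assumptions: the name `summarize_track`, the choice of population standard deviation, and the sample pitch values are hypothetical, and the paper's actual feature set is not given in the interview.

```python
from statistics import mean, pstdev

def summarize_track(values):
    """Summarize one per-interval track (e.g. pitch in Hz, or energy) over an utterance."""
    if len(values) < 2:
        raise ValueError("need at least two intervals to compute changes over time")
    # Change from each interval to the next.
    deltas = [b - a for a, b in zip(values, values[1:])]
    return {
        "mean": mean(values),
        "std": pstdev(values),
        "range": max(values) - min(values),
        "delta_mean": mean(deltas),
        "delta_std": pstdev(deltas),
    }

# Hypothetical pitch values in Hz, one per 60-millisecond interval.
pitch_track = [200.0, 210.0, 190.0, 220.0]
features = summarize_track(pitch_track)
```

Running the same function over the energy track would give a second set of numbers, and the two sets together form the kind of feature vector a classifier can be trained on.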
Steven Cherry: One advantage to the simpler approach is that the software can be run on a smartphone; there’s an app. The computational load isn’t too large. . . .
Wendi Heinzelman: Exactly. There were a couple of reasons we went with this approach first. There’s actually an inherent privacy that this gives you. We’re not actually looking at what people say, and we don’t need to record what people say, so rather than actually recording your voice, we can actually directly record the pitch characteristics and the energy and some of the frequency content that we need, and we never have to store your voice and we never have to save your voice.
So, originally, this research was started to help psychologists do a better job of analyzing interactions between humans, between a husband and a wife, or between parents and a child, to determine the emotional content in those interactions. And so rather than having to record everything that’s said, we wanted to figure out if it’s possible to just look at how it’s being said and look at the features and determine what emotions are being displayed, so it gives us a little bit of privacy. So it cuts down on storage, it cuts down on transmission, and it can save power, and then finally, as you note, this allows the computation to be much faster than in other approaches that use whole segments of speech or use other methods, such as people have looked at video to do facial recognition or other types of inputs to try and detect emotions. So we can do it relatively quickly.
Steven Cherry: So you started with a speech database, from seven professional actors, from, I guess, about 150 utterances in all. Now, you were measuring accuracy within that database. What I mean is that you can’t yet detect the different emotions in someone else’s speech, like mine right now, at 80 percent.
Wendi Heinzelman: That’s exactly correct. So we tested the algorithm, we trained the algorithm, on the speech from the actors in this database, where we have the ground truth. We know what the emotion is, and then we also test it on ones that we haven’t used as part of the training set, and, of course, vary which ones you use as training data and which ones you use as testing data, to make sure you didn’t just happen to use a really bad set for training and testing and vice versa.
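One common way to vary which samples are used for training and which for testing is k-fold cross-validation. The interview does not name the exact scheme the team used, so the sketch below is an assumption, and `k_fold_splits` is a hypothetical helper written for this example.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) pairs; each sample is held out exactly once."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    for fold in range(k):
        test = indices[fold::k]  # every k-th sample, starting at `fold`
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test

# Hypothetical example: 10 labelled utterances, 5 folds.
splits = list(k_fold_splits(n_samples=10, k=5))
```

In practice the samples would usually be shuffled first, and with a small number of actors it may also matter whether the same speaker appears in both the training and the testing set.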
So all of those results are based on the samples in that database, where we have ground truth. One of the really difficult things that we’re finding, and that other researchers are finding too, because there are a lot of people who look at this field and study it from different perspectives, is getting ground truth. You need to know what it is about anger that makes it anger, and how to train your system on those particular speech characteristics, and it’s really difficult to get ground truth. So we are actually working with our psychologists right now to have psychology students act out emotions, or to use real interactions that they’re studying in their lab, and then have human coders go back and review those interactions and actually hand label what the emotions are.
Now, this is actually how psychologists get emotions out right now: They train human coders, and typically I think the gold standard is to use three human coders for every interaction that they want to code, and as long as two of them agree, they select that as the emotion. But inherently it’s a difficult problem.
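The two-out-of-three rule described here can be written as a simple majority vote. This is a minimal sketch: `consensus_label` is a hypothetical name, and returning `None` when the coders do not agree is an assumption, since the interview does not say what psychologists do in that case.

```python
from collections import Counter

def consensus_label(coder_labels, min_agree=2):
    """Return the label at least `min_agree` coders chose, or None if there is no such label."""
    if not coder_labels:
        return None
    label, count = Counter(coder_labels).most_common(1)[0]
    return label if count >= min_agree else None

# Hypothetical labels from three human coders for two interactions.
agreed = consensus_label(["anger", "anger", "disgust"])
disputed = consensus_label(["anger", "sadness", "disgust"])
```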
Steven Cherry: Now, the emotions were anger, sadness, disgust, then there was a neutral state, happiness, and fear. Can you say anything about how, say, disgust is different from anger, on the one hand, and sadness on the other?
Wendi Heinzelman: People who study emotions have put these on scales of positive emotion versus negative emotion, and active versus passive. There are a lot of ways people look at emotion and actually try to define emotion, and that in and of itself is sort of a whole research topic. I like the scale where you look at emotions as positive, negative, active, and passive, because that really describes a lot of what we’re looking for. Is it something that’s a good emotion? A positive emotion? Something like happiness or excitement? Or is it a negative emotion? Something like disgust or fear? And then active versus passive is really how much sort of emotional and physiological involvement you have in that emotion. Being content, for example, is a positive emotion but not very active, whereas excitement clearly is a very active emotion and also positive.
Steven Cherry: Back in 2005, I wrote a piece for the magazine about customer call centers that were using software to detect anger in the voices of callers. I guess if anger rises during a call, that signals a supervisor that a call is going badly. I guess soon our cars will be able to detect road rage, for example, with an app like this?
Wendi Heinzelman: Right, and that would be a fabulous application for something like this. And, again, road rage is something that’s very negative and very active, and the stronger emotions tend to be the easier ones to detect. And so an application that looked just at anger, a strong anger, sometimes it’s called “hot anger,” versus not-hot anger, is easier to do than something that’s looking at this whole range of emotions.
And, in fact, the way we actually do our emotion classification is we have a bunch of individual classifiers, they’re called “one against all,” where we have a classifier that says, is this emotion anger or not anger? Is it sadness or not sadness? Is it disgust or not disgust, etc.? And so those classifiers come out with sort of a decision of whether this emotion should be classified as anger and with what confidence it should be classified as anger. Now, what we have to do, following that, is fuse all of those individual classifier-level decisions into a final decision, because you can imagine, for example, that something might come out as “yes, this is anger” and “yes, this is disgust,” because if you notice, anger and disgust have very similar characteristics. So we have a fusion algorithm that tries to figure out, well, which one is more confident? Which classifier gave more confidence that it really is that emotion?
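The fusion step described here, choosing among several "one against all" classifiers by confidence, can be sketched as follows. This is not the paper's thresholding-fusion algorithm, whose details are not given in the interview; the function `fuse_decisions`, the 0.5 threshold, and the confidence scores are all assumptions made for illustration.

```python
def fuse_decisions(confidences, threshold=0.5):
    """Pick the emotion whose one-against-all classifier is most confident.

    `confidences` maps each emotion to that classifier's confidence that the
    utterance carries the emotion. Returns None when even the most confident
    classifier falls below `threshold`.
    """
    if not confidences:
        return None
    best = max(confidences, key=confidences.get)
    return best if confidences[best] >= threshold else None

# Hypothetical scores: both the anger and the disgust classifiers say "yes".
scores = {"anger": 0.81, "disgust": 0.64, "sadness": 0.10}
decision = fuse_decisions(scores)
```

Declining to label an utterance when no classifier is confident trades coverage for accuracy, which is one plausible reason to include a threshold at all.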
Steven Cherry: Your paper also involves maybe a therapist or clinician using an app like this to diagnose psychological disorders. Do you think that would work out?
Wendi Heinzelman: Well, that’s exactly what we’re trying to use this for. As I mentioned before, right now, when psychologists want to study human interactions, they’ll bring in people into the lab and have them maybe have a discussion. So, for example, the psychologists we’re working with are studying the interactions between teenage children and their parents, and they’ll bring the child and the parents into the lab and have them discuss a happy topic, something like someone’s birthday or a vacation they just took. And then they’ll discuss a topic that should elicit stress, so maybe staying out past midnight, or spending too much money, or not doing well in school, or something that the psychologists feel will elicit sort of negative emotions.
And right now, the way that they have to analyze these situations is to have human coders go in and look at all of the data and figure out what emotions were present at different points in time in the interaction. So we’re trying to replace that with microphones put into the room so that we automatically detect a voice, and automatically detect what the emotions are, and automatically code those. Now, the gold standard for psychologists, my understanding in speaking with my psychologist colleagues, is that rather than bringing people into the lab, they really want to know how people interact and how they sort of have discussions and sort of have arguments in their natural environments, i.e., in the home, or out at the supermarket, or in the car.
They’d like to have this information in people’s daily lives, and obviously people don’t want you going in and putting cameras in their homes and recording everything they say, so one of our goals is if this system becomes accurate enough, then we could put microphones in the home or use a smartphone type of application, and, again, not record what people are saying, but record the features of the voice so we can figure out what the emotion levels are at different points in time.
Again, right now we’re only at 80 percent, so this is not 100 percent accurate. This is not something we can guarantee. When it tells you this is the emotion, you’re not guaranteed to have that emotion. We’re getting closer and doing more research, and there are still a lot of folks out there in the emotion-classification research area who are working on this, so I do believe we will get there, but we’re not there yet.
Steven Cherry: One thing you mentioned was the ground truth problem; it’s kind of like a chicken-and-egg problem.
Wendi Heinzelman: Right, right.
Steven Cherry: It did occur to me that although you’re using actors now in developing the app, someday actors could use an app like this to train themselves to display emotions better.
Wendi Heinzelman: Oh, interesting, yeah.
Steven Cherry: The work was shown at this IEEE workshop on Dec 5. Were you in Miami for the presentation?
Wendi Heinzelman: No, my graduate student Na Yang presented, and presented a demonstration system.
Steven Cherry: Have you talked to her? Do you know how it went?
Wendi Heinzelman: Yeah, she said it went very well. There were a number of people doing similar types of applications with either emotion or using similar techniques with training systems and models for doing speaker recognition, and even speech recognition and understanding, and things like that. So there was a lot of interesting research going on, and she said she learned a lot about sort of the latest and greatest in speech technology.
Steven Cherry: Excellent. Well, thanks, Wendi, to you and Na Yang, and to the rest of the team. It’s a great piece of research, and thanks for being on the show to tell us about it.
Wendi Heinzelman: You’re welcome. Thank you for having me.
Steven Cherry: We’ve been speaking with University of Rochester professor Wendi Heinzelman, who’s part of a team of researchers there teaching computers to hear the emotions in human speech.
For IEEE Spectrum’s “Techwise Conversations,” I’m Steven Cherry.
NOTE: Transcripts are created for the convenience of our readers and listeners and may not perfectly match their associated interviews and narratives. The authoritative record of IEEE Spectrum’s audio programming is the audio version.