Why the Kinect Connected With Game Players

The Kinect is the fastest-selling consumer product of all time, thanks to Microsoft Research

Steven Cherry: Hi, this is Steven Cherry for IEEE Spectrum’s “This Week in Technology.”

Aided by the holiday shopping season, Microsoft sold 10 million units of its new Kinect in three months. According to the folks at Guinness World Records, the Kinect is by far the fastest-selling consumer product of all time.

Just to put these numbers in context, Apple sold 3.3 million iPads in its first 90 days, one-third as many as Kinect. To be sure, the size of a retail blockbuster only seems to get bigger—the original iPhone was a huge success at 1 million phones in its first full quarter, and back in the medieval era of the 1990s, a mere 350 000 brave homeowners bought DVD players in the format’s first year.

The way the Kinect has connected with gamers is quite an achievement for a company better known for its software than its hardware. And yet the story behind the product’s development is perhaps even more remarkable. It starts with Microsoft itself, which quietly employs 850 Ph.D.s, from software engineers to sociologists, in labs in six countries on three continents. Apple, by contrast, invests no money at all in basic research. For the original iPod, for example, product designers and engineers there combined a 1.8-inch hard drive invented by Toshiba, a scroll wheel invented by Synaptics, software from one tiny company, and an interface largely designed by another.

The reason companies like Apple don’t put money into basic research is that the technology breakthroughs that come out of research labs often benefit competitors almost as much as they do the companies that do the work. They can take a decade or more to work their way into products, and many of them never do. That didn’t happen with the Kinect.

My guest today is Alex Kipman. He’s the general manager of incubation for the Xbox and was a key member of the Kinect development team. Though he’s in the Xbox group and not in Microsoft Research, which is a separate organization, he’s the holder of more than 60 patents and an IEEE member. Alex, welcome to the podcast.

Alex Kipman: Thank you for having me. Happy to be here.

Steven Cherry: Alex, before we get to the story behind the device, let’s get to the Kinect itself. I’m going to ask you to explain it, because it’s not the easiest device to explain.

Alex Kipman: The way to look at it is, technology has really complicated our lives and put incredible new demands on us. With Kinect, you can really just step in front of the sensor, and as soon as you do, the sensor recognizes you; it knows the difference between you, your loved ones, and your family. As you start moving, it understands your body movement. It sees you head to foot, and it tracks every joint in your body, so if you see a ball and you want to kick it, you can simply kick it. There is no interface to it; you can just do what comes naturally and normally for you in real life. Finally, we have voice recognition, so that if you see something, you can just say it. You can be watching a movie through Zune or Netflix, and you can just say something like “Xbox pause” or “Xbox play,” “Xbox fast forward,” or any number of commands to allow you to interact with your entertainment experiences in a more natural and easy way.

Steven Cherry: So when you say the sensor we’re talking about—sort of like a camera that detects—well, first of all, it recognizes you, and then it detects your motion and interprets those motions through software, and then there’s also voice recognition as well?

Alex Kipman: Yes, the thing about it is what we’ve done is we have a sensor. The sensor—you can think of it as introducing eyes and ears into the system. We have several eyes—some that see in color, some that see in the infrared range—so that we can see in ambient-light-invariant ways. We don’t care if it’s brightly lit or a superdark environment; we still see appropriately. And we have four microphones that serve as ears so that we can appropriately position sounds from different sound sources, different mouths in the room—even though in living rooms people will hopefully be having fun, at which point there’s a lot of ambient noise, including loud noises coming out of speakers, and we can still break through all of that noise and focus on the signal of the people speaking. Now, in a sense, the eyes and ears aren’t really that interesting, and they’ve existed for quite a long time. The thing that makes Kinect really innovative and new is how we’ve associated a brain with the eyes and ears, so that we can take all of the noise of the real world and translate it into a strong signal that, once it’s gone through the different parts of the Kinect brain, can transform into the simple and intuitive experiences that we brought to market last November.

Steven Cherry: Yeah. So I guess I’ve started to think about the Kinect as a superintelligent joystick without the joystick. Maybe you could just sort of tell us about the superintelligence part.

Alex Kipman: Sure. The brain behind Kinect is in a way a series of sophisticated algorithms that will range from machine learning to computer vision to signal processing design, to really take any number of these noisy environments, be it the acoustics of the room or the visual characteristics of the room, and translate that into identity recognition—being able to know who you are into full body motion, being able to understand precisely the motion of your body and ultimately to be able to translate noise into voice recognition.
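[Editor’s note: One ingredient widely described in the published Kinect body-tracking work is a depth-invariant offset feature: probe points around a pixel are scaled by that pixel’s depth, so the same body part produces a similar response whether the player stands near or far. The sketch below is purely illustrative, with a synthetic depth map and made-up numbers, not Microsoft’s actual code.]

```python
import numpy as np

def depth_offset_feature(depth, x, y, u, v):
    """Compare the depth at two probe points around pixel (x, y).

    The offsets u and v are divided by the depth at (x, y), so the
    probes span the same real-world distance whether the person is
    near or far: the depth-invariance trick used in body-part
    classification for depth cameras.
    """
    d = depth[y, x]

    def probe(off):
        px = int(round(x + off[0] / d))
        py = int(round(y + off[1] / d))
        if 0 <= py < depth.shape[0] and 0 <= px < depth.shape[1]:
            return depth[py, px]
        return 1e6  # off the image: treat as background (very far)

    return probe(u) - probe(v)

# Synthetic scene: background at 8 m, a "person" patch at 2 m.
depth = np.full((20, 20), 8.0)
depth[5:15, 8:13] = 2.0

# A probe reaching sideways off the body sees background (a large
# positive response); the same feature on a background pixel is flat.
on_body = depth_offset_feature(depth, x=10, y=10, u=(8, 0), v=(0, 0))
on_bg = depth_offset_feature(depth, x=2, y=2, u=(8, 0), v=(0, 0))
```

In a real system, thousands of such cheap features feed learned classifiers (the machine-learning stage Kipman mentions) that label every pixel with a body part; joint positions then come from the labeled clusters.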

Steven Cherry: So what would you say are the key technologies that came out of the research organization?

Alex Kipman: On voice recognition: There exists no system before Kinect where you don’t have to push a button to talk. If you think about your phone, about cars that have voice recognition, or about voice recognition on a PC, these are usually systems where you need to press a button before the system starts listening to you. In our case, there are no buttons and no controllers; the system is listening to you 100 percent of the time. If you’re in the business of doing speech recognition, errors will usually compound over time, which is why you have push-to-talk systems. In our case, the system can be on for days, for weeks, for months, and it’s always listening. From that perspective, that’s a significant challenge in terms of speech recognition design, something that we had to solve in partnership with research.

If you think about voice recognition systems, people are usually a meter or a couple of feet away from the microphone. They’re captive—they’re in front of a steering wheel, they have a phone near their mouth, or they’re in front of a PC with microphones not more than a foot away from them. In our case, people are usually standing, you know, 12, 20 feet away from the television; they’re standing in different positions in the room, and we need to be able to localize the sound to their mouths at these large distances. Again, that’s a problem in science-fiction land that we partnered with research deeply to be able to achieve with Kinect.

Finally, most voice recognition systems quiet the environment to listen to the person speaking. When you press that button to talk, everything mutes, and the environment is expected to be quiet before you can actually speak. In our world, we have over 90 dB [decibels] of ambient noise—either because people are speaking or having fun in the living room, or because you have noise coming out of the speakers. And still, while you have explosions in the movie coming out of your 7.1 speakers, you should be able to say “Xbox pause,” without pressing a button, from about 12 feet away. And the system should still recognize you and do the appropriate thing.

So the Kinect system in a way works very similarly to your brain, in that it uses historical information of learned behavior in its past to comprehend the present and what it’s seeing right now, based to a large extent on preassociations that say, “Hey, based on my previous history, I can associate what I’m seeing in the present.”

And then finally, in the identity recognition realm—you know, identity recognition is a field of research that has existed for quite some time, and it’s not surprising that you haven’t seen consumer electronics products that have enabled this feature in the past. Darwin tends to be against you in living rooms. The people you’re trying to identify are actually genetically very similar—they tend to be siblings, brothers, sisters—so if you’re trying to do traditional facial-recognition-type algorithms, you’re going to find that they work in the lab; they don’t work so great when you’re varying the ambient light conditions, when people are sitting far, far away, when they’re actually siblings, and so on. Our system, again, has to be able to not only recognize siblings, but it has to recognize them at 12 feet or more away from the television, in semidark rooms.
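[Editor’s note: Localizing a speaker with a microphone array rests on time difference of arrival: the same sound reaches each microphone at a slightly different instant, and that delay fixes a bearing toward the speaker’s mouth. A minimal two-microphone sketch follows; the sample rate and microphone spacing are illustrative assumptions, not Kinect’s actual specifications.]

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s
FS = 16_000             # sample rate in Hz (assumed)
MIC_SPACING = 0.2       # metres between two mics (assumed)

def estimate_delay(sig_a, sig_b):
    """Estimate how many samples sig_a lags behind sig_b, via the
    peak of their cross-correlation: the core of time-difference-of-
    arrival sound localization."""
    corr = np.correlate(sig_a, sig_b, mode="full")
    return int(np.argmax(corr)) - (len(sig_b) - 1)

def angle_of_arrival(lag_samples):
    """Convert a sample delay into a bearing toward the source, for
    a far-field source and a two-element array."""
    tau = lag_samples / FS
    # clip to the physically valid range before arcsin
    s = np.clip(tau * SPEED_OF_SOUND / MIC_SPACING, -1.0, 1.0)
    return float(np.degrees(np.arcsin(s)))

# Synthetic demo: the same burst reaches mic B four samples after
# mic A, as if the speaker stood off to one side of the array.
rng = np.random.default_rng(0)
burst = rng.standard_normal(256)
mic_a = np.concatenate([burst, np.zeros(16)])
mic_b = np.concatenate([np.zeros(4), burst, np.zeros(12)])

lag = estimate_delay(mic_b, mic_a)
bearing = angle_of_arrival(lag)
```

With four microphones instead of two, several such pairwise delays can be intersected to localize the speaker in the room, and the array can then be steered to favor that direction over the explosions coming out of the speakers.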

Steven Cherry: You know, I’m struck by how much of what you describe is sort of finding the signal in the noise, almost like literally in the case of the voice recognition but then also in distinguishing between a random motion and an actual gesture in a game.

Alex Kipman: You’re absolutely correct.

Steven Cherry: I’m struck also by the similarity to what IBM did with the Watson computer that played “Jeopardy!” recently. In that case you could actually see on the screen the answer that it gave with a probability assigned, and then sort of the two leading alternatives and their probabilities as well. Does something like that happen behind the Kinect as well?

Alex Kipman: In a way what we’re trying to do is move technology from this world where we’ve had to understand it and deal with it in an arcane way, to a world where it kind of disappears and it kind of more fundamentally starts understanding us. Now for this to happen, you’re essentially, through different means, trying to make technology work similarly to how humans work. Our world, the real world, the nondigital world, the world that you and I live in, is an analog world. An analog world, to the point that you were making earlier, is a massive signal-to-noise space, meaning there is a lot of noise around us every single day. And what makes us human, what makes us intelligent, is that we have this amazing machine called the brain that’s able to focus on the signal at a given moment in time and kind of push away the noise. The way that we work is not in a digital binary way. It’s not about zeros and ones, or trues and falses, or cause and effect. Instead of black and white, it’s about the infinite shades of gray between. Instead of true and false, it’s about maybes. Instead of yes and no, it’s about probably. In this world, much like you’re alluding to, you don’t know anything. The way that you and I work is not because we’re certain but because we speak this very complex language of uncertainties. We are able to look at a situation and apply all of our previous learning to assign a probability distribution of likelihood, which allows us then to interact with the present. This is, I imagine, what the IBM folks did with Watson, and this is in fact what we have done with Kinect.
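[Editor’s note: The “language of uncertainties” Kipman describes is, at bottom, Bayes’ rule: combine a prior belief with noisy evidence and report a probability rather than a verdict. A toy sketch follows; the names and numbers are purely illustrative.]

```python
# Prior belief about who is standing in front of the sensor, learned
# from past sessions, fused with one noisy observation ("looks tall").
priors = {"Ann": 0.5, "Ben": 0.5}
likelihood = {"Ann": 0.2, "Ben": 0.8}  # P(observation | person)

# Bayes' rule: posterior ∝ prior × likelihood, normalized to sum to 1.
evidence = sum(priors[p] * likelihood[p] for p in priors)
posterior = {p: priors[p] * likelihood[p] / evidence for p in priors}

# The answer is a shaded one ("probably Ben"), not a yes/no verdict.
```

Each new observation can be folded in the same way, so confidence accumulates over time instead of the system ever committing to a brittle binary decision.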

Steven Cherry: I guess this is happening all over Microsoft—there’s the way that the new phone software interprets keystrokes, which sort of assigns probabilities to kind of guess what word you’re trying to type. Bing, the search engine, does the same thing when it tries to figure out sort of in what realm you’re trying to make a query and what are the best answers to that. Are we going to see more and more of this kind of probabilistic theory making its way into software, kind of making devices easier to use?

Alex Kipman: I hope so. The three examples you used are absolutely accurate. All three of those systems use machine-learning-based algorithms, which are probabilistic in nature, to make those features happen. And from a Microsoft perspective, I think we believe, I can certainly say that in Xbox we believe that that’s the way forward. That’s the way to make technology simpler, and that’s the way to create technology that users have an easier and more useful way of interacting with.

Steven Cherry: Very good. Maybe that’s a good place to leave it. Thanks a lot for joining us.

Alex Kipman: Thanks so much for having me.

Steven Cherry: We’ve been speaking with Alex Kipman, who was a key member of the development team for the Kinect, which this past winter shattered all product sales records by selling 10 million units in its first three months.

For IEEE Spectrum’s “This Week in Technology,” I’m Steven Cherry.

This interview was recorded 22 March 2011.
Audio engineer: Francesco Ferorelli
Follow us on Twitter @spectrumpodcast