This is a guest post. The views expressed in this article are solely those of the blogger and do not represent positions of Automaton, IEEE Spectrum, or the IEEE.
When Microsoft was developing its Kinect 3D sensor, a critical task was to calibrate its algorithms to rapidly and accurately recognize parts of the human body, especially hands, to make sure the device would work in any home, with any age group, any clothing, and any kind of background object. Using a computer-based approach to do the calibration had limitations, because computers would sometimes fail to identify a human hand in a Kinect-generated image, or would "see" a hand where none existed. So Microsoft is said to have turned to humans for help, crowdsourcing the image-tagging job using Amazon's Mechanical Turk, the online service where people get paid for performing relatively simple tasks that computers are not good at. As a result the Kinect now knows what all (or most) hands look like. Great!
Well, that's great if all you care about is gesture-based gaming, but from my commercial robotics-oriented perspective, the problem is that a human hand is just one "thing" among thousands -- millions?! -- out there that we would like machines to be able to identify. Imagine if a robot could promptly recognize any object in a home or office or factory: it would instantly know what anything it sees or picks up is. Now that would be great.
So the question is: Can we ever achieve that goal? Can we somehow automate or crowdsource image tagging of almost every object imaginable?
This type of data collection presents a chicken-and-egg problem: If you have a data set with objects properly tagged, you can start to build applications that rely on the "knowledge" stored in that set, and these applications in turn can generate more data, letting you refine the "knowledge" further. The problem is, you need a data set in the first place! Sometimes companies decide that there's a compelling value proposition in building such a set. That's what Microsoft did with the Kinect. Another example is Google's "voice actions," which let you search, email, and do other tasks using speech. Every time you say a word and your Android phone asks, "Did you mean…?" and gives you a list of words to select from, you're helping to improve Google's voice recognition system. Over time, the variation and nuances of different people's speech patterns are captured as voice data that the recognition algorithms can learn from. Speech-to-text would never be any good without this kind of continuous improvement.
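The mechanism behind that "Did you mean…?" prompt is worth making concrete: each confirmation is a free labeled example. Here is a deliberately toy sketch of the idea (the words and "acoustic signature" keys are invented; real recognizers update statistical acoustic and language models, not simple tallies):

```python
# Toy model of confirmation-driven improvement: each time a user picks the
# word they actually said from a "Did you mean...?" list, that choice becomes
# a labeled example that biases future suggestions.
from collections import defaultdict

class ToyRecognizer:
    def __init__(self):
        # counts[signature][word] = times users confirmed this pairing
        self.counts = defaultdict(lambda: defaultdict(int))

    def suggest(self, signature, candidates):
        # Rank candidates by how often users confirmed them for this
        # (hypothetical) acoustic signature; ties keep the original order.
        seen = self.counts[signature]
        return sorted(candidates, key=lambda w: -seen[w])

    def confirm(self, signature, word):
        # The user's selection is the new training data.
        self.counts[signature][word] += 1

rec = ToyRecognizer()
# Initially the recognizer has no preference among the candidates.
print(rec.suggest("reko", ["wreck a", "recog", "recognize"]))
# Users repeatedly confirm "recognize"; the ranking adapts.
for _ in range(3):
    rec.confirm("reko", "recognize")
print(rec.suggest("reko", ["wreck a", "recog", "recognize"]))
# After feedback, "recognize" is suggested first.
```

The same loop applies to robots: every human correction of a mislabeled object is training data.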
Now back to robotics. What I think the robotics community should pay more attention to is the importance of data. Many things in robotics, such as recognizing objects, require a large data set (emphasis on large) in order to become technically feasible, so this functionality lies less in the hands of pure research, roboticists, and algorithms, and more in the hands of current market trends in "tangential" technologies such as the Web and smartphones. So, in order to make robots "happen" one day, we need to keep an eye out for those technologies that have the potential of collecting lots of data for reasons other than robotics, and apply that data to robotics when the time is right.
And the type of data we need the most is 3D. So how do we collect 3D data for every possible object? Luckily, a large hacker community has formed around the Kinect sensor, and startups like MatterPort are enabling quick 3D rendering of objects just by capturing images with the Kinect from a few angles. The results are still crude, but as sensors and algorithms improve, you can imagine that "3D-fying" a scene will become as easy as snapping a picture of it. In fact, technologies like the Lytro and other "computational cameras" that capture both intensity and angle of light, allowing users to refocus already-snapped photos, could also help with the creation of 3D images. Here's a demo of a Kinect-based system from MatterPort:
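Under the hood, every one of these "3D-fying" pipelines starts with the same operation: back-projecting each depth pixel into a 3D point via the pinhole camera model. A minimal sketch (the focal lengths are typical Kinect-like values in pixels, and the principal point is assumed to sit at the image center; real calibration supplies all four numbers):

```python
# Back-project a Kinect-style depth image into a 3D point cloud using the
# pinhole camera model. fx/fy are illustrative Kinect-like focal lengths;
# a real pipeline would use calibrated intrinsics.

def depth_to_points(depth, fx=525.0, fy=525.0):
    """depth: 2D list of depth readings in meters (0 = no reading)."""
    cy = (len(depth) - 1) / 2.0     # assume principal point at image center
    cx = (len(depth[0]) - 1) / 2.0
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z > 0:               # skip pixels where the sensor saw nothing
                points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points

# A tiny 2x3 "depth image": five valid pixels and one dropout.
cloud = depth_to_points([[1.0, 1.2, 0.0],
                         [0.9, 1.1, 1.3]])
print(len(cloud))  # 5
```

Merging clouds like this from a few known viewpoints is, in essence, what the quick-scanning systems do.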
As I said before, roboticists alone can't do all the 3D scanning. The hope is that other technologies will drive this trend. So here's an idea: If online retailers saw value in showcasing detailed 3D models of objects for sale (instead of the usual 2D photos we see on most websites), and tagged those images with descriptions such as color, weight, and function, then thousands of objects could in principle be searchable by a robot. Google discussed a notion similar to this at the 2010 IEEE Humanoids conference and again at Google I/O last May. And maybe not only retailers would offer 3D scans, but consumers would too, realizing that adding 3D views would be a more effective way of selling stuff on eBay, for example.
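What might such robot-searchable listings look like? A minimal sketch, assuming each retailer publishes machine-readable tags alongside its 3D scans (every field name, value, and URL below is invented for illustration):

```python
# Hypothetical robot-searchable product records: each 3D scan ships with
# machine-readable tags. All field names, values, and URLs are invented.
catalog = [
    {"name": "dinner plate", "color": "white", "weight_kg": 0.4,
     "function": "tableware", "scan_url": "https://example.com/plate.obj"},
    {"name": "coffee mug", "color": "blue", "weight_kg": 0.3,
     "function": "tableware", "scan_url": "https://example.com/mug.obj"},
    {"name": "claw hammer", "color": "black", "weight_kg": 0.6,
     "function": "tool", "scan_url": "https://example.com/hammer.obj"},
]

def search(catalog, **constraints):
    """Return records matching every given tag, e.g. function='tableware'."""
    return [item for item in catalog
            if all(item.get(k) == v for k, v in constraints.items())]

tableware = search(catalog, function="tableware")
print([item["name"] for item in tableware])  # ['dinner plate', 'coffee mug']
```

A robot that just picked up an unknown object could query by whatever it can measure -- weight, color, rough shape -- and fetch candidate 3D models to match against.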
If this scenario becomes reality, then all of the 3D images could be aggregated into a robot-friendly database that bots would use as reference. A robot would take 3D sensor data of an object it is seeing and check whether it matches one or more of the reference images. Over time, and with feedback ("Yes, Rosie, this is a plate"), the robot's object-recognition capabilities would continually improve.
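That match-against-references step can be sketched concretely. The descriptor below -- a normalized histogram of pairwise point distances, which is invariant to rotation and translation -- is just one crude choice for illustration; real recognition systems use far richer 3D features:

```python
# Match an observed point cloud against a reference database using a simple
# rotation/translation-invariant shape descriptor: a normalized histogram
# of pairwise point distances. Real systems use much richer 3D features.
import math

def descriptor(points, bins=8, max_dist=2.0):
    hist = [0] * bins
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = math.dist(points[i], points[j])
            k = min(int(d / max_dist * bins), bins - 1)
            hist[k] += 1
    total = sum(hist) or 1
    return [h / total for h in hist]   # normalize so point count doesn't dominate

def recognize(observed, database):
    """Return the label whose reference descriptor is closest (L1 distance)."""
    desc = descriptor(observed)
    return min(database, key=lambda label: sum(
        abs(a - b) for a, b in zip(desc, database[label])))

# Reference "scans" (toy clouds): a flat plate-like grid and an elongated fork.
database = {
    "plate": descriptor([(x * 0.1, y * 0.1, 0.0)
                         for x in range(5) for y in range(5)]),
    "fork":  descriptor([(x * 0.1, 0.0, 0.0) for x in range(15)]),
}
# A new observation: flat, roughly plate-shaped, slightly different scale.
observation = [(x * 0.11, y * 0.09, 0.01) for x in range(5) for y in range(5)]
print(recognize(observation, database))  # plate
```

The feedback step from the paragraph above ("Yes, Rosie, this is a plate") would simply add the observed cloud's descriptor to the database under the confirmed label, sharpening future matches.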
So you want smarter robots? Then start demanding that online retailers offer 3D scans of their products -- and start creating your own scans. With this data set, robots will finally start to be able to recognize and understand our world of objects.