Covariant Uses Simple Robot and Gigantic Neural Net to Automate Warehouse Picking
A massive neural network connects cameras, a robot arm, and a suction gripper in Covariant’s logistics system
Two years ago, we wrote about an AI startup from UC Berkeley and OpenAI called Embodied Intelligence, founded by robot laundry-folding expert Pieter Abbeel. What exactly Embodied was going to do wasn’t entirely clear, and honestly, it seemed like Embodied itself didn’t really know—they talked about “building technology that enables existing robot hardware to handle a much wider range of tasks where existing solutions break down,” and gave some examples of how that might be applied (including in manufacturing and logistics), but nothing more concrete.
Since then, a few things have happened. Thing one is that Embodied is now Covariant.ai. Thing two is that Covariant.ai spent almost a year talking with literally hundreds of different companies about how smarter robots could potentially make a difference for them. These companies represent sectors that include electronics manufacturing, car manufacturing, textiles, bio labs, construction, farming, hotels, elder care—“pretty much anything you could think about where maybe a robot could be helpful,” Pieter Abbeel tells us. “Over time, it became clear to us that manufacturing and logistics are the two spaces where there’s most demand now, and logistics especially is just hurting really hard for more automation.” And the really hard part of logistics is what Covariant decided to tackle.
There’s already a huge amount of automation in logistics, but as Abbeel explains, in warehouses there are two separate categories that need automation: “The things that people do with their legs and the things that people do with their hands.” The leg automation has largely been taken care of over the last five or 10 years through a mixture of conveyor systems, mobile retrieval systems, Kiva-like mobile shelving, and other mobile robots. “The pressure now is on the hand part,” Abbeel says. “It’s about how to be more efficient with things that are done in warehouses with human hands.”
A huge chunk of human-hand tasks in warehouses comes down to picking. That is, taking products out of one box and putting them into another box. In the logistics industry, the boxes are usually called totes, and each individual kind of product is referred to by its stock keeping unit number, or SKU. Big warehouses can have anywhere from thousands to millions of SKUs, which poses an enormous challenge to automated systems. As a result, most existing automated picking systems in warehouses are fairly limited. Either they’re specifically designed to pick a particular class of things, or they have to be trained to recognize more or less every individual thing you want them to pick. Obviously, in warehouses with millions of different SKUs, traditional methods of recognizing or modeling specific objects are not only impractical in the short term, but would also be virtually impossible to scale.
This is why humans are still used in picking—we have the ability to generalize. We can look at an object and understand how to pick it up because we have a lifetime of experience with object recognition and manipulation. We’re incredibly good at it, and robots aren’t. “From the very beginning, our vision was to ultimately work on very general robotic manipulation tasks,” says Abbeel. “The way automation’s going to expand is going to be robots that are capable of seeing what’s around them, adapting to what’s around them, and learning things on the fly.”
Covariant is tackling this with relatively simple hardware, including an off-the-shelf industrial arm (which can be just about any arm), a suction gripper (more on that later), and a straightforward 2D camera system that doesn’t rely on lasers or pattern projection or anything like that. What couples the vision system to the suction gripper is one single (and very, very large) neural network, which is what helps Covariant to be cost effective for customers. “We can’t have specialized networks,” says Abbeel. “It has to be a single network able to handle any kind of SKU, any kind of picking station. In terms of being able to understand what’s happening and what’s the right thing to do, that’s all unified. We call it Covariant Brain, and it’s obviously not a human brain, but it’s the same notion that a single neural network can do it all.”
We can talk about the challenges of putting picking robots in warehouses all day, but the reason Covariant is making this announcement now is that its system has been up and running reliably and cost effectively in a real warehouse in Germany for the last four months.
This video is showing Covariant’s robotic picking system operating (for over an hour at 10x speed) in a warehouse that handles logistics for a company called Obeta, which overnights orders of electrical supplies to electricians in Germany. The robot’s job is to pick items from bulk storage totes, and add them to individual order boxes for shipping. The warehouse is managed by an automated logistics company called KNAPP, which is Covariant’s first partner. “We were searching a long time for the right partner,” says Peter Puchwein, vice president of innovation at KNAPP. “We looked at every solution out there. Covariant is the only one that’s ready for real production.” He explains that Covariant’s AI is able to detect glossy, shiny, and reflective products, including products in plastic bags. “The product range is nearly unlimited, and the robotic picking station has the same or better performance than humans.”
The key to being able to pick such a wide range of products so reliably, explains Abbeel, is being able to generalize. “Our system generalizes to items it’s never seen before. Being able to look at a scene and understand how to interact with individual items in a tote, including items it’s never seen before—humans can do this, and that’s essentially generalized intelligence,” he says. “This generalized understanding of what’s in a bin is really key to success. That’s the difference between a traditional system where you would catalog everything ahead of time and try to recognize everything in the catalog, versus fast-moving warehouses where you have many SKUs and they’re always changing. That’s the core of the intelligence that we’re building.”
To be sure, the details on how Covariant’s technology works are still vague, but we tried to extract some more specifics from Abbeel, particularly about the machine learning components. Here’s the rest of our conversation with him:
IEEE Spectrum: How was your system trained initially?
Pieter Abbeel: We would get a lot of data on what kind of SKUs our customer has, get similar SKUs in our headquarters, and just train, train, train on those SKUs. But it’s not just a matter of getting more data. Actually, often there’s a clear limit on a neural net where it’s saturating. Like, we give it more data and more data, but it’s not doing any better, so clearly the neural net doesn’t have the capacity to learn about these new missing pieces. And then the question is, what can we do to re-architect it to learn about this aspect or that aspect that it’s clearly missing out on?
You’ve done a lot of work on sim2real transfer—did you end up using a bajillion simulated arms in this training, or did you have to rely on real-world training?
We found that you need to use both. You need to work both in simulation and the real world to get things to work. And as you’re continually trying to improve your system, you need a whole different kind of testing: You need traditional software unit tests, but you also need to run things in simulation, you need to run it on a real robot, and you need to also be able to test it in the actual facility. It’s a lot more levels of testing when you’re dealing with real physical systems, and those tests require a lot of time and effort to put in place because you may think you’re improving something, but you have to make sure that it’s actually being improved.
What happens if you need to train your system for a totally new class of items?
The first thing we do is we just put new things in front of our robot and see what happens, and often it’ll just work. Our system has few-shot adaptation, meaning that on-the-fly, without us doing anything, when it doesn’t succeed it’ll update its understanding of the scene and try some new things. That makes it a lot more robust in many ways, because if anything noisy or weird happens, or there’s something a little bit new but not that new, you might do a second or third attempt and try some new things.
But of course, there are going to be scenarios where the SKU set is so different from anything it’s been trained on so far that some things are not going to work, and we’ll have to just collect a bunch of new data—what does the robot need to understand about these types of SKUs, how to approach them, how to pick them up. We can use imitation learning, or the robot can try on its own, because with suction, it’s actually not too hard to detect if a robot succeeds or fails. You can get a reward signal for reinforcement learning. But you don’t want to just use RL, because RL is notorious for taking a long time, so we bootstrap it off some imitation and then from there, RL can complete everything.
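To make the retry-and-reward idea concrete, here is a minimal sketch of our own, not Covariant's code. It assumes the vision model proposes a few scored grasp points and that a vacuum sensor reports whether the suction seal held; `GraspCandidate`, `pick_with_retries`, and the `seal_held` callback are hypothetical names. It also simplifies the adaptation Abbeel describes: instead of updating its understanding of the scene, it just moves on to the next-best candidate.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class GraspCandidate:
    """A proposed suction point in the tote (hypothetical representation)."""
    x: float
    y: float
    score: float  # model's estimate that the seal will hold (assumed)


@dataclass
class PickResult:
    success: bool
    attempts: int
    # (candidate, reward) pairs: reward is 1.0 if the seal held, else 0.0.
    # Pairs like these are the kind of signal reinforcement learning can use.
    experience: List[Tuple[GraspCandidate, float]]


def pick_with_retries(
    candidates: List[GraspCandidate],
    seal_held: Callable[[GraspCandidate], bool],
    max_attempts: int = 3,
) -> PickResult:
    """Try the highest-scoring candidates in order until one succeeds."""
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    experience: List[Tuple[GraspCandidate, float]] = []
    for candidate in ranked[:max_attempts]:
        reward = 1.0 if seal_held(candidate) else 0.0
        experience.append((candidate, reward))
        if reward == 1.0:
            return PickResult(True, len(experience), experience)
    # Every attempt failed: this is the "true failure" case that needs a human.
    return PickResult(False, len(experience), experience)


# Simulated run: the sensor stub pretends only points with x > 0.3 seal.
demo_candidates = [
    GraspCandidate(0.1, 0.2, 0.9),
    GraspCandidate(0.4, 0.5, 0.7),
    GraspCandidate(0.8, 0.3, 0.4),
]
demo = pick_with_retries(demo_candidates, seal_held=lambda c: c.x > 0.3)
print(demo.success, demo.attempts)
```

In this toy run the top-scored point fails and the second one succeeds, so the pick costs one extra attempt rather than a human intervention.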
Why did you choose a suction gripper?
What’s currently deployed is the suction gripper, because we knew it was going to do the job in this deployment, but if you think about it from a technological point of view, we also actually have a single neural net that uses different grippers. I can’t say exactly how it’s done, but at a high level, your robot is going to take an action based on visual input, but also based on the gripper that’s attached to it, and you can also represent a gripper visually in some way, like a pattern of where the suction cups are. And so, we can condition a single neural network on both what it sees and the end-effector it has available. This makes it possible to hot-swap grippers if you want to. You lose some time, so you don’t want to swap too often, but you could swap between a suction gripper and a parallel gripper, because the same neural network can use different gripping strategies.
And I would say this is a very common thread in everything we do. We really wanted to be a single, general system that can share all its learnings across different modalities, whether it’s SKUs, end of arm tools, different bins you pick from, or other things that might be different. The expertise should all be sharable.
And one single neural net is versatile enough for this?
People often say neural networks are just black boxes and if you’re doing something new you have to start from scratch. That’s not really true. I don’t think what’s important about neural nets is that they’re black boxes—that’s not really where their strength comes from. Their strength comes from the fact that you can train end-to-end, you can train from input to the desired output. And you can put modular things in there, like neural nets that are an architecture that’s well suited to visual information, versus end-effector information, and then they can merge their information loads to come to a conclusion. And the beauty is that you can train it all together, no problem.
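As a rough illustration of that modular idea (our sketch, not Covariant's architecture, which has not been published), the toy network below has one encoder for image features, one for a gripper descriptor, and a shared head that scores a pick from the merged codes. Everything here is an assumption: the class name `TwoBranchNet`, the layer sizes, and the random, untrained weights. It shows the forward pass only. Training end to end would mean updating all three weight matrices from a single loss.

```python
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Small random weights; a real system would learn these."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def linear(weights: Matrix, x: Vector) -> Vector:
    """Matrix-vector product, with a dimension check."""
    if any(len(row) != len(x) for row in weights):
        raise ValueError("input size does not match weight matrix")
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


class TwoBranchNet:
    """One network, two input branches, one shared output head."""

    def __init__(self, image_dim: int, gripper_dim: int,
                 hidden_dim: int = 4, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.image_encoder = random_matrix(hidden_dim, image_dim, rng)
        self.gripper_encoder = random_matrix(hidden_dim, gripper_dim, rng)
        self.head = random_matrix(1, 2 * hidden_dim, rng)

    def forward(self, image_features: Vector, gripper_pattern: Vector) -> float:
        image_code = relu(linear(self.image_encoder, image_features))
        gripper_code = relu(linear(self.gripper_encoder, gripper_pattern))
        merged = image_code + gripper_code  # concatenate the two codes
        return linear(self.head, merged)[0]  # single pick score


# The same scene scored under two hypothetical gripper descriptors.
net = TwoBranchNet(image_dim=6, gripper_dim=3)
scene = [0.2, 0.9, 0.1, 0.4, 0.7, 0.3]
suction = [1.0, 0.0, 1.0]   # e.g. a pattern of active suction cups (assumed)
parallel = [0.0, 1.0, 0.0]  # e.g. a two-finger gripper code (assumed)
print(net.forward(scene, suction), net.forward(scene, parallel))
```

Because the gripper descriptor is just another input, swapping end-effectors changes the input vector rather than the network, which is the property Abbeel describes.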
When your system fails at a pick, what are the consequences?
Here’s where things get very interesting. You think about bringing AI into the physical world—AI has been very successful already in the digital world, but the digital world is much more forgiving. There’s a long tail of scenarios that you could encounter in the real world and you haven’t trained against them, or you haven’t hardcoded against them. And that’s what makes it so hard and why you need really good generalization including few-shot adaptation and so forth.
Now let’s say you want a system to create value. For a robot in a warehouse, does it need to be 100 percent successful? No, it doesn’t. If, say, it takes a few attempts to pick something, that’s just a slowdown. It’s really the overall successful picks per hour that matter, not how often you have to try to get those picks. And so if periodically it has to try twice, it’s really the picking rate that’s affected, not the success rate that’s affected. A true failure is one where human intervention is needed.
With true failures, where after repeated attempts the robot just can’t pick an item, we’ll get notified by that and we can then train on it, and the next day it might work, but at that moment it doesn’t work. And even if a robotic deployment works 90 percent of the time, that’s not good enough. A human picking station can range from 300 to 2000 picks per hour. 2000 is really rare and is peak pick for a very short amount of time, so if we look at the bottom of that range, 300 picks per hour, if we’re succeeding 90 percent, that means 30 failures per hour. Wow, that’s bad. At 30 fails per hour, fixing those up by a human probably takes more than an hour’s worth of work. So what you’ve done now is you’ve created more work than you save, so 90 percent is definitely a no go.
At 99 percent that’s 3 failures per hour. If it usually takes a couple of minutes for a human to fix, at that point, a human could oversee 10 stations easily, and that’s where all of a sudden we’re creating value. Or a human could do another job, and just keep an eye on the station and jump in for a moment to make sure it keeps running. If you had a 1000 per hour station, you’d need closer to 99.9 percent to get there and so forth, but that’s essentially the calculus we’ve been doing. And that’s what you realize how any extra nine you want to get is so much more challenging than the previous nine you’ve already achieved.
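The arithmetic in this answer can be checked with a few lines of Python. This is a back-of-the-envelope sketch of ours: the function names are made up, and the two-minute fix time is an assumption taken from Abbeel's "a couple of minutes."

```python
def failures_per_hour(picks_per_hour: float, success_rate: float) -> float:
    """Expected picks per hour that end in a true failure (human needed)."""
    return picks_per_hour * (1.0 - success_rate)


def human_minutes_per_hour(picks_per_hour: float, success_rate: float,
                           minutes_per_fix: float = 2.0) -> float:
    """Human time spent fixing failures at one station, per hour of operation.

    minutes_per_fix = 2.0 is an assumed value, not a measured one.
    """
    return failures_per_hour(picks_per_hour, success_rate) * minutes_per_fix


def stations_per_human(picks_per_hour: float, success_rate: float,
                       minutes_per_fix: float = 2.0) -> float:
    """How many stations one person could cover if fixing were their only job."""
    minutes = human_minutes_per_hour(picks_per_hour, success_rate, minutes_per_fix)
    if minutes == 0:
        return float("inf")
    return 60.0 / minutes


# The three cases mentioned in the interview.
for picks, rate in [(300, 0.90), (300, 0.99), (1000, 0.999)]:
    print(f"{picks} picks/h at {rate:.1%}: "
          f"{failures_per_hour(picks, rate):.1f} failures/h, "
          f"{stations_per_human(picks, rate):.1f} stations per human")
```

Under these assumptions, 90 percent at 300 picks per hour keeps one person busy for the whole hour at a single station, while 99 percent frees that person to cover about 10 stations, which matches the figures Abbeel gives.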
Other companies are developing similar approaches to picking: industrial arms, vision systems, suction grippers, neural networks. What makes Covariant’s system work better?
I think it’s a combination of things. First of all, we want to bring to bear any kind of learning—imitation learning, supervised learning, reinforcement learning, all the different kinds of learning you can. And you also want to be smart about how you collect data—what data you collect, what processes you have in place to get the data that you need to improve the system. Then related to that, sometimes it’s not just a matter of data anymore, it’s a matter of, you need to re-architect your neural net. A lot of deep learning progress is made that way, where you come up with new architectures and the new architecture allows you to learn something that otherwise would maybe not be possible to learn. I mean, it’s really all of those things brought together that are giving the results that we’re seeing. So it’s not really like any one that can be singled out as “this is the thing.”
Also, it’s just a really hard problem. If you look at the amount of AI research that was needed to make this work... We started with four people, and we have 40 people now. About half of us are AI researchers, we have some world-leading AI researchers, and I think that’s what’s made the difference. I mean, I know that’s what’s made the difference.
So it’s not like you’ve developed some sort of crazy new technology or something?
There’s no hardware trick. And we’re not doing, I don’t know, fuzzy logic or something else out of left field all of a sudden. It’s really about the AI stuff that processes everything—underneath it all is a gigantic neural network.
Okay, then how the heck are you actually making this work?
If you have an extremely uniquely qualified team and you’ve picked the right problem to work on, you can do something that is quite out there compared to what has otherwise been possible. In academic research, people write a paper, and everybody else catches up the moment the paper comes out. We’ve not been doing that—so far we haven’t shared the details of what we actually did to make our system work, because right now we have a technology advantage. I think there will be a day when we will be sharing some of these things, but it’s not going to be anytime soon.
It probably won’t surprise you that Covariant has been able to lock down plenty of funding (US $27 million so far), but what’s more interesting is some of the individual investors who are now involved with Covariant, which include Geoff Hinton, Fei-Fei Li, Yann LeCun, Raquel Urtasun, Anca Dragan, Michael I. Jordan, Vlad Mnih, Daniela Rus, Dawn Song, and Jeff Dean.
While we’re expecting to see more deployments of Covariant’s software in picking applications, it’s also worth mentioning that their press release is much more general about how their AI could be used:
The Covariant Brain [is] universal AI for robots that can be applied to any use case or customer environment. Covariant robots learn general abilities such as robust 3D perception, physical affordances of objects, few-shot learning and real-time motion planning, which enables them to quickly learn to manipulate objects without being told what to do.
Today, [our] robots are all in logistics, but there is nothing in our architecture that limits it to logistics. In the future we look forward to further building out the Covariant Brain to power ever more robots in industrial-scale settings, including manufacturing, agriculture, hospitality, commercial kitchens and eventually, people’s homes.
Fundamentally, Covariant is attempting to connect sensing with manipulation using a neural network in a way that can potentially be applied to almost anything. Logistics is the obvious first application, since the value there is huge, and even though the ability to generalize is important, there are still plenty of robot-friendly constraints on the task and the environment as well as safe and low-impact ways to fail. As to whether this technology will effectively translate into the kinds of semi-structured and unstructured environments that have historically posed such a challenge for general purpose manipulation (notably, people’s homes)—as much as we love speculating, it’s probably too early even for that.
What we can say for certain is that Covariant’s approach looks promising both in its present implementation and its future potential, and we’re excited to see where they take it from here.
[ Covariant.ai ]