Among all of the self-driving startups working toward Level 4 autonomy (a self-driving system that doesn’t require human intervention in most scenarios), Mountain View, Calif.-based Drive.ai’s scalable deep-learning approach and aggressive pace make it unique. Drive sees deep learning as the only viable way to make a truly useful autonomous car in the near term, says Sameep Tandon, cofounder and CEO. “If you look at the long-term possibilities of these algorithms and how people are going to build [self-driving cars] in the future, having a learning system just makes the most sense. There’s so much complication in driving, there are so many things that are nuanced and hard, that if you have to do this in ways that aren’t learned, then you’re never going to get these cars out there.”
It’s only been about a year since Drive went public, but already, the company has a fleet of four vehicles navigating (mostly) autonomously around the San Francisco Bay Area—even in situations (such as darkness, rain, or hail) that are notoriously difficult for self-driving cars. Last month, we went out to California to take a ride in one of Drive’s cars, and to find out how it’s using deep learning to master autonomous driving.
As its name suggests, Drive.ai was born for this. It was founded in 2015 by deep-learning experts from Stanford University’s Artificial Intelligence Laboratory. By structuring its approach to autonomous driving entirely around deep learning from the very beginning, Drive has been able to rapidly and adaptably scale to safely handle the myriad driving situations that autonomous cars need to master.
“I think this is the first time autonomous driving has been approached so strongly from a deep-learning perspective,” says Tandon. “This is in contrast to a traditional robotics approach,” continues Carol Reiley, cofounder and president. “A lot of companies are just using deep learning for this component or that component, while we view it more holistically,” says Reiley.
The most common implementation of the piecemeal approach to which they’re referring is the use of deep learning solely for perception. This form of artificial intelligence is good for, say, recognizing pedestrians in a camera image because it excels at classifying things within an arbitrary scene. What’s more, it can, after having learned to recognize a particular pattern, extend that capability to objects that it hasn’t actually seen before. In other words, you don’t have to train it on every single pedestrian that could possibly exist for it to be able to identify a kind old lady with a walker and a kid wearing a baseball cap as part of the same group of objects.
While a pedestrian in a camera image is a perceptual pattern, there are also patterns in decision making and motion planning (the right behavior at a four-way stop, or when turning right on red, to name two examples) to which deep learning can be applied. But that’s where most self-driving carmakers draw the line. Why? These are the kinds of variable, situation-dependent decisions that deep-learning algorithms are better suited to making than the traditional, rules-based approach with which those carmakers feel more comfortable, Reiley and Tandon tell us. Though deep learning’s “humanlike” pattern recognition leads to more nuanced behavior than you can expect from a rules-based system, sometimes this can get you into trouble.
The Black Box
A deep-learning system’s ability to recognize patterns is a powerful tool, but because this pattern recognition occurs as part of algorithms running on neural networks, a major concern is that the system is a “black box.” Once the system is trained, data can be fed to it and a useful interpretation of those data will come out. But the actual decision-making process that goes on between the input and output stages is not necessarily something that a human can intuitively understand. This is why many companies working on vehicle autonomy are more comfortable with using traditional robotics approaches for decision making, and restrict deep learning to perception. They reason that if your system makes an incorrect decision, you’d want to be able to figure out exactly what happened and then make sure that the mistake won’t be repeated.
“This is a big problem,” Tandon acknowledges. “What we want to be able to do is to train deep-learning systems to help us with the perception and the decision making but also incorporate some rules and some human knowledge to make sure that it’s safe.” While a fully realized deep-learning system would use a massive black box to ingest raw sensor data and translate that into, say, a turn of the steering wheel or activation of the accelerator or brakes, Drive has intentionally avoided implementing a complete end-to-end system like that, Tandon says. “If you break it down into parts where you’re using deep learning, and you understand that different parts can be validated in different ways, then you can be a lot more confident in how the system will behave.”
There are a few tricks that can be used to peek into the black box a little bit, and then validate (or adjust) what goes on inside of it, say the Drive researchers. For example, you can feed very specific data in, like a camera image with almost everything blacked out except the thing you want to query, and then see how your algorithm reacts to slightly different variations of that thing. Simulation can also be a very helpful tool when dealing with specific situations that an algorithm is having difficulty with, as Tandon explains:
When we first started working on deep-learning perception systems, one of the challenges we had was with overpasses. We’d go out and drive, and the shadow caused by an overpass would cause the system to register a false positive of an obstacle. In the learning process, you can focus the algorithm on challenging scenarios in a process called hard mining. We then augment the data set with synthetic examples, and with that, what you do is say, “Hey, system, tell me what you’re going to do on this overpass, and then I’m going to jitter it a little bit, and you’re going to do it again.” Over time, the system starts to work around overpasses, and then you can validate it on a systematic level.
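The two probing tricks described above, blacking out everything except the object being queried and re-asking the system after small jitters, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Drive's tooling: `mask_except`, `jitter`, `prediction_spread`, and the stand-in `toy_model` are hypothetical names, and images are assumed to be 2D NumPy arrays with values between 0 and 1.

```python
import numpy as np

def mask_except(image, box):
    """Black out everything except the region box = (top, left, bottom, right)."""
    top, left, bottom, right = box
    masked = np.zeros_like(image)
    masked[top:bottom, left:right] = image[top:bottom, left:right]
    return masked

def jitter(image, rng, max_shift=2, brightness=0.1):
    """Return a copy of image shifted by a few pixels and slightly rescaled in brightness."""
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    shifted = np.roll(image, shift=(int(dy), int(dx)), axis=(0, 1))
    scale = 1.0 + rng.uniform(-brightness, brightness)
    return np.clip(shifted * scale, 0.0, 1.0)

def prediction_spread(model, image, rng, n_variants=20):
    """Standard deviation of the model's output over jittered copies of image."""
    scores = [model(jitter(image, rng)) for _ in range(n_variants)]
    return float(np.std(scores))

def toy_model(image):
    """Hypothetical stand-in for a trained detector: the fraction of dark pixels."""
    return float(np.mean(image < 0.3))
```

A large spread on a scene such as an overpass shadow would suggest the model's answer is fragile there, which makes that scene a candidate for the hard-mining step Tandon mentions. The threshold for "large" is a judgment call that this sketch leaves open.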
Training the System
Deep-learning systems thrive on data. The more data an algorithm sees, the better it’ll be able to recognize, and generalize about, the patterns it needs to understand in order to drive safely. For autonomous cars, which need to be able to understand a vast array of different situations, the default approach taken by most companies working on autonomous driving is to just collect as much data as possible. The issue, then, becomes managing the data and then doing something useful with it. Drive.ai keeps in mind that not all data is created equal. The company puts an immense amount of effort into collecting high-quality data and then annotating it so that it’s useful for training deep-learning algorithms.
Before cars can drive themselves, the laborious task of annotating the objects in every scene a self-driving car’s sensors capture must be completed. This data is what feeds deep-learning or rules-based algorithms. Image: Drive.ai
Annotation, while very simple, is also very tedious: A human annotator is presented with a data set, perhaps a short video clip or even just a few frames of video or lidar data, and tasked with drawing and labeling boxes around every car, pedestrian, road sign, traffic light, or anything else that might possibly be relevant to an autonomous driving algorithm. “We’ve learned that certain companies have a large army of people annotating,” says Reiley. “Thousands of people labeling boxes around things. For every one hour driven, it’s approximately 800 human hours to label. These teams will all struggle. We’re already magnitudes faster, and we’re constantly optimizing.”
How is that possible? Drive has figured out how to use deep-learning-enhanced automation for annotating data. So, it has a small band of human annotators, most of whom are kept busy training brand new scenarios or validating the annotation that the system does on its own. “There are some scenarios where our deep-learning system is working very well, and we just need to have a validation step,” Tandon explains. “And there are some scenarios where we’re improving the algorithm and we need to bootstrap it the right way, so we have a team of human annotators do the first iteration, and we iteratively improve the deep-learning system. Already in many cases, our deep-learning systems perform better than our expert annotators.” Reiley adds: “Think about how mind-blowing that is.”
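One way to picture the split Tandon describes, where some scenarios need only a validation step and others need a full first pass by humans, is a confidence-based router. The sketch below is an assumption for illustration and not Drive's pipeline: the `Proposal` record, the `route_proposals` function, and the 0.9 threshold are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Proposal:
    """One model-proposed label for one object in one frame (hypothetical record)."""
    frame_id: int
    label: str
    confidence: float  # model's score in [0, 1]

def route_proposals(proposals, validate_threshold=0.9):
    """Split frames into a quick-validation queue and a full-annotation queue.

    A frame goes to quick validation only if every proposed label in it
    meets the threshold; otherwise a human annotates it from scratch.
    """
    by_frame = {}
    for p in proposals:
        by_frame.setdefault(p.frame_id, []).append(p)

    validate, annotate = [], []
    for frame_id, items in sorted(by_frame.items()):
        if min(p.confidence for p in items) >= validate_threshold:
            validate.append(frame_id)
        else:
            annotate.append(frame_id)
    return validate, annotate
```

Routing by the least confident label in a frame is a deliberately conservative choice here: a single doubtful box sends the whole frame to a human. Labels corrected by humans could then be fed back as training data, which matches the iterative improvement described in the quote.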
It’s difficult for the Drive team to articulate exactly what prevents other companies from building their own deep-learning infrastructure and tools and doing the same kind of deep-learning-based annotation and training. “We talk about this quite often: What’s to stop someone else from doing exactly what we’re doing?” says Tandon. “To be honest, there are just so many parts to the problem. It’s such an integrated system; there’s just so many components to get right throughout the entire stack, that it becomes hard to say there’s one specific reason why this works well.”
Reiley agrees: “Your decisions have to be software driven and optimized for deep learning, for software and hardware integration. Everyone focuses only on the algorithm portion of it, but we have these other applications that all need to come together. Autonomous driving is much more than just an algorithm; it’s a very complicated hardware-software problem that nobody has solved before.”
Sensors in the Rain
The hardware in Drive’s four-car fleet is designed to be retrofitted onto most vehicles with a minimum of hassle and is concentrated in an array of sensors, including cameras and lidar, located on the roof. The system also takes advantage of the car’s own integrated sensors, like radar (used for adaptive cruise control) and rear cameras. There’s also a big display that Drive eventually plans to use for communicating with human drivers and pedestrians; we go into that in more detail in a separate article.
With a suite of nine HD cameras, two radars, and six Velodyne Puck lidar sensors, each of Drive’s vehicles is continuously capturing data for map generation, for feeding into deep-learning algorithms, and of course for the driving task itself. The current sensor loadout is complex and expensive, but as Drive cofounder Joel Pazhayampallil explains, it’s almost certainly overkill, and will be reduced when Drive moves into pilot programs. “I think we’ll need a significantly smaller subset, probably half what we have right now, if that,” Pazhayampallil says. “Our algorithms are constantly improving. We’re constantly getting more and more out of each individual sensor by combining data from the different sensors together. We get some low-resolution depth data from the lidar and really high-resolution context information from the camera.”
This kind of multimodal redundancy and decision making through deep learning based on fused sensor data has advantages in an autonomous vehicle context. Namely, it offers some protection against sensor failure, since the deep-learning algorithms can be trained explicitly on perception data with missing sensor modalities. Deep learning has a significant advantage over rules-based approaches here, since rules conflicts can lead to failures that can be, according to Pazhayampallil, “catastrophic.” And sensor failure is most often not a hardware or software issue but rather a sensor that isn’t producing good data for some reason, like sun glare, darkness at night, or (more commonly) being occluded by water.
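Training "explicitly on perception data with missing sensor modalities" is often approximated by randomly blanking out sensors during training, sometimes called modality dropout. The following is a minimal sketch of that general idea and not a description of Drive's training code: the function name `drop_modalities`, the dictionary layout, and the 0.3 drop probability are assumptions.

```python
import numpy as np

def drop_modalities(sample, rng, drop_prob=0.3):
    """Zero out each sensor's features with probability drop_prob.

    sample maps a sensor name (e.g. "camera", "lidar", "radar") to a
    NumPy feature array. At least one sensor is always kept. Returns the
    augmented sample and a dict recording which sensors were kept.
    """
    names = sorted(sample)
    keep = {name: bool(rng.random() >= drop_prob) for name in names}
    if not any(keep.values()):
        # Never hand the model a sample with every sensor missing.
        keep[names[int(rng.integers(len(names)))]] = True
    augmented = {
        name: sample[name] if keep[name] else np.zeros_like(sample[name])
        for name in names
    }
    return augmented, keep
```

Applied to each training sample, this exposes the network to cases such as "camera blinded, lidar and radar fine" many times before it meets a rain-covered lens on the road. Passing the `keep` flags to the network as an extra input is one optional design choice that lets it distinguish a dead sensor from an empty scene.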
The reason that driving in the rain is a challenge for autonomous cars isn’t just that water absorbs the lidar energy, or that surfaces turn reflective. You can’t really tell in the above video, but Drive showed us feeds from the car’s roof cameras, which had big drops of water on the lenses that rendered them mostly useless. “If you’re driving in harder, more nuanced situations, you need to be able to handle camera failures, lidar failures, radar failures, whatever happens in the real world,” Tandon says.
On the Road
We took our demo ride with Tory Smith, Drive’s technical program manager. Sadly, we managed to miss all of the horrible weather that California has been having over the past month, and went out on the sort of dry and sunny (but not too sunny) day that autonomous cars love. Drive is targeting Level 4 autonomy. But, in accordance with current laws, the company is testing its vehicles under Level 2 autonomy. At Level 2, a human driver must be in the driver’s seat, ready to take over at any time, although the car is expected to do most (if not all) of the driving by itself. While Smith and I talked about what the car was doing, Leilani Abenojar, one of Drive.ai’s autonomous-vehicle operators, stayed focused on the road.
The 20-minute ride through a premapped area of suburban Mountain View, Calif., featured 16 intersections and a four-way stop. In general, the car performed smoothly and competently, but it was less assertive than an average human driver would be. This is intentional, says Smith: “We have to design for an acceptable envelope of performance, and it’s very difficult to have a vehicle that’s assertive but [doesn’t cross the line into being so] assertive that our guests or drivers are uncomfortable. We always operate with an abundance of caution, and would always rather that you be bored than be uncomfortable.”
For our demo ride, that aim was achieved. Realistically, a boring experience is exactly what you want from a safe and trustworthy autonomous car. There was one exception, where our safety driver had to disengage the autonomous system and take manual control. And although we were never at risk of any sort of accident, it’s these exceptions that provide the best perspective on the current state of Drive’s autonomy.
We were stopped at a red light, waiting to make a right turn onto a main road. Right turns on red are legal in California, but Drive doesn’t usually try to make them autonomously because of sensor limitations, explains Smith. “Our lidar sensors can only see about 50 to 75 meters reliably, and on a road like this, where you can have cross traffic at 45 or 50 miles per hour [72 or 80 km/h], you can’t yet with sufficient confidence detect cross traffic and know exactly what lane it’s going to be in.” When the light turned green, the car made the turn into the rightmost lane (which is legally how a right turn is supposed to be made). But there was a truck pulled over on the side of the road, blocking that lane, so our safety driver briefly took over, steered around the truck, and then reengaged the autonomous system.
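Smith's numbers can be turned into a rough time budget: how long after a cross-traffic vehicle first enters reliable lidar range does it reach the intersection? The short calculation below uses only the figures in the quote (50 to 75 meters, 45 to 50 miles per hour) and assumes constant speed; the function name is hypothetical.

```python
MPH_TO_MPS = 1609.344 / 3600.0  # one mile per hour in meters per second

def seconds_until_arrival(range_m, speed_mph):
    """Time for a vehicle at constant speed_mph to cover range_m meters."""
    return range_m / (speed_mph * MPH_TO_MPS)

worst_case = seconds_until_arrival(50, 50)  # shortest range, fastest traffic
best_case = seconds_until_arrival(75, 45)   # longest range, slowest traffic
print(f"{worst_case:.1f} s to {best_case:.1f} s")
```

Under these assumptions the car would have somewhere between roughly 2.2 and 3.7 seconds from first reliable detection to the other vehicle's arrival, which has to cover deciding, pulling out, and getting up to speed.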
“Nominally, we would have waited for the truck to get out of the way,” Smith told me. “In terms of path planning, having the vehicle compensate for obstructions on the fly like that is a place where we’re currently building in more capability.” This situation is more than just a path planning problem, though: Waiting for the truck to move is the right call, if the truck is active. If it’s not, you’d want to go around. Put yourself in the car’s position: How do you know if a stopped truck is likely to move again? Maybe you can tell whether the engine is running, or notice indicator lights flashing, or identify some activity around the truck. These are all things that a human can do almost immediately (and largely subconsciously), but an autonomous vehicle needs to be explicitly trained on what to look for and how to react.
“Humans aren’t necessarily perfect at doing very precise things,” says Smith, “but they’re great at improvising and dealing with ambiguity, and that’s where the traditional robotics approach breaks down, is when you have ambiguous situations like the one we just saw. The nice thing about developing a system in a deep-learning framework is when you encounter difficult situations like that, we just have to collect the data, annotate the data, and then build that module into our deep-learning brain to allow the system to be able to compensate for that in the future. It’s a much more intuitive way to solve a problem like that than the rules-based approach, where you’d basically have to anticipate everything that could ever happen.”
Disengagements like these are where most of Drive’s most valuable learning happens. It’s actively looking for test routes that include challenges, leading the company to do a significant amount of testing in the San Francisco Bay Area, which can be stressful even for human drivers. Drive’s modus operandi: Once its vehicles can reliably navigate a route without getting disengagements, the team will pick a new route with different challenges. Because it can annotate and train on new data so efficiently, Drive expects that its speed of adaptability will enable it to conquer new terrain very rapidly.
At the end of our demo drive, Smith asked me if I thought the car’s driving was more “humanlike” than other autonomous car demos I’d had in the past. Describing any robot as doing anything “humanlike” can get you into all sorts of trouble, so I asked Smith to explain what it means, from a technical standpoint, for an autonomous car to exhibit humanlike behavior or intelligence.
Smith gave me an example of how Drive does traffic light detection. The prevailing methodology for detecting traffic lights is to map out every intersection that your car will drive through in enough detail that you can tell your cameras exactly where to look. For a closed course, you can use this kind of brute-force approach, but it breaks down if you try to scale it up even to the level of a city. Drive instead has collected more generalized data on traffic lights, annotating what a traffic light looks like in different intersections, from different angles, in daytime, in nighttime, and during rain and snow and fog. These annotated data are fed into Drive’s deep-learning algorithms, and the system learns to recognize the general concept of a traffic light, much in the same way that humans do, as Smith explains:
The other thing about deep learning that’s really nice is that we get to use the context of the entire scene, not just the lights themselves. As a human, for example, there are situations where you start to go at a green light without necessarily looking at the light itself, because you’re looking at what everyone else is doing. Our annotation tool and human annotators take that into account: Maybe they can’t see all the lights, but they see that the cars next to them are all moving, so it’s probably green. That system has actually accrued enough humanlike intelligence that our traffic light detector is now more accurate than a human, which is really exciting. And we can expand that same intelligence to other aspects of our deep learning brain.
Out in the World
For better or worse, Drive’s ability to maintain its aggressive, um, drive toward Level 4 autonomy (vehicles that operate in specific areas and conditions without a safety driver) likely depends, in the short term, on state and federal regulations. But Sameep Tandon, at least, is highly optimistic about the trajectory of Drive’s autonomous technology. “We’re in the process this year of doing pilots with some of our customers. In the next six months, we’re hoping to deploy this—not at a huge scale, but at least so that more people can use it. I believe that in geo-fenced areas, we could have a solution without a safety driver operating in the next year or two. Ultimately, the question will be how quickly can we go from the Bay Area to the next city and the one after that.”
Drive’s plan is to focus initially on logistics: businesses that repeatedly deliver things across small areas, as opposed to something like ride sharing. This gives Drive a well-constrained problem as well as a smooth expansion path, and avoids having to deal with human passengers, at least initially. Beyond that, Tandon is excited about the future: “I think if you have a combination of really good strategy and really good technology, this could be one of the first uses of robots in the real world. That, to me, is super exciting; I’d love to see robots everywhere, and self-driving cars are probably going to be the first that everyday people interact with.”
Evan Ackerman is a senior editor at IEEE Spectrum. Since 2007, he has written over 6,000 articles on robotics and technology. He has a degree in Martian geology and is excellent at playing bagpipes.