Drones armed with computer vision software could enable new forms of automated skyborne surveillance to watch for violence below. One glimpse of that future comes from UK and Indian researchers who demonstrated a drone surveillance system that can automatically detect small groups of people fighting each other.
The seed idea for researchers to develop such a drone surveillance system was first planted in the wake of the Boston Marathon bombing that killed three and injured hundreds in 2013. That first attempt petered out. It was not until the Manchester Arena bombing, which killed 23 and wounded 139, including many children leaving an Ariana Grande concert, that the researchers made some progress. This time, they harnessed a form of the popular artificial intelligence technique known as deep learning.
“This time we were able to do a relatively better job, because the software was able to run in realtime and does a relatively good job of detecting violent individuals,” says Amarjot Singh, a Ph.D. student in deep learning at the University of Cambridge.
The drone surveillance system developed by Singh and his colleagues remains far from ready for primetime. But their work demonstrates one possibility of combining deep learning’s pattern-recognition capabilities with relatively inexpensive commercial drones and the growing availability of cloud computing services. More details appear in a 3 June 2018 paper that was uploaded to the preprint server arXiv and will appear in the IEEE Computer Vision and Pattern Recognition (CVPR) Workshops 2018.
A key part of this demonstration involved training deep learning algorithms to recognize violent actions by detecting various combinations of body and limb poses in video footage. To create a training dataset, researchers enlisted 25 interns to gather in an open area and mimic violent actions in five categories (punching, kicking, strangling, stabbing and shooting) while being filmed by a Parrot AR drone from heights ranging from 2 meters to 8 meters.
But that wasn’t all. The research team also needed to sit down and manually mark 18 coordinates on each person’s body in the video frames. That would have quickly become a labor-intensive and exhausting process for the 10,000 or 20,000 images normally needed to train deep learning algorithms. The researchers wanted to cut down the amount of necessary training data to just 2,000 annotated images that included about 5,000 individuals performing violent actions.
An unsupervised deep learning neural network automatically learns patterns over time by filtering data through its many layers of artificial neurons from end to end, a process that can yield good predictive accuracy if you have enough computing resources and training data on your hands. Singh’s workaround came from his University of Cambridge research, which has focused on more streamlined and efficient forms of deep learning capable of running with fewer computing resources and less training data.
Singh replaced some of the first neural network layers at the front-end with fixed parameters and used supervised learning toward the back-end. This move effectively replaced some of the deep learning process with human engineering input based on what Singh, the human designer, thought would work best for training the neural network to recognize different human body poses. That could mean a possible tradeoff in overall accuracy, but it enabled the resulting ScatterNet Hybrid Deep Learning (SHDL) network to learn more quickly with less data and less available computing power.
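The idea of a fixed front-end feeding a supervised back-end can be illustrated with a toy sketch. This is not the SHDL network itself: the two Haar-like difference filters, the perceptron back-end, and every name below are assumptions chosen only to show that the front-end is never updated while the back-end learns from labels.

```python
from typing import List, Sequence, Tuple


def fixed_front_end(signal: Sequence[float]) -> List[float]:
    """Hand-designed features with no trainable parameters.

    Mean absolute difference at two scales (a toy stand-in for fixed filters).
    """
    fine = [abs(signal[i + 1] - signal[i]) for i in range(len(signal) - 1)]
    coarse = [abs(signal[i + 2] - signal[i]) for i in range(len(signal) - 2)]
    return [sum(fine) / len(fine), sum(coarse) / len(coarse)]


def score(features: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Linear score computed by the trainable back-end."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def train_back_end(
    samples: Sequence[Tuple[Sequence[float], int]],
    epochs: int = 20,
    lr: float = 0.1,
) -> Tuple[List[float], float]:
    """Supervised perceptron training; labels are +1 or -1.

    Only the weights and bias change. The front-end stays fixed.
    """
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for signal, label in samples:
            features = fixed_front_end(signal)
            if label * score(features, weights, bias) <= 0:  # misclassified
                weights = [w + lr * label * x for w, x in zip(weights, features)]
                bias += lr * label
    return weights, bias


def predict(signal: Sequence[float], weights: Sequence[float], bias: float) -> int:
    """Return +1 or -1 for a new signal."""
    return 1 if score(fixed_front_end(signal), weights, bias) > 0 else -1
```

Because the front-end has nothing to learn, only three numbers are fitted here, which is the sense in which such a design can get by with less labeled data, at a possible cost in accuracy if the hand-picked features are a poor match for the task.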
The overall drone surveillance system relies upon the SHDL network along with two standard machine learning algorithms. The first, called a feature pyramid network, is a common component of object recognition systems and performs the first task of detecting humans in video images. The second, called a support vector machine, uses the information from the SHDL network’s body pose estimations to categorize people as either being violent or nonviolent.
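The three stages described above (detect people, estimate each person's pose, classify the pose) can be sketched as a simple per-frame loop. This is a minimal illustration rather than the researchers' code: the detector and pose estimator are passed in as hypothetical callables, the features are assumed to be the flattened keypoint coordinates, and the final step uses the linear decision rule that a trained linear support vector machine evaluates, with weights assumed to come from earlier training.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[int, int, int, int]  # x, y, width, height in pixels


@dataclass
class Detection:
    box: Box
    keypoints: List[Point]
    violent: bool


def linear_decision(features: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Score a feature vector as a linear SVM does at inference: w . x + b."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return sum(f * w for f, w in zip(features, weights)) + bias


def analyze_frame(
    frame: Any,
    detect_people: Callable[[Any], List[Box]],          # hypothetical stage 1
    estimate_pose: Callable[[Any, Box], List[Point]],   # hypothetical stage 2
    weights: Sequence[float],
    bias: float,
) -> List[Detection]:
    """Run the three stages on one video frame and label each person."""
    results = []
    for box in detect_people(frame):
        keypoints = estimate_pose(frame, box)
        # Assumption: flattened (x, y) coordinates serve as the feature vector.
        features = [coord for point in keypoints for coord in point]
        is_violent = linear_decision(features, weights, bias) > 0
        results.append(Detection(box, keypoints, is_violent))
    return results
```

Keeping the stages as separate callables mirrors the division of labor in the article: any one stage could be swapped out without touching the other two.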
Initial test results suggest that the drone surveillance system can indeed work in realtime by having the Parrot drone offload the heavy-duty data crunching to Amazon’s cloud service. Singh’s colleagues at the Indian Institute of Science Bangalore and National Institute of Technology Warangal handled the drone part of the system.
The Drone Surveillance System highlights violent individuals in red and neutral individuals in cyan. Images: University of Cambridge/National Institute of Technology/Indian Institute of Science/IEEE
But it’s still early days as far as accuracy goes. The drone surveillance system’s accuracy steadily declines from 94 percent—success in recognizing one violent individual—as more and more violent individuals fill the images. That dropoff in accuracy may come from the difficulty of recognizing larger numbers of people who are spread out at varying distances from the drone camera, Singh says. It may also come from miscategorizations of people’s poses.
The fact that the system’s accuracy drops with more people in the video frame raises the question as to how accurate it will be in analyzing large crowds. Performing real-time analyses of many more people than just the 25 interns could strain the system and require even more cloud computing resources and bandwidth.
Furthermore, the initial training dataset based on the simulated brawl among interns may not exactly reflect all the real-life violence that takes place in large crowd riots or terrorist attacks. That means the drone surveillance system’s accuracy in recognizing real-world violence is anyone’s guess until the researchers can test it on such video footage. “People punch in many different ways,” Singh says. “There’s not one or two ways of doing it.”
Still, the researchers are pressing forward. They’re in the process of securing permission from Indian officials to try out their system at two upcoming music festivals. Such real-world tests could help them figure out the limits of the current drone surveillance system’s capabilities, given the expected presence of thousands of people in densely packed crowds. (One student attendee was stabbed at one of the festivals last year.)
Singh is continuing to develop the deep learning models to possibly incorporate crowd modelling. He also anticipates expanding the system’s object recognition capabilities to include being able to spot a person carrying a gun or a bag. For example, having a real-time surveillance system capable of tracking suspicious patterns among people carrying bags might have proven useful in the Boston Marathon bombing case.
There is even the possibility that Singh and his colleagues will move away from trying to have the system recognize specific acts of violence, such as stabbing or kicking, and instead focus on recognizing possible violence in general. Singh wants to see if that may yield a more practical implementation of the system down the line.
If the drone surveillance system’s accuracy does improve to the point of being commercially viable, Singh still envisions humans being in the loop to check out any suspicious activity or possible outbreaks of violence that the drone highlights. The automated surveillance system would help narrow down the range of places a human security guard should look, so that the human brain and eyes can take over and quickly exercise proper judgment about the situation.
"The system is not going to go on its own and fly and kill people," Singh says.
Jeremy Hsu has been working as a science and technology journalist in New York City since 2008. He has written on subjects as diverse as supercomputing and wearable electronics for IEEE Spectrum. When he’s not trying to wrap his head around the latest quantum computing news for Spectrum, he also contributes to a variety of publications such as Scientific American, Discover, Popular Science, and others. He is a graduate of New York University’s Science, Health & Environmental Reporting Program.