The world’s largest police network is evaluating software that would match samples of speech taken from phone calls or social media posts to voice recordings of criminals stored within a massive database shared by law enforcement agencies.
The platform, as described by developers, would employ several speech analysis algorithms to filter voice samples by gender, age, language, and accent. It would be managed by Interpol at its base in Lyon, France, with the goal of increasing the accuracy of voice data and boosting its reliability and judicial admissibility.
The development team completed successful field tests of the system in March and November 2017. Next up is a project review this June in Brussels.
While the system can process any “lawfully intercepted” sound, including ambient conversation, its expected use would be to match voices gleaned from telephone and social media against a “blacklist” database. The samples could come from mobile, landline, or voice-over-Internet-protocol recordings, or from snatches of audio captured from recruitment or propaganda videos posted to social media.
That recorded data essentially becomes a widget on a production line. This file, the captured voice clip, may already include some descriptive metadata added by the law enforcement officials who originally secured it. The software then attempts to add new information about the speaker’s age or accent, for example.
To help with this task, the platform, known as the Speaker Identification Integrated Project (SIIP), would create a template of a given police recording of a phone call, marking the acoustic features that represent the voices on the clip. Those features, or identity vectors, are then used to try to find matches in the database.
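The matching step described above can be illustrated with a small sketch. It assumes each voice has already been reduced to a fixed-length identity vector and that candidates are ranked by cosine similarity, a scoring method commonly used with such vectors. The article does not say how SIIP actually scores matches, so the function names, record labels, and three-dimensional toy vectors here are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector carries no direction to compare
    return dot / (norm_a * norm_b)


def rank_matches(query, database, top_k=3):
    """Return the top_k (record_id, score) pairs, best match first."""
    scored = [
        (record_id, cosine_similarity(query, vector))
        for record_id, vector in database.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy data: real identity vectors have hundreds of dimensions.
database = {
    "record-A": [0.9, 0.1, 0.3],
    "record-B": [-0.2, 0.8, 0.5],
    "record-C": [0.1, -0.7, 0.6],
}
query = [0.8, 0.2, 0.25]

matches = rank_matches(query, database, top_k=2)
for record_id, score in matches:
    print(f"{record_id}: {score:.3f}")
```

A ranked list like this would give an investigator candidates to review rather than a definitive identification. A production system would also need a calibrated score threshold and a far faster search than this linear scan.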
To create the software, developers lined up algorithms, or modules, to sort newly recorded voice samples through a processing chain built on open-source architecture. Interim reports, issued in June 2016, May 2017, and February 2018, say the challenges of building such a system included setting up tools to filter out background noise, enhance voice clarity, and isolate sounds, as well as tools to easily share, gather, and classify data for applications both at police headquarters and in the field.
The point is to be able to match a new recording against a very large database of sound samples, one that may have more than a million records on file. That database would be managed by Interpol; the records would be supplied by the institution’s member law enforcement agencies. These agencies, from 192 countries, would have access to the system.
The platform can also match voice samples taken from social media platforms including Twitter, Google+, LinkedIn, YouTube, and Facebook. By combing through multimedia content based on search criteria such as language relevance and geolocation, the system will tag and process this material, and find similar clips in the database. The software’s video processing engine can extract the audio from an online video, split it into mono, and format it into uncompressed 16 kilohertz WAV files. Audio-only content can also be searched and tagged in this way.
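The audio preparation step described above can be sketched using only the Python standard library. The sketch assumes the audio has already been decoded to interleaved 16-bit PCM at 48 kilohertz, that "split into mono" means one stream per channel, and that a crude block average is an acceptable stand-in for a proper resampling filter. None of these details comes from the SIIP reports, and the helper names are hypothetical.

```python
import io
import math
import struct
import wave

TARGET_RATE = 16000  # 16 kHz, the output rate named in the article


def split_channels(interleaved, n_channels):
    """Deinterleave PCM samples into one list per channel."""
    return [interleaved[ch::n_channels] for ch in range(n_channels)]


def decimate(samples, factor):
    """Reduce the sample rate by an integer factor using block averages.

    Averaging is only a crude low-pass filter; a real pipeline would use
    a proper anti-aliasing filter before downsampling.
    """
    usable = len(samples) - len(samples) % factor
    return [sum(samples[i:i + factor]) // factor
            for i in range(0, usable, factor)]


def to_wav_bytes(samples, rate=TARGET_RATE):
    """Encode mono 16-bit samples as an uncompressed (PCM) WAV file."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes = 16-bit PCM
        wav.setframerate(rate)
        wav.writeframes(struct.pack("<%dh" % len(samples), *samples))
    return buf.getvalue()


# Synthetic stand-in for decoded video audio: 0.1 s of stereo at 48 kHz.
source_rate = 48000
n_frames = 4800
stereo = []
for n in range(n_frames):
    t = n / source_rate
    stereo.append(int(12000 * math.sin(2 * math.pi * 440 * t)))  # left
    stereo.append(int(12000 * math.sin(2 * math.pi * 660 * t)))  # right

channels = split_channels(stereo, 2)
factor = source_rate // TARGET_RATE  # 48000 / 16000 = 3
mono_wavs = [to_wav_bytes(decimate(ch, factor)) for ch in channels]
```

Each entry in `mono_wavs` holds the bytes of one single-channel, 16 kHz WAV file. Extracting and decoding the audio track from a video container is a separate step not shown here.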
Coordinating the project is Verint, an “actionable intelligence” company based in New York and Israel. Verint’s roots are in commercial call recording—think of hearing “this call may be recorded for quality control and training purposes.” The firm worked with Airbus, Singular Logic, and Nuance to develop the system, with keyword spotting components from Sail Labs of Vienna and Swiss research nonprofit IDIAP. Security groups in the Netherlands and the United Kingdom studied the project’s ethical aspects. Input from law enforcement came from Interpol, the Italian Carabinieri, the U.K.’s Metropolitan Police, Germany’s Bundeskriminalamt, and Portugal’s Policia Judiciaria.
As with the broader field of automated voice surveillance, the project is spurring complex reactions. “I consider speech recognition in the hands of police and secret services to be quite dangerous. I have objections,” says Matthias Monroy, a Berlin-based activist who edits a civil rights journal. Monroy has been keeping tabs on the SIIP effort since it launched in 2014.
Paul Johannes, a research associate in commercial law at the University of Kassel and a member of Forum Privatheit, a Berlin-based digital privacy organization, says law enforcement is always on the hunt for tools in a race against new techniques developed by criminal or terror organizations.
But political context is everything, says Maya Wang, a senior researcher and China expert at Human Rights Watch, who recently helped produce a report criticizing the Beijing government’s work to build a database of voice pattern samples enhanced by artificial intelligence. She sees a tripolar context: there is China and its “Wild East” of surveillance, with no meaningful protections, as opposed to stricter rules in Europe, and a looser framework in the United States that is still connected to a vibrant civil society and rule of law. The malign potential for automated voice recognition, Wang argues, depends on where it is used.
Complicating matters, the European Union is about to enact its General Data Protection Regulation (GDPR), a sweeping consumer data privacy package. There are mixed opinions on whether the regulation would affect voice recognition tools such as SIIP. Johannes says the GDPR has a “forgotten twin,” a directive that regulates police or intelligence processing of personal data and sets rules for its free movement.
Many law enforcement agencies already use voice recognition packages. An Interpol survey of 91 departments in 69 countries showed that more than half already run automated speaker identification of some sort.
For example, STC Group, a European subsidiary of the Russia-based Speech Technology Center, offers a voice recognition suite called Voice Grid, deployed in Mexico in 2011 and in Ecuador in 2015. STC makes a point of separating a so-called “voice print” from underlying raw voice data—in the event a database containing the voice prints is hacked, personally identifying data is already stripped out.
Verint and Interpol did not return repeated requests for comment. One of the goals of the system is to improve prospects for using voice recognition in court cases. But if Interpol goes ahead with the SIIP platform, sources say, the distinguishing feature may well be the database.
Geoffrey Stewart Morrison, an associate professor at the Center for Forensic Linguistics at Aston University in Birmingham, U.K., says there is a big difference between using voice data in court and using voice recognition as an investigative tool. Through published works, he and a colleague have drawn sharp lines for speech comparison testimony in court.
The Interpol platform might prove just as useful in narrowing lists of potential suspects as in prosecuting criminals. Morrison says individual law enforcement agencies may already buy existing systems for their own purposes, but they may not share data even within their own countries. Interpol’s role, however, is to ease information-sharing among law enforcement.
This analysis could also be considered a warning, given recent concerns about companies hoovering up data from social media platforms like Facebook. As the activist Monroy points out, the general public has only recently become aware of the huge extent to which written communication can be monitored and filtered for keywords. “They should know that this works with speech as well,” he says.
Michael Dumiak is a Berlin-based writer and reporter covering science and culture and a longtime contributor to IEEE Spectrum. For Spectrum, he has covered digital models of ailing hearts in Belgrade, reported on technology from Minsk and shale energy from the Estonian-Russian border, explored cryonics in Saarland, and followed the controversial phaseout of incandescent lightbulbs in Berlin. He is author and editor of Woods and the Sea: Estonian Design and the Virtual Frontier.