THE INSTITUTE In the race to develop a vaccine for the novel coronavirus, health care providers and scientists must sift through a growing mountain of research, both new and old. But they face several obstacles. The sheer volume of material makes using traditional search engines difficult because simple keyword searches aren’t sufficient to extract meaning from the published research. This is further complicated by the fact that most search engines present research results in visual file formats like PDFs and bitmaps, which are unreadable to typical web browsers.
IEEE Member Peter Staar, a researcher at IBM Research Europe, in Zurich, and manager of the Scalable Knowledge Ingestion group, has built a platform called Deep Search that could help speed along the process. The cloud-based platform combs through literature, reads and labels each data point, table, image, and paragraph, and translates scientific content into a uniform, searchable structure.
The reading function of the Deep Search platform consists of a natural language processing (NLP) tool called the corpus conversion service (CCS), developed by Staar for other information-dense domains. The CCS trains itself on already-annotated documents to create a ground truth, or knowledge base, of how papers in a given realm are typically arranged, Staar says. After the training phase, new papers uploaded to the service can be quickly compared to the ground truth for faster recognition of each element.
Once the CCS has a general understanding of how papers in a field are structured, Staar says, the Deep Search platform presents two options. It can either generate simple results in response to a traditional search query, essentially serving as an advanced PDF reader, or it can generate a report on a specific topic, such as the dosage of a particular drug, with deeper analysis that the group calls a knowledge graph.
“[The] knowledge graph allows us to answer these relatively complex questions that are not able to be answered with just a keyword lookup,” Staar explains.
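To make the idea concrete, here is a minimal sketch of how a knowledge graph can answer a question that a keyword lookup cannot, using the drug-dosage example mentioned above. It is not Deep Search's implementation, which the article does not describe. The `KnowledgeGraph` class, the relation names, and every drug, paper, and dosage in it are hypothetical placeholders.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Toy triple store mapping (subject, relation) to a set of objects."""

    def __init__(self):
        self._edges = defaultdict(set)

    def add(self, subject, relation, obj):
        """Record one fact, e.g. ("drug_X", "studied_in", "paper_1")."""
        self._edges[(subject, relation)].add(obj)

    def follow(self, start, relations):
        """Walk a chain of relations from `start`; return the sorted end nodes."""
        frontier = {start}
        for relation in relations:
            frontier = {
                obj
                for node in frontier
                for obj in self._edges.get((node, relation), set())
            }
        return sorted(frontier)


# Hypothetical facts, as if extracted from three ingested papers.
kg = KnowledgeGraph()
kg.add("drug_X", "studied_in", "paper_1")
kg.add("drug_X", "studied_in", "paper_2")
kg.add("drug_Y", "studied_in", "paper_3")
kg.add("paper_1", "reports_dosage", "10 mg/day")
kg.add("paper_2", "reports_dosage", "20 mg/day")
kg.add("paper_3", "reports_dosage", "5 mg/day")

# "Which dosages have been reported for drug_X?" takes two hops:
# drug -> papers that studied it -> dosages those papers report.
dosages_for_x = kg.follow("drug_X", ["studied_in", "reports_dosage"])
print(dosages_for_x)
```

A keyword search for "drug_X dosage" would only return documents containing those words. The graph traversal returns the linked values themselves, even when they sit in different papers.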
To keep the data in the platform’s knowledge base up to the highest standards possible, Staar says the team bolsters its corpora with trusted, open-source databases such as DrugBank for chemical, pharmaceutical, and pharmacological drug data and GenBank for established and publicly available genetic sequence data.
Deep Search is based on a similar platform that Staar built in 2018 for materials science and for oil and gas research, fields that both faced a deluge of data. Staar recognized that the same solution could be used to parse the tsunami of data about SARS-CoV-2. The platform was designed to be generic enough to be extended to other domains of research.
“Our goal was to help the medical community with a tool that we already had in our hands,” Staar says. Currently, the COVID-19 Deep Search service supports 460 active users and has ingested nearly 46,000 scientific articles.
The platform can even use search queries to divide results according to scientific camp.
“In the oil and gas business, when different philosophies [on environmental impact] collide, you can say, ‘Okay, if you follow a certain stream of thought, then you might be more interested in papers that are associated with this group of people, rather than with that group,’” Staar says.
If the scientific community is divided on a major attribute of SARS-CoV-2, for example, Deep Search might cluster search results around each camp. When a user searches for that attribute, the platform could analyze the wording of their search string and then guide the user to the cluster of results that most closely aligns with the user’s approach.
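The routing step described above can be sketched as a simple similarity comparison between the wording of a query and the wording of each cluster. This is only an illustration under stated assumptions: the article does not say how Deep Search clusters or routes results, so bag-of-words cosine similarity is a stand-in technique, and the camp names, document snippets, and function names are made up.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def route_query(query, clusters):
    """Return (best_cluster_name, scores) for a query.

    `clusters` maps a cluster name to the list of document texts in it.
    """
    query_vec = Counter(tokenize(query))
    scores = {
        name: cosine(query_vec, Counter(tokenize(" ".join(docs))))
        for name, docs in clusters.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical clusters standing in for two scientific camps.
clusters = {
    "camp_a": [
        "aerosol transmission in indoor spaces",
        "ventilation reduces aerosol exposure",
    ],
    "camp_b": [
        "surface contact fomite transmission",
        "handwashing and surface disinfection",
    ],
}

best, scores = route_query("aerosol spread in poorly ventilated rooms", clusters)
print(best)
```

Here the query shares vocabulary only with the first cluster, so it is routed there. A production system would more likely use learned text embeddings than raw word counts, which miss related forms such as "ventilated" and "ventilation".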
This isn’t the first time a pressing global health crisis has prompted scientists to try to streamline the publishing process. A 2010 analysis of literature from the 2003 SARS outbreak found that, despite efforts to shorten wait times for both acceptance and publishing, 93 percent of the papers on SARS didn’t come out until the epidemic had already ended and the bulk of deaths had already occurred.
Unlike their counterparts in 2003, however, present-day epidemic researchers have benefited from the advent of preprint servers such as bioRxiv and medRxiv, which enable uncorrected articles to be shared digitally regardless of acceptance or submission status. Preprints have been around since the early 1990s, but the public health emergency of SARS-CoV-2 prompted a new surge in popularity for the alternative publishing practice, as well as a new round of concern over its impact.
Deep Search capitalizes on the preprint trend to further reduce obstacles to sharing the content of research papers. But it also aims to address one of the chief criticisms of preprints: that without peer review, the average reader may be unable to distinguish high-quality research from low-quality research. Though every new paper has equal weight in the Deep Search algorithms, the volume of data it ingests allows for statistical comparisons among conclusions. Users can easily see whether a result is consistent with previous findings or seems to be an outlier.
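One common way to check whether a new result is consistent with earlier ones is a robust outlier test. The sketch below uses the modified z-score, based on the median and the median absolute deviation (MAD), with the conventional cutoff of 3.5. It is an assumed stand-in, not the statistical method Deep Search uses, which the article does not specify. The numbers are invented, unitless values for some quantity reported across papers.

```python
from statistics import median


def modified_z_score(value, previous):
    """Modified z-score of `value` relative to earlier findings.

    Uses the median and the median absolute deviation (MAD), which are
    less sensitive to extreme values than the mean and standard deviation.
    `previous` must be non-empty.
    """
    center = median(previous)
    mad = median(abs(x - center) for x in previous)
    if mad == 0:
        # All earlier values agree; any different value is maximally unusual.
        return 0.0 if value == center else float("inf")
    return 0.6745 * (value - center) / mad


def is_outlier(value, previous, threshold=3.5):
    """Flag `value` when its modified z-score exceeds the threshold."""
    return abs(modified_z_score(value, previous)) > threshold


# Hypothetical values of one quantity reported by five earlier papers.
previous_findings = [2.1, 2.4, 2.2, 2.6, 2.3]

print(is_outlier(2.5, previous_findings))  # close to earlier findings
print(is_outlier(5.0, previous_findings))  # far from earlier findings
```

A flag like this does not say a paper is wrong, only that its result differs from the bulk of earlier findings and deserves a closer look.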
These relational functions, in which Deep Search sorts, links, and compares data as it returns results, constitute the platform’s signature advantage, Staar says. Developing a treatment molecule, for example, might start with a search to determine which gene to target within the viral RNA.
“If you understand which genes are important, then you can start understanding which proteins are important, which leads you to which kinds of molecules you can build for which kinds of targets,” he says. “That’s what our tool is really built for.”