Geoscience researchers are excited by a new big-data effort to connect millions of hard-won scientific records in databases around the world. When complete, the network will be a virtual portal into the ancient history of the planet.
The project is called Deep-time Digital Earth, and one of its leaders, Nanjing-based paleontologist Fan Junxuan, says it unites hundreds of researchers—geochemists, geologists, mineralogists, paleontologists—in an ambitious plan to link potentially hundreds of databases.
The Chinese government has lined up US $75 million for a planned complex near Shanghai that will house dedicated programming teams and academics supporting the project, and a supercomputer for related research. More support will come from other institutions and companies, with Fan estimating total costs to create the network at about $90 million.
Fossil Record: Deep-time Digital Earth will make it easier for scientists to study fossils such as these. The project is led by paleontologist Fan Junxuan (bottom). Photos: Fan Junxuan
Right now, a handful of independent databases with more than a million records each serve the geosciences. But there are hundreds more out there holding data related to Earth's history. These smaller collections were built with assorted software and documentation formats. They're kept on local hard drives or institutional servers, some decades old, and converted from one format into another as time, funding, and interest allow. The data might be in different languages and is often guided by informal or variably defined concepts. There is no standard for arranging the hundreds of tables or thousands of fields. This archipelago of information is potentially very useful but hard to access.
Fan saw an opportunity while building a database comprising the Chinese geological literature. Once it was complete, he and his colleagues were able to use parallel computing programs to examine data on 11,000 marine fossil species in 3,000 geological sections. The results dated patterns of paleobiodiversity (the appearance, flowering, and extinction of whole species) at a temporal resolution of 26,000 years. In geologic time, that is a very fine resolution.
The Deep-time project planners want to build a decentralized system that would bring these large and small data sources together. The main technical challenge is not to aggregate petabytes of data on centralized servers but rather to script strings of code. These strings would work through a programming interface to link individual databases so that any user could extract information through that interface.
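As a rough illustration of what such a linking layer might look like, the sketch below wraps two independent collections in small adapters that translate local field names and units into one shared schema, then runs a single query across both without copying the data to a central server. Everything in it is hypothetical: the `SourceAdapter` class, the `federated_query` function, the shared field names (`taxon`, `age_ma`, `locality`), and the sample rows are invented for illustration and are not taken from the Deep-time Digital Earth project.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, object]


@dataclass
class SourceAdapter:
    """Wraps one independent database and translates its rows to a shared schema."""
    name: str
    fetch: Callable[[], Iterable[Record]]   # stand-in for a real database call
    field_map: Dict[str, str]               # local field name -> shared field name
    converters: Dict[str, Callable[[object], object]] = field(default_factory=dict)

    def records(self) -> Iterator[Record]:
        for row in self.fetch():
            out: Record = {"source": self.name}
            for local, shared in self.field_map.items():
                value = row[local]
                convert = self.converters.get(shared)
                out[shared] = convert(value) if convert else value
            yield out


def federated_query(adapters: Iterable[SourceAdapter],
                    predicate: Callable[[Record], bool]) -> List[Record]:
    """Ask every linked source in turn and keep the records that match."""
    return [rec for adapter in adapters for rec in adapter.records() if predicate(rec)]


# Two made-up collections with different field names and age units.
source_a = SourceAdapter(
    name="source_a",
    fetch=lambda: [{"species": "Taxon A", "age_Ma": 445.0, "section": "Section 1"}],
    field_map={"species": "taxon", "age_Ma": "age_ma", "section": "locality"},
)
source_b = SourceAdapter(
    name="source_b",
    fetch=lambda: [{"taxon_name": "Taxon B", "age_years": 252_000_000, "site": "Core 12"}],
    field_map={"taxon_name": "taxon", "age_years": "age_ma", "site": "locality"},
    converters={"age_ma": lambda years: years / 1_000_000},  # years -> millions of years
)

# One query, answered by both sources through the same interface.
results = federated_query([source_a, source_b], lambda rec: rec["age_ma"] < 300)
for rec in results:
    print(rec)
```

In this sketch each database keeps its own storage and naming; only the small mapping in each adapter has to be agreed on, which is the part that depends on people settling what the shared fields mean.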
Harmonizing these data fields requires human beings to talk to one another. Fan and his colleagues hope to kick off those discussions in New Delhi, which in March is hosting a big gathering of geoscientists. A linked network could be a gold mine for researchers scouring geologic data for clues.
In a 19th-century building behind Berlin's Museum für Naturkunde, micropaleontology curator David Lazarus and paleobiologist postdoc Johan Renaudie run the group's Neptune database, which is likely to be linked with Deep-time Digital Earth as it develops. Neptune holds a wealth of data on core samples from the world's ocean floors. Lazarus started the database in the late 1980s, before the current SQL language standard was readily available—at that time it was mostly found only on mainframes. Renaudie explains that Neptune has been modified from its incarnation as a relational database using 4th Dimension for Mac, and has been carefully patched over the years.
There are many such patched-up archives in the field, and some researchers start, develop, and care for data centers that drift into oblivion when funding runs out. “We call them whale fall,” Lazarus says, referring to dead whales that sink to the ocean floor.
Creating a database network could keep this information alive longer and distribute it further. It could lead to new kinds of queries, says Mike Benton, a vertebrate paleontologist in Bristol, England, making it possible to combine independent data sources with iterative algorithms that run through millions or billions of equations. Doing this can deliver more precise time resolutions, something that has been very difficult until now. “If you want to analyze the dynamics of ancient geography and climate and its influence on life, you need a high-resolution geological timeline,” Fan says. “Right now this analysis is not available.”
This article appears in the March 2020 print issue as “Data Project Aims to Organize Scientific Records.”
Michael Dumiak is a Berlin-based writer and reporter covering science and culture and a longtime contributor to IEEE Spectrum. For Spectrum, he has covered digital models of ailing hearts in Belgrade, reported on technology from Minsk and shale energy from the Estonian-Russian border, explored cryonics in Saarland, and followed the controversial phaseout of incandescent lightbulbs in Berlin. He is author and editor of Woods and the Sea: Estonian Design and the Virtual Frontier.