Ambitious Data Project Aims to Organize the World’s Geoscientific Records

Deep-time Digital Earth will link hundreds of bespoke scientific databases in one easy-to-search network

3 min read
Photo of a fossilized plant.
Photo: Fan Junxuan

Geoscience researchers are excited by a new big-data effort to connect millions of hard-won scientific records in databases around the world. When complete, the network will be a virtual portal into the ancient history of the planet.

The project is called Deep-time Digital Earth, and one of its leaders, Nanjing-based paleontologist Fan Junxuan, says it unites hundreds of researchers—geochemists, geologists, mineralogists, paleontologists—in an ambitious plan to link potentially hundreds of databases.

The Chinese government has lined up US $75 million for a planned complex near Shanghai that will house dedicated programming teams and academics supporting the project, and a supercomputer for related research. More support will come from other institutions and companies, with Fan estimating total costs to create the network at about $90 million.

Fossil Record: Deep-time Digital Earth will make it easier for scientists to study fossils such as these. The project is led by paleontologist Fan Junxuan (bottom). Photos: Fan Junxuan

Right now, a handful of independent databases with more than a million records each serve the geosciences. But there are hundreds more out there holding data related to Earth's history. These smaller collections were built with assorted software and documentation formats. They're kept on local hard drives or institutional servers, some decades old, and converted from one format into another as time, funding, and interest allow. The data might be in different languages and is often guided by informal or variably defined concepts. There is no standard for arranging the hundreds of tables or thousands of fields. This archipelago of information is potentially very useful but hard to access.

Fan saw an opportunity while building a database comprising the Chinese geological literature. Once it was complete, he and his colleagues were able to use parallel computing programs to examine data on 11,000 marine fossil species in 3,000 geological sections. The results dated patterns of paleobiodiversity (the appearance, flowering, and extinction of whole species) at a temporal resolution of 26,000 years. On geologic timescales, that is a very fine resolution.

The Deep-time project planners want to build a decentralized system that would bring these large and small data sources together. The main technical challenge is not to aggregate petabytes of data on centralized servers but rather to script strings of code. These strings would work through a programming interface to link individual databases so that any user could extract information through that interface.
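
The article does not describe the project's actual interface, so the following is only a minimal sketch of the general idea: each database stays where it is, and a thin adapter translates its local field names and units into a shared vocabulary that one query function can search. Every name here (`Record`, `SourceAdapter`, `federated_query`, the field names, and the sample rows) is hypothetical and made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List


@dataclass(frozen=True)
class Record:
    """One harmonized result in a hypothetical shared vocabulary."""
    taxon: str
    age_ma: float  # age in millions of years
    source: str    # which database the record came from


class SourceAdapter:
    """Wraps one independent database without copying it to a central server.

    field_map translates the shared names ("taxon", "age") into the local
    column names; age_unit_years says how many years one local age unit is.
    """

    def __init__(self, name: str, rows: List[dict],
                 field_map: Dict[str, str], age_unit_years: float) -> None:
        self.name = name
        self.rows = rows  # stand-in for a real local database
        self.field_map = field_map
        self.age_unit_years = age_unit_years

    def query(self, min_age_ma: float, max_age_ma: float) -> Iterator[Record]:
        for row in self.rows:
            # Convert the local age value to millions of years.
            age_ma = row[self.field_map["age"]] * self.age_unit_years / 1_000_000
            if min_age_ma <= age_ma <= max_age_ma:
                yield Record(row[self.field_map["taxon"]], age_ma, self.name)


def federated_query(adapters: List[SourceAdapter],
                    min_age_ma: float, max_age_ma: float) -> List[Record]:
    """Ask every linked source the same question and merge the answers."""
    hits: List[Record] = []
    for adapter in adapters:
        hits.extend(adapter.query(min_age_ma, max_age_ma))
    return sorted(hits, key=lambda r: r.age_ma)


# Two invented sources that disagree on field names and age units.
ADAPTERS = [
    SourceAdapter(
        "db_a",
        [{"species": "Taxon A", "age_myr": 445.0},
         {"species": "Taxon B", "age_myr": 252.0}],
        {"taxon": "species", "age": "age_myr"},
        age_unit_years=1_000_000,
    ),
    SourceAdapter(
        "db_b",
        [{"taxon_name": "Taxon C", "age_years": 443_000_000},
         {"taxon_name": "Taxon D", "age_years": 66_000_000}],
        {"taxon": "taxon_name", "age": "age_years"},
        age_unit_years=1,
    ),
]
```

Calling `federated_query(ADAPTERS, 440.0, 450.0)` returns the matching records from both sources in one list, ordered by age. The design choice worth noting is that the hard part lives in the `field_map` and unit entries, which is exactly the harmonization that researchers have to agree on by talking to one another.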

Harmonizing these data fields requires human beings to talk to one another. Fan and his colleagues hope to kick off those discussions in New Delhi, which in March is hosting a big gathering of geoscientists. A linked network could be a gold mine for researchers scouring geologic data for clues.

In a 19th-century building behind Berlin's Museum für Naturkunde, micropaleontology curator David Lazarus and paleobiologist postdoc Johan Renaudie run the group's Neptune database, which is likely to be linked with Deep-time Digital Earth as it develops. Neptune holds a wealth of data on core samples from the world's ocean floors. Lazarus started the database in the late 1980s, before the current SQL language standard was readily available; at that time it was found mostly on mainframes. Renaudie explains that Neptune has been modified from its original incarnation as a relational database using 4th Dimension for Mac, and has been carefully patched over the years.

There are many such patched-up archives in the field, and some researchers start, develop, and care for data centers that drift into oblivion when funding runs out. “We call them whale fall,” Lazarus says, referring to dead whales that sink to the ocean floor.

Creating a database network could keep this information alive longer and distribute it further. It could lead to new kinds of queries, says Mike Benton, a vertebrate paleontologist in Bristol, England, making it possible to combine independent data sources with iterative algorithms that run through millions or billions of equations. Doing this can deliver more precise time resolutions, which until now have been very difficult to achieve. “If you want to analyze the dynamics of ancient geography and climate and its influence on life, you need a high-resolution geological timeline,” Fan says. “Right now this analysis is not available.”

This article appears in the March 2020 print issue as “Data Project Aims to Organize Scientific Records.”
