It took two centuries to fill the U.S. Library of Congress in Washington, D.C., with more than 29 million books and periodicals, 2.7 million recordings, 12 million photographs, 4.8 million maps, and 57 million manuscripts. Today it takes about 15 minutes for the world to churn out an equivalent amount of new digital information. It does so about 100 times every day, for a grand total of five exabytes annually. That's an amount equal to all the words ever spoken by humans, according to Roy Williams, who heads the Center for Advanced Computing Research at the California Institute of Technology, in Pasadena.
While this stunning proliferation of information underscores the ease with which we can create digital data, our capacity to make all these bits accessible in 200 or even 20 years remains a work in progress.
In an era when the ability to read a document, watch a video, or run a simulation could depend on having a particular version of a program installed on a specific computer platform, the usable life span of a piece of digital content can be less than 10 years. That's a recipe for disaster when you consider how much we rely on stored information to maintain our scholarly, legal, and cultural record and to help us with, and profit from, our digital labor. Indeed, the ephemeral nature of both data formats and storage media threatens our very ability to maintain scientific, legal, and cultural continuity, not on the scale of centuries but, given the unrelenting pace of technological change, from one decade to the next.
At the Massachusetts Institute of Technology Libraries, in Cambridge, where I am associate director for technology, we are attacking the problem of maintaining and sharing digital content over the long haul with a project called DSpace, which we embarked on with Hewlett-Packard Co., in Palo Alto, Calif., in 2000. For this digital repository we have built a simple, open-source software application that not only accepts digital materials and makes them available on the Web but also puts them into a data-management regime that helps to preserve them for generations to come.
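The data-management regime such a repository applies can be pictured as pairing each submitted file with descriptive metadata, a persistent identifier, and a checksum recorded at deposit time. Here is a toy sketch of that idea in Python; the field names and the handle scheme are hypothetical illustrations, not DSpace's actual data model or API.

```python
import hashlib
import uuid
from dataclasses import dataclass, field

@dataclass
class Item:
    """A deposited digital object: the bits plus what we need to keep them usable."""
    title: str
    creator: str
    media_type: str                    # e.g. "application/pdf"
    content: bytes
    handle: str = field(default="")    # persistent identifier, assigned at ingest
    checksum: str = field(default="")  # fixity value recorded at ingest

def ingest(item: Item) -> Item:
    """Assign a persistent identifier and record a checksum at deposit time."""
    # Hypothetical handle prefix, for illustration only.
    item.handle = f"hdl:1721.1/{uuid.uuid4().hex[:8]}"
    item.checksum = hashlib.sha256(item.content).hexdigest()
    return item
```

The point of the sketch is that the repository captures, at the moment of deposit, everything a future curator needs: what the object is, what format it is in, a stable name to cite it by, and a fixity value to audit it against.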
The problem of digital preservation confronts everyone, even people who don't suspect that it could ever affect them. For instance, people who created their Ph.D. dissertations in WordStar in the mid-1980s can no longer read them. And there's the person who posted this question to the Obsolete Computer Museum's Web site: "I have years of letters to family and friends written by my deceased mother that are on [1983 IBM] DisplayWriter 8-inch disks. Can anyone tell me how to transfer them to a PC?"
Now consider the challenge on the scale of centuries, applied to the vast amounts of data being generated today that could be useful to tomorrow's researchers, executives, and public officials. To avoid handicapping future generations with huge blanks in the historical record, digital archivists are already fighting a pitched battle against a problem called bit rot--the well-documented degradation of data on magnetic storage media due to physical factors, such as alpha particles emitted from computer chips.
But saving raw data solves only part of the preservation problem. We also want to be able to read, play, or watch these bits when we need to. Then there are pesky legal obligations, which demand that we be able to guarantee that certain records haven't been altered by human hands or computer malfunction.
Awareness of the problem is growing rapidly, however, especially in large organizations. As official government and corporate records become entirely digital, certain obligations to keep these records around for future scrutiny must be met. In the United States, for example, the Sarbanes-Oxley Act of 2002 requires that business records, including electronic records and e-mail, be saved for "not less than five years." In some industries, such as pharmaceuticals, the regulations for record retention are much longer--30 years or more.
In Europe, the Council of Europe Convention on Cybercrime, an agreement that addresses problems posed by criminals such as identity thieves, pedophiles, and terrorists, contains provisions for governments to compel Internet service providers to preserve data that could be used as evidence in a court of law.
These new requirements, along with an increasing dependence on digital content across the board, have spurred companies, governments, and universities to devise or acquire ways to preserve just about everything stored as bits. The potential magnitude of the problem is staggering, encompassing literally every means people use to record and store information. That includes books, journals, maps, music, movies, e-mail, corporate records, government documents, course materials, data sets, and databases. It also covers scientific models, lab notebooks, parish records, family histories, and global weather data--even last night's news or Red Sox game broadcasts.
DSpace is storing and preserving materials just like these at MIT and 100 other organizations worldwide. Among them are Cornell University; the University of Toronto; the University of Cambridge, in England; the Australian National University; and the Hong Kong University of Science and Technology. Like Linux and other open-source software projects, DSpace has a growing group of committed programmers distributed across the globe who continually maintain and improve it.