Preserving digital data for the future of eScience
From the August 30, 2008 issue of Science News
By Alex Szalay
Libraries and other archives of physical culture have been struggling for decades to preserve diverse media — from paper to eight-track tape recordings — for future generations. Scientists are falling behind the curve in protecting digital data, threatening the ability to mine new findings from existing data or validate research analyses. Johns Hopkins University cosmologist Alex Szalay and Jim Gray of Microsoft, who was lost at sea in 2007, spent much of the past decade discussing challenges posed by data files that will soon approach the petabyte (10¹⁵, or quadrillion, byte) scale. Szalay commented on those challenges in Pittsburgh during an address at this summer’s Joint Conference on Digital Libraries and in a follow-up interview with senior editor Janet Raloff.
Scientific data approximately double every year, due to the availability of successive new generations of inexpensive sensors and exponentially faster computing. It’s essentially an “industrial revolution” in the collecting of digital data for science.
But every year it takes longer to analyze a week’s worth of data, because even though computing speed and data collection roughly double annually, the ability to perform software analyses doesn’t. So analyses bog down.
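A back-of-the-envelope sketch of that mismatch; the growth rates below are illustrative assumptions, not measured figures. If incoming data double every year while analysis capability grows more slowly, the time needed to work through each year’s data keeps climbing.

```python
# Illustrative only: the growth rates below are assumptions, not measured figures.
data_per_year = 1.0          # relative volume of new data arriving in year 0
analysis_per_year = 1.0      # relative volume the analysis software can handle per year

for year in range(6):
    lag = data_per_year / analysis_per_year
    print(f"year {year}: this year's data needs {lag:.1f}x the analysis time of year 0")
    data_per_year *= 2.0        # data roughly double annually (as stated above)
    analysis_per_year *= 1.4    # assumed slower improvement in analysis capability
```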
It also becomes increasingly hard to extract knowledge. At some point you need new indexes to help you search through these accumulating mountains of data, performing parallel data searches and analyses.
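A minimal sketch of what such an index buys you, assuming a catalog held as a list of (ra, dec) positions in degrees; the one-degree declination zones and the function names are illustrative, not the survey’s actual index scheme. The sky is binned once, so later searches only touch the relevant zones, and each zone lookup is independent and could be farmed out to a separate worker.

```python
from collections import defaultdict

ZONE_HEIGHT_DEG = 1.0  # illustrative zone size

def build_zone_index(catalog):
    """Bin objects into coarse declination zones: one pass, reused by every later search."""
    index = defaultdict(list)
    for obj_id, (ra, dec) in enumerate(catalog):
        index[int(dec // ZONE_HEIGHT_DEG)].append((obj_id, ra, dec))
    return index

def search(index, ra_min, ra_max, dec_min, dec_max):
    """Touch only the zones overlapping the requested box; each zone is independent,
    so these per-zone lookups can run in parallel."""
    zones = range(int(dec_min // ZONE_HEIGHT_DEG), int(dec_max // ZONE_HEIGHT_DEG) + 1)
    hits = []
    for z in zones:
        hits += [obj for obj in index.get(z, [])
                 if ra_min <= obj[1] <= ra_max and dec_min <= obj[2] <= dec_max]
    return hits

catalog = [(10.7, 41.3), (150.1, 2.2), (151.0, 2.4)]  # toy (ra, dec) positions
index = build_zone_index(catalog)
print(search(index, 149.0, 152.0, 2.0, 3.0))
```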
Like an automated factory, we need to process and calibrate data, transform them, reorganize them, analyze them and then publish our findings. To cope, we need laboratory information-management systems for these data, and we need to automate more, creating workflow tools to manage our pipelines of incoming data.
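A minimal sketch of such a workflow tool, with made-up stage names standing in for real processing steps: each incoming batch flows through the same ordered stages, and the bookkeeping is recorded automatically rather than by hand.

```python
# Illustrative pipeline stages; the functions are placeholders for real processing steps.
def calibrate(batch):   return {**batch, "calibrated": True}
def transform(batch):   return {**batch, "units": "standard"}
def reorganize(batch):  return {**batch, "indexed": True}
def analyze(batch):     return {**batch, "result": sum(batch["values"]) / len(batch["values"])}

PIPELINE = [calibrate, transform, reorganize, analyze]

def run_pipeline(batch, log):
    for stage in PIPELINE:
        batch = stage(batch)
        log.append((batch["id"], stage.__name__))  # record provenance for each stage
    return batch

log = []
incoming = [{"id": "night-001", "values": [3.0, 4.0, 5.0]},
            {"id": "night-002", "values": [2.0, 8.0]}]
results = [run_pipeline(b, log) for b in incoming]
print(results[0]["result"], len(log), "pipeline steps logged")
```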
In many fields, data are growing so fast that there is no time to push them into some central repository. Increasingly, then, data will be distributed in a pretty anarchic system. We’ll have to have librarians organize these data, or our data systems will have to do it themselves.
And because there can be too much data to move around, we need to take our analyses to the data.
We can put digital data onto a protected system and then interconnect it via computer networks to a space in which users can operate remotely from anywhere in the world. Users get read-only privileges, so they cannot make any changes to the main database.
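A minimal sketch of that arrangement, assuming the shared catalog lives in a local SQLite file; the file and table names are placeholders, not the archive’s real layout. The connection is opened read-only, so queries succeed while any attempt to change the master tables fails.

```python
import sqlite3

# Open the shared catalog read-only (mode=ro): reads succeed, writes raise an error.
# "survey_catalog.db" and the "objects" table are placeholder names for this sketch.
conn = sqlite3.connect("file:survey_catalog.db?mode=ro", uri=True)

rows = conn.execute("SELECT ra, dec FROM objects WHERE dec BETWEEN 2.0 AND 3.0").fetchall()
print(len(rows), "objects returned")

try:
    conn.execute("DELETE FROM objects")   # rejected: the database is read-only
except sqlite3.OperationalError as err:
    print("write blocked:", err)
```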
For the Sloan Digital Sky Survey data, we have been giving an account to anyone with an e-mail address. People with accounts can extract, customize and modify the data they use, but they have to store them in their own data space. We give them each a few gigabytes.
We currently have 1,600 users who work with [Sloan data] on a daily basis. Those data become a new tool. Instead of pointing telescopes at the sky, users can “point” at the data collected from some portion of the sky and analyze what they “see” in this virtual universe.
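A hedged sketch of that usage pattern; the directory name and the quota below are placeholders, not the survey’s actual policy. A user pulls one patch of the “virtual sky” into their own small data space, where they are free to modify it, subject to a few-gigabyte allowance.

```python
import csv, os

USER_DIR = "mydb_alex"        # the user's personal data space (placeholder name)
QUOTA_BYTES = 2 * 10**9       # assumed "few gigabytes" allowance

def current_usage(path):
    return sum(os.path.getsize(os.path.join(path, f)) for f in os.listdir(path))

def extract_region(catalog_rows, ra_min, ra_max, dec_min, dec_max, outname):
    """Copy one patch of the shared catalog into the user's own space, honoring the quota."""
    os.makedirs(USER_DIR, exist_ok=True)
    subset = [r for r in catalog_rows
              if ra_min <= r["ra"] <= ra_max and dec_min <= r["dec"] <= dec_max]
    if current_usage(USER_DIR) > QUOTA_BYTES:
        raise RuntimeError("personal data space is full")
    with open(os.path.join(USER_DIR, outname), "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=["ra", "dec"])
        writer.writeheader()
        writer.writerows(subset)
    return len(subset)

catalog = [{"ra": 150.1, "dec": 2.2}, {"ra": 10.7, "dec": 41.3}]
print(extract_region(catalog, 149.0, 152.0, 2.0, 3.0, "my_patch.csv"), "objects saved")
```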
This is leading to a new type of eScience, where people work with data, not physical tools. Once huge data sets are created, you can expect that people will find ways to mine them in ways we never could have imagined.
But key to its success is a new paradigm in publishing, where people team up to publish raw data, perhaps in an overlay journal or as supplements to research papers. Users would be able to tag the data with annotations, giving these data added value….
The Sloan Digital Sky Survey was to be the most detailed map of the northern sky. We thought it would take five years. It took 16. Now we have to figure out how to publish the final data — around 100 terabytes [0.1 petabyte].
The final archiving of the data is in progress. There are going to be paper and digital archives, managed by the University of Chicago and Johns Hopkins libraries.
Today, you can scan one gigabyte of data or download it with a good computer system in a minute. But with current technologies, storing a petabyte would require about 1,500 hard disks, each holding 750 gigabytes. That means it would take almost three years to copy a petabyte database — and cost about $1 million.
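The back-of-the-envelope arithmetic behind those figures, using only the numbers quoted above (1 gigabyte per minute, 750-gigabyte disks); the rounder values in the text presumably allow for real-world overhead, redundancy and spare capacity.

```python
PETABYTE_GB = 1_000_000          # 10**15 bytes expressed in gigabytes
DISK_GB = 750                    # capacity of one hard disk, per the text
SCAN_RATE_GB_PER_MIN = 1         # scan/download rate quoted above

disks_needed = PETABYTE_GB / DISK_GB
copy_minutes = PETABYTE_GB / SCAN_RATE_GB_PER_MIN
copy_years = copy_minutes / (60 * 24 * 365)

print(f"disks: {disks_needed:.0f}")          # ~1,333 disks before any redundancy
print(f"copy time: {copy_years:.1f} years")  # ~1.9 years of continuous copying
```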
We generally try to geoplex, which means keeping multiple copies at remote geographic locations. That way, if there is a fire here or a meltdown there, backup copies are unlikely to be affected. We’re also trying to store data on different media. Eventually, I think we’ll probably load data on DVDs or something, which can go into cold storage. We’ll still have to recopy them periodically if we want digital data to survive a century or more.
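One small piece of that preservation routine, sketched under the assumption that the replicas are plain files on disk (the paths below are placeholders for copies at remote sites): compare checksums of the geographically separated copies so that silent corruption is caught before the next recopying cycle.

```python
import hashlib

def checksum(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in 1 MB chunks so huge files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Placeholder paths standing in for geoplexed copies at two sites.
replicas = ["site_baltimore/sdss_chunk_017.dat", "site_chicago/sdss_chunk_017.dat"]
sums = {path: checksum(path) for path in replicas}

if len(set(sums.values())) == 1:
    print("replicas agree")
else:
    print("mismatch, schedule a recopy:", sums)
```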
This is something that we have not had to deal with so far. But it’s coming — the need to consider and plan for curation as data are collected. And it’s something that the National Science Foundation is looking at: standards for long-term digital-data curation.