Mining the Sky

Taking some big bytes of the universe

The next really big observatory won’t sit on a mountaintop or out in the desert. Nor will it fly aboard a spacecraft.

Panoramic view of the sky as seen by the Digital Palomar Observatory Sky Survey. A proposed National Virtual Observatory would integrate this and a multitude of images and spectra from other surveys into one huge database, enabling astronomers–and the public–to explore the universe from their own computers. S.G. Djorgovski, Digital Media Center/Caltech

A small portion of the sky, toward the Virgo cluster of galaxies, seen in visible-light, far-infrared, and radio wavelengths. Each image comes from a different sky survey. Users of the virtual observatory could explore such combinations of sky surveys to attain a panchromatic view of the universe. Djorgovski and Digital Media Center/Caltech

Ask and ye shall receive: Diagram shows the information flow in the proposed National Virtual Observatory, from a detailed question asked by an astronomer to the answer (at right). Computers close to the archived data, rather than the astronomer’s computer, perform the processing. Hanisch/Space Telescope Science Institute

Astronomers seeking the best views of the heavens have traditionally trekked to such remote outposts as the top of an extinct volcano in Hawaii or a desert in northern Chile. When they have needed to observe, say, one patch of sky over a broad range of wavelengths, they have had to make separate observations at several sites. And to use one of the world’s biggest light detectors, they could only hope to be lucky enough to get a few hours of its precious time.

If an ambitious new project proves successful, however, those difficulties may fade into memory. With the unprecedented torrent of raw images and spectra flowing into databases from the world’s telescopes and a novel melding of powerful computers, whatever an astronomer desires may soon be just a click away.

At a meeting of the American Astronomical Society in San Diego last month, scientists described plans for a National Virtual Observatory (NVO), a mammoth, ever-expanding archive of images, spectra, and other information covering the entire sky. That multi-wavelength database, a one-stop-shopping emporium of several large sky surveys, could usher in a new age of discovery, says George Djorgovski of the California Institute of Technology in Pasadena.

“There may be phenomena that we’ve never seen before because we didn’t have this kind of database to work with,” notes Robert J. Hanisch of the Space Telescope Science Institute in Baltimore.

A prototype of the virtual observatory could be in place in just 18 months. If NASA and the National Science Foundation come up with enough cash–estimates vary between $65 million and several times that amount–a complete system could begin operation in 5 years. It would contain more bytes of information than all the books stored in the Library of Congress.

Enabling astronomers at their desktops to compare and study images of galaxies and stars over a multitude of wavelengths, the virtual observatory would permit researchers at a small teaching college or in a third-world country to make discoveries as easily as someone at Harvard or Princeton. In principle, any knowledgeable person mining the cosmic database could strike gold–a new quasar, a strange star, or some completely novel class of objects.

“For all the clever people who don’t have access to a big telescope, NVO will allow them to do first-rate observational astronomy,” says Djorgovski.

Sky survey and space mission data

Data from spacecraft missions and ground-based sky surveys have been posted on the Internet for a decade. Yet information from each survey is stored and described in its own characteristic way. Researchers sometimes require months to recast the data into a form they can analyze. X-ray astronomers, for example, may refer to a particular wavelength of light in terms of angstroms, another team may describe it in nanometers, and still another in terms of the color filter they used to record that radiation.

By creating a universal language, an astronomical version of Esperanto, NVO would enable scientists to compare different sets of data in a few minutes rather than a few months, notes Hanisch.
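To make the unit problem concrete, here is a minimal Python sketch of how wavelengths reported in different units might be put on one common scale. The table and the function name `to_nanometers` are illustrative assumptions, not part of any NVO design.

```python
import math

# Hypothetical lookup table: how many nanometers one of each unit contains.
NANOMETERS_PER_UNIT = {
    "angstrom": 0.1,
    "nanometer": 1.0,
    "micron": 1.0e3,
    "centimeter": 1.0e7,
}

def to_nanometers(value, unit):
    """Convert a wavelength given in `unit` to nanometers."""
    if unit not in NANOMETERS_PER_UNIT:
        raise ValueError(f"unknown wavelength unit: {unit!r}")
    return value * NANOMETERS_PER_UNIT[unit]

# The same scale now covers an optical, an infrared, and a radio measurement.
optical = to_nanometers(5000, "angstrom")
infrared = to_nanometers(2, "micron")
radio = to_nanometers(20, "centimeter")
```

A wavelength recorded only as the name of a color filter would need an extra lookup table of filter passbands, which this sketch leaves out.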

“What we’re trying to do is meld together all of these [sky surveys] and provide a seamless interface . . . so that they appear to be one uniform, consistent data set,” explains computer scientist Jim Gray of Microsoft Research in San Francisco.

That effort couldn’t have come at a more critical time. Over the past decade, the amount of data on stars, galaxies, and other members of the cosmic zoo has doubled every 2 years, and that astronomical trend seems to be continuing.
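The arithmetic behind that doubling is easy to check. The short Python sketch below assumes only a steady doubling every 2 years, and the function name `growth_factor` is made up for illustration.

```python
def growth_factor(years, doubling_time_years=2.0):
    """Factor by which a quantity grows if it doubles every `doubling_time_years`."""
    return 2.0 ** (years / doubling_time_years)

# Over one decade, doubling every 2 years means five doublings.
decade_growth = growth_factor(10)
```

Under that assumption, a decade sees five doublings, so the data pile grows 32-fold.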

In 1995, the Hubble Space Telescope stared at a tiny patch of the northern sky for 2 weeks, taking the deepest and faintest images ever recorded in visible light. Many astronomers have used the data compiled from that region of sky, known as the Hubble Deep Field North, as a basis for further studies at a variety of wavelengths (SN: 1/20/96, p. 36).

“We did an extremely careful set of observations over a rather small but specific area of the sky, and we made it available to everyone,” notes Ethan J. Schreier of the Space Telescope Science Institute. “Now, the NVO would take that [concept] to a whole new level.”

Since the completion of the Hubble Deep Field North, several sky surveys have intensified the flood of data. These include the Two-Micron All-Sky Survey in the infrared, the Digital Palomar Observatory Sky Survey (DPOSS), a digital rendition of photographic plates taken at the Palomar Observatory in California, and two radio-wavelength studies, the Very Large Array’s Sky Survey and the Faint Images of the Radio Sky at Twenty Centimeters.

But the mother of all cosmic censuses is the Sloan Digital Sky Survey (SN: 6/12/99, p. 379). Using a telescope at Apache Point, N.M., to observe the entire northern sky in five different colors, this 5-year study will ultimately compile some 40 terabytes of data, 10 times more than any survey that has come before it. Only in its first full year of operation, Sloan has already discovered 100,000 quasars, including 40 of the most distant ones. Astronomers expect it to record the colors and shapes of more than a billion galaxies.

Another survey that has been proposed would add to the rising tide of data. It would image the entire sky every 4 days, providing an unprecedented record of how stars and galaxies vary over time.

“We’re getting to the point where the most precious resource becomes human attention,” says Gray. “Most of the bytes of the Sloan survey will never be seen by humans. We’re talking about amounts of data that are far beyond the capabilities of people to deal with. This is a different style of astronomy than people have done in the past.”

Traditionally, astronomers receive only a few hours of observing time on expensive, overbooked telescopes. The researchers take their data, process it, and then write a journal article reporting the results. That approach sharply contrasts with the Sloan survey, in which an army of more than 70 astronomers spent 12 years designing an instrument, getting it to work, and building a network so that anyone could look at the results.

A virtual observatory won’t replace telescopes–they are, after all, the devices that provide the data. But it could make trips to a mountaintop observatory less frequent, help astronomers hone their research questions, and let them obtain more knowledge from the data collected, notes Hanisch.

Djorgovski says that he first sensed the need for a shift in the practices of astronomy several years ago when he and his colleagues started digitizing photographic plates taken at the Palomar Observatory. “We didn’t want these photographic images to go to waste, and it became totally clear to me that this is the way to go in astronomy–to have an information-intensive database that is easy to access,” he recalls.

Djorgovski notes that as recently as 1998, only a handful of astronomers, including Hanisch and Alexander S. Szalay of Johns Hopkins University in Baltimore, were working on the concept of a virtual observatory. Last year, however, a National Academy of Sciences panel named the NVO one of the highest-priority projects for astronomers to build over the next decade. Since that report, scores of researchers have joined the effort, he says.

The virtual observatory also has piqued the interest of computer scientists. Consider a database chock-full of intriguing objects–stars, galaxies, and quasars–each with more than 100 attributes. It’s a wonderful proving ground for testing computational theories, says Gray. He proposes that novel types of hardware will be essential for sifting through the data. One possibility is smart disks–disks endowed with enough memory and command capabilities to act as a supercomputer.

Looking for unusual patterns

Using computers to look for unusual patterns in mountains of data is becoming more and more common both in science and in other venues, notes Gray. For years, biologists have used data-mining techniques to scan massive databases of genetic information in search of patterns that can shed light on evolution or reveal the basis of diseases. Credit card companies now rely on data-mining to rapidly identify unusual spending patterns that may reflect the theft of a credit card.

Undoubtedly the NVO will also rely on data-mining techniques. The virtual observatory will require parallel computing, in which a cluster of workstations tackles a problem simultaneously. To get quick answers to queries, much of the computer processing should lie close to the stored data. There’s simply too much data to efficiently transfer over communication lines.
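A back-of-the-envelope calculation shows why moving the data is impractical. The Python sketch below takes the 40-terabyte size of the Sloan survey from this article; the link speed of 100 megabits per second is purely an assumed figure, and the function name `transfer_days` is hypothetical.

```python
SECONDS_PER_DAY = 86_400

def transfer_days(size_bytes, link_bits_per_second):
    """Days needed to move `size_bytes` over a link, ignoring all overhead."""
    seconds = size_bytes * 8 / link_bits_per_second
    return seconds / SECONDS_PER_DAY

sloan_bytes = 40e12          # 40 terabytes, as quoted for the Sloan survey
assumed_link = 100e6         # assumption: a 100-megabit-per-second connection
days_needed = transfer_days(sloan_bytes, assumed_link)
```

At that assumed speed, the full transfer would take a little over five weeks of uninterrupted downloading, and a slower connection would stretch it proportionally.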

So, in many cases, processors at the site of the stored data will distill perhaps a single number from terabytes of information. For example, an astronomer at a distant workstation might tap in a few commands and get back a figure, say, for the brilliance of a newly discovered quasar, rather than having to slowly download an immense amount of information and process it locally.
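The idea of distilling a single number close to the data can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three lists stand in for data held at separate archive sites, the values are made up, and worker threads play the part of the processors sitting next to each archive.

```python
from concurrent.futures import ThreadPoolExecutor

def summarize_chunk(chunk):
    """Runs 'near the data': reduce one chunk of brightness values to (total, count)."""
    return sum(chunk), len(chunk)

def mean_brightness(chunks):
    """Coordinator: combine the tiny per-chunk summaries into a single number."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(summarize_chunk, chunks))
    total = sum(t for t, _ in partials)
    count = sum(n for _, n in partials)
    if count == 0:
        raise ValueError("no measurements to average")
    return total / count

# Made-up measurements standing in for data stored at three archive sites.
archive_chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
result = mean_brightness(archive_chunks)
```

Only two numbers per site travel back to the astronomer's machine, however large each site's holdings are.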

But Gray sees differences between the needs of astronomers and other data miners. Although computer programs designed for genetics research and astronomy must sift through about the same mind-numbing amount of information, the genetic patterns expected are more clearly defined. Astronomers may be looking for much more complex, larger-scale, and subtler patterns, such as the distribution of galaxies over the breadth of the universe, he says.

Another difference: Financial interests keep much of the accruing genetic data under wraps, but almost all astronomical observations become public, notes Gray.

The trend to make astronomical data generally available is likely to become even stronger, predicts Djorgovski. “When you cover the entire northern sky with several surveys, there’s a zillion things you can do,” he says. Even if a few researchers insist on keeping information for their own perusal, the vast majority of data will be widely available. Says Djorgovski: “I’m not worried.”

Instead, he says, he has a deeper and more philosophical concern. “When people think of NVO, they think of a library of data, the sort of archive in which you can ask questions like ‘Tell me everything you know about the galaxy NGC 164.’

“The dumbest thing we could possibly do is just extrapolate from the stuff we already know,” he continues. “The NVO can’t just be a glorified library. This quantitative change in the amount of data has to drive qualitative changes in the way we do astronomy.

“To me, the most interesting part of this is . . . What new questions can I ask about the universe?”