Mining the Tagged Web

Searching the World Wide Web for authoritative sources of information about a given topic can be a daunting task. Consulting Google to track down “jaguar,” for example, generates an alarming list of more than 7 million documents—a mad muddle of entries about cars, animals, sports teams, computers, and a town in Poland.

One reason for Google’s current success as a search engine, however, is its uncanny ability to place relevant documents high in its listings. An important component of Google’s winning recipe for judging relevance is an algorithm that tabulates “votes” on a Web page’s importance. Each link to a page counts as a vote of support for that page. Pages to which many other pages point rank higher than those to which few or no pages point.
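
Google's exact recipe is proprietary, but the vote-tabulating idea corresponds to the published PageRank scheme, which can be sketched in a few lines. In the rough sketch below, the toy link graph, the damping factor of 0.85, and the page names are illustrative assumptions, not Google's actual data or settings.

```python
# A rough sketch of link-based "voting" in the spirit of the published
# PageRank idea. The link graph and damping factor are illustrative
# assumptions, not Google's actual data or settings.

def rank_pages(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it points to."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, targets in links.items():
            if not targets:
                continue  # pages with no outgoing links cast no votes in this sketch
            share = damping * rank[page] / len(targets)  # each outgoing link is one "vote"
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Toy example: several pages point to "jaguar-cars", so it ends up ranked highest.
links = {
    "home": ["jaguar-cars", "jaguar-cats"],
    "fan-site": ["jaguar-cars"],
    "blog": ["jaguar-cars", "fan-site"],
    "jaguar-cats": ["home"],
    "jaguar-cars": [],
}
print(sorted(rank_pages(links).items(), key=lambda kv: -kv[1]))
```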

In effect, Google takes advantage of the Web’s intricate structure, and this structure itself has been the target of considerable research.

Several years ago, researchers at the IBM Almaden Research Center in San Jose, Calif., began an effort to study the Web as a mathematical graph—a collection of nodes (representing Web pages) and lines (representing hyperlinks). They were interested in studying various properties of this graph, including its diameter and connectedness, to obtain insights into algorithms for crawling and searching the Web and to characterize the Web’s sociological evolution.
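
The graph abstraction is easy to make concrete. In the small sketch below, the handful of hyperlink records is invented for illustration: each crawled page becomes a node and each hyperlink a directed edge in an adjacency list, and questions about diameter and connectedness then become questions about paths through that structure.

```python
from collections import defaultdict

# Sketch: turning crawl records into a directed graph, with pages as nodes
# and hyperlinks as directed edges. The (source, target) pairs are invented.
hyperlinks = [
    ("a.example/index", "b.example/news"),
    ("a.example/index", "c.example/shop"),
    ("b.example/news", "a.example/index"),
    ("d.example/blog", "b.example/news"),
]

graph = defaultdict(list)   # adjacency list: page -> pages it links to
pages = set()
for source, target in hyperlinks:
    graph[source].append(target)
    pages.update((source, target))

print(len(pages), "pages,", sum(len(v) for v in graph.values()), "links")
```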

To obtain data, the researchers conducted Web crawls that encompassed 200 million pages and 1.5 billion hyperlinks. They confirmed that the number of links per page follows a simple mathematical relationship known as a power law. In essence, most pages incorporate just a few outgoing links, whereas a few pages have a huge number.

The power-law relationship indicates that the probability of a Web user coming across a document with a large number of outgoing links is significantly higher than it would be if the links were randomly distributed.
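
Stated a bit more formally, a power law for the number of outgoing links k has the general form shown below; the exponent γ and the mean λ are symbols for illustration, since the article does not quote the values the IBM team measured.

```latex
% Power-law degree distribution (general form; exponent value not quoted here)
P(k) \;\propto\; k^{-\gamma}, \qquad \gamma > 1
% versus the Poisson distribution produced by purely random linking
P(k) \;=\; \frac{e^{-\lambda}\,\lambda^{k}}{k!}
```

The Poisson tail falls off much faster than any power of k, so it is the slowly decaying power-law tail that makes heavily linked pages turn up far more often than random linking would predict.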

Earlier research had also suggested that two randomly chosen documents on the Web are, on average, only 19 clicks away from each other. The IBM study, conducted on a larger sample of the Web, revealed some subtleties. Significant portions of the Web cannot be reached at all from other significant portions. Moreover, even when two pages can be bridged, the connecting path often runs through hundreds of intermediate pages.
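
Counting "clicks" between two pages is a shortest-path computation on the directed link graph. The breadth-first search sketched below, run on a made-up graph, returns the minimum number of links to follow from one page to another, or reports that no path exists, which is exactly the situation the IBM study found for large portions of the Web.

```python
from collections import deque

def clicks_between(graph, start, goal):
    """Fewest links to follow from start to goal; None if goal is unreachable."""
    if start == goal:
        return 0
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        page, dist = frontier.popleft()
        for neighbor in graph.get(page, []):
            if neighbor == goal:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, dist + 1))
    return None  # no directed path from start to goal

# Made-up link graph: "e" links into the rest but cannot be reached from "a".
graph = {"a": ["b"], "b": ["c"], "c": ["d"], "d": [], "e": ["a"]}
print(clicks_between(graph, "a", "d"))  # 3 clicks
print(clicks_between(graph, "a", "e"))  # None: unreachable
```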

“In a sense, the Web is much like a complicated organism, in which the local structure on a microscopic scale looks very regular (like a biological cell), but the global structure exhibits interesting morphological structures (body and limbs) that are not obviously evident in the local structure,” Ravi Kumar of IBM and his coworkers concluded in a paper presented in 2000 at the Ninth World Wide Web Conference.

The effort to amass data about the structure and content of the rapidly growing Web didn't end there. The crawling has continued, and it now covers about half of the Web, taking in much “informal” communication, such as Web logs, newsgroups, and chat rooms. The resulting panoply of data has become the basis of an ambitious commercial service, called WebFountain, that IBM recently launched.

IBM’s supercomputer setup can process about 14,000 Web pages per second. The system reads each page, extracts its content, then automatically annotates the material. The tagged pages, often many times the length of the originals, go into a huge data storage array. About 3 billion pages (0.5 petabyte of data) are already in the system.
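
IBM has not published the details of its annotators, so the sketch below is only a schematic stand-in for the read-extract-annotate-store cycle described above: it strips markup from a page, attaches a few crude tags (the keyword-matched topic categories are hypothetical), and emits a record ready to be stored for later mining.

```python
import json
import re

# Hypothetical keyword lists; real annotators use far richer linguistic analysis.
TOPIC_KEYWORDS = {
    "automobile": ["sedan", "coupe", "horsepower"],
    "animal": ["cat", "prey", "habitat"],
}

def extract_text(html):
    """Crude content extraction: drop tags and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()

def annotate(url, html):
    """Produce a tagged record for one crawled page."""
    text = extract_text(html)
    tags = [topic for topic, words in TOPIC_KEYWORDS.items()
            if any(w in text.lower() for w in words)]
    return {"url": url, "text": text, "topics": tags, "length_chars": len(text)}

page = "<html><body><h1>Jaguar</h1><p>A big cat whose prey includes deer.</p></body></html>"
record = json.dumps(annotate("http://example.org/jaguar", page))
print(record)  # the annotated record goes into storage for later analysis
```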

All this painstakingly labeled information is then available to anyone interested in looking for trends or other valuable insights into what’s going on. These users can deploy various software tools, including their own, to analyze the data and dig out relevant patterns and relationships.

A company, for example, might be able to check what sort of buzz a particular product is generating, then respond appropriately, says IBM’s Larry Proctor. He described the WebFountain project last month at a meeting of NFAIS (National Federation of Abstracting and Indexing Services) in Philadelphia (http://www.nfais.org/).

Indeed, IBM is working with a company called Factiva, which has licensed the WebFountain platform so that it can track corporate reputations and provide reports to clients spotlighting brand perceptions and industry trends.

“From the language typically used to describe a product, you can get a sense of how it’s doing,” Proctor said.
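
Proctor does not spell out how that sense is extracted. One simple stand-in for the idea is a lexicon-based tally like the toy sketch below, in which the word lists and the scoring rule are assumptions made for illustration rather than anything IBM or Factiva has described.

```python
# Toy lexicon-based gauge of product "buzz"; the word lists are invented.
POSITIVE = {"reliable", "love", "fast", "great"}
NEGATIVE = {"broken", "slow", "hate", "recall"}

def buzz_score(mentions):
    """Net count of positive minus negative words across a list of snippets."""
    score = 0
    for snippet in mentions:
        words = snippet.lower().split()
        score += sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return score

mentions = ["I love the new model, it is fast and great",
            "mine came back broken after the recall"]
print(buzz_score(mentions))  # 1: slightly more favorable than unfavorable language
```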

Both Google and WebFountain stemmed from academic research about text mining and the insight that the best way to find information is to focus on the biggest and most popular sites and Web pages. WebFountain goes one step further in trying to make sense of the pages themselves by tagging the information in a clear, consistent way. Any data miner that comes along now has a vast playing field on which to test its skill and prove its value.