Can your computer read a Web page without your help? Soon it might.
Tim Berners-Lee, the inventor of the Web format, and the organization that keeps the standards of the Web, the World Wide Web Consortium, have recently been promoting the idea of making the Web machine-readable, or a Web of data. What does that mean? After all, at least in one sense, the Web is already being read by a machine -- namely your own computer -- when you surf the Web.
At the International Semantic Web Conference, being held this week in Chantilly, Va., Dean Allemang, chief scientist at Semantic Web consulting firm TopQuadrant, offered a solid example of how a machine-readable Web would help us all, in theory anyway.
His example was work-related: booking hotels. Say you wanted to attend a conference at some out-of-town location. The conference site itself probably has a Web site.
You copy its physical address from its site and go to an online hotel broker site, such as Hotels.com, to find a nearby hotel. You search by entering that address into the search criteria to find hotels within a certain radius. Or you just get a list of hotels and go to a third Web site, a mapping site such as MapQuest, and enter the hotel addresses and the conference center address to see whether any hotel is close to the conference center.
In Allemang's view, this really is crazy. Why copy some information from one page and paste it to another, using the same computer? Why can't the computer itself do the work?
The trick would be to get all the sites to agree on how to represent an address, Allemang said. Then, the addresses can be passed from one site to the next through your browser, automatically, without you having to do anything. The mapping site could check your cache and list any addresses found there, offering you the option of mapping them.
Automating such a task (and the countless others we do by hand on our computers) is the point of creating a machine-readable Web. If computer programs can read the Web pages and carry out tasks, we won't have to.
Relational databases make the prospect feasible. With databases, you can structure data so each data element is slotted into a predictable location. You can query a database of personnel data to return the birth date of a particular person, because the row of data with that person's info has a column dedicated to the birth date.
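As a minimal sketch of that kind of lookup, the snippet below uses Python's built-in sqlite3 module. The table name, column names and the two personnel records are invented for illustration.

```python
import sqlite3

# Hypothetical personnel table; the names and dates are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE personnel (name TEXT, birth_date TEXT)")
conn.executemany(
    "INSERT INTO personnel VALUES (?, ?)",
    [("Alice Example", "1970-03-14"), ("Bob Example", "1982-11-02")],
)

def birth_date_of(name):
    """Return the birth_date column for the row matching name, or None."""
    row = conn.execute(
        "SELECT birth_date FROM personnel WHERE name = ?", (name,)
    ).fetchone()
    return row[0] if row else None

print(birth_date_of("Alice Example"))
```

The lookup works only because every row keeps the birth date in the same column.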
This approach wouldn't work so well for data beyond a single database, however. "The problem is that everyone assumes you will need to build a huge data warehouse, where everything can be compared. This will never happen," Allemang said. Another factor: On the Web, data is not structured in such a way that it can be retrieved with any consistency, and the vast number of people who design and maintain Web sites would not all agree on the same format for structuring data.
The answer the W3C has come up with takes the form of a set of interrelated standards that can be used to embed data on Web sites, as well as to interpret the data found there. One standard is the Resource Description Framework, or RDF. Another is the Web Ontology Language, or OWL.
RDF is a way of encoding data so that external IT systems, and through them a wider audience, can understand it. It is based on making associations. It describes data by breaking each data element into three nodes: a subject, a predicate and an object. For example, consider the fact that Yellowstone National Park offers camping. "Yellowstone" would be the subject, "offers" would be the predicate and "camping" would be the object. (All three elements get uniform resource identifiers, or globally recognized Internet addresses.)
A query against a Triple Store, which is what an RDF database is called, can link together disparate facts. If another triple, perhaps located in another Triple Store, contains the fact that Yellowstone contains the Mammoth Hot Springs, a single search across multiple Triple Stores can return both facts.
Additional standards can further refine the precision of the data definition. For instance, two parties can agree that the term "Yellowstone" refers to "Yellowstone National Park" by using a shared, controlled vocabulary, which can be referenced through Resource Description Framework Schema, or RDFS. RDFS also allows inferencing. In RDFS, you can state that Yellowstone is a type of national park. So a search for national parks that offer camping would return Yellowstone.
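One way to picture the Yellowstone triple is as a three-part record in which each part is a URI. The sketch below is plain Python, not an RDF library, and the example.org addresses are placeholders rather than identifiers anyone has actually published.

```python
from collections import namedtuple

# A triple is just three slots: subject, predicate, object.
Triple = namedtuple("Triple", ["subject", "predicate", "object"])

# Placeholder namespace (an assumption for illustration only).
BASE = "http://example.org/parks/"

fact = Triple(BASE + "Yellowstone", BASE + "offers", BASE + "Camping")

def local_name(uri):
    """Return the last path segment of a URI, for display only."""
    return uri.rsplit("/", 1)[-1]

print(" ".join(local_name(part) for part in fact))
```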
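A rough sketch of that cross-store search, with each store modeled as a Python set of tuples: real Triple Stores would be queried over a network, and the helper name facts_about is hypothetical.

```python
# Two hypothetical triple stores, each a set of (subject, predicate, object).
store_a = {("Yellowstone", "offers", "camping")}
store_b = {("Yellowstone", "contains", "Mammoth Hot Springs")}

def facts_about(subject, *stores):
    """Collect every triple with the given subject from all stores."""
    return sorted(
        triple
        for store in stores
        for triple in store
        if triple[0] == subject
    )

results = facts_about("Yellowstone", store_a, store_b)
for triple in results:
    print(triple)
```

Neither store knows about the other; the facts meet only because both use the same identifier for Yellowstone.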
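The sketch below imitates that search in plain Python. The "type" and "subClassOf" predicates stand in for their RDF and RDFS counterparts. The subclass statement and the second park are assumptions added to show an inference step, and are not part of Allemang's example.

```python
triples = {
    ("Yellowstone", "type", "NationalPark"),
    ("Yellowstone", "offers", "camping"),
    ("NationalPark", "subClassOf", "Park"),      # assumed for illustration
    ("ExampleStatePark", "type", "StatePark"),   # hypothetical park
    ("ExampleStatePark", "offers", "camping"),
}

def subclasses_of(cls):
    """Return cls plus every class declared, directly or not, as its subclass."""
    found = {cls}
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if p == "subClassOf" and o == current and s not in found:
                found.add(s)
                frontier.append(s)
    return found

def instances_offering(cls, service):
    """Find things typed as cls (or a subclass of it) that offer the service."""
    classes = subclasses_of(cls)
    typed = {s for s, p, o in triples if p == "type" and o in classes}
    return sorted(s for s in typed if (s, "offers", service) in triples)

print(instances_offering("NationalPark", "camping"))
print(instances_offering("Park", "camping"))
```

The second call finds Yellowstone even though no triple says outright that it is a Park; that is inferred from the subclass statement.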
Of course, the Interior Department could build a list of all the national parks and include which services each park offers. But with the semantic Web approach, such a single database would never be needed. The services for each park could maintain their own data, and the results could be compiled only when someone asks for a specific piece of data, Allemang pointed out. In essence, with RDF, a user can build a set of data from various sources on the Web that may not have been brought together before.
How do you use these triples? One way is through the query language for RDF, called SPARQL (an abbreviation for the humorously recursive SPARQL Protocol and RDF Query Language). With Structured Query Language (SQL), you can query multiple database tables through the JOIN operation. With a SPARQL query, you specify all the triples you would need, and the query engine will filter down to the answers that fit all of your criteria.
For instance, say you are looking for a four-star hotel in New York. You write a query that looks for triples specifying four-star hotels and triples specifying hotels in New York. The query engine would find all the triples for hotels in New York, as well as all the triples for four-star hotels, and filter the set down to four-star hotels in New York.
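The filtering step can be sketched in plain Python as the intersection of two pattern matches. The hotel names, ratings and predicate names are invented, and the SPARQL in the comment is an approximation with an assumed "ex:" prefix, not a query from Allemang's talk.

```python
# Hypothetical hotel triples; names and ratings are invented.
triples = {
    ("HotelA", "locatedIn", "New York"),
    ("HotelA", "starRating", 4),
    ("HotelB", "locatedIn", "New York"),
    ("HotelB", "starRating", 3),
    ("HotelC", "locatedIn", "Chicago"),
    ("HotelC", "starRating", 4),
}

# Roughly the SPARQL this mimics (prefix and property names are assumptions):
#   SELECT ?hotel WHERE {
#     ?hotel ex:locatedIn "New York" .
#     ?hotel ex:starRating 4 .
#   }

def subjects_matching(predicate, obj):
    """Return every subject that has the given predicate and object."""
    return {s for s, p, o in triples if p == predicate and o == obj}

in_new_york = subjects_matching("locatedIn", "New York")
four_star = subjects_matching("starRating", 4)

# Keep only the hotels that satisfy both patterns.
matches = sorted(in_new_york & four_star)
print(matches)
```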
Even more sophisticated interpretations of RDF Triples can be done through OWL.
The logical chain of reasoning within an RDF Triple is relatively static, and can vary according to who does the encoding. One triple may say that Yellowstone "offers" camping as a service, but another triple may state that camping "is offered by" Acadia National Park. While it may seem obvious to us that both Acadia and Yellowstone offer camping, it wouldn't be to the computer. A SPARQL query engine, perhaps one embedded in a Web application, could consult OWL and return both entries, though.
While the idea of a machine-readable Web sounds great, it still requires data holders to render their material in RDF, a tall order for already-overworked Web managers. But the benefits may be worth it: once online, data can be reused in ways that government managers may never have considered.
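OWL lets a vocabulary declare that one property is the inverse of another, which is one way such a mismatch could be reconciled. The plain-Python sketch below uses a small dictionary as a stand-in for that declaration; the predicate names and the helper functions are hypothetical.

```python
triples = {
    ("Yellowstone", "offers", "camping"),
    ("camping", "isOfferedBy", "Acadia"),
}

# Stand-in for an OWL inverse-property declaration (an assumption here).
inverse_of = {"offers": "isOfferedBy", "isOfferedBy": "offers"}

def with_inverses(facts):
    """Add the flipped form of every triple whose predicate has an inverse."""
    expanded = set(facts)
    for s, p, o in facts:
        if p in inverse_of:
            expanded.add((o, inverse_of[p], s))
    return expanded

def who_offers(service):
    """List every subject that offers the service, after expansion."""
    return sorted(
        s for s, p, o in with_inverses(triples)
        if p == "offers" and o == service
    )

print(who_offers("camping"))
```

Without the expansion step, a search on "offers" alone would find only Yellowstone.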