NARA conference demonstrates emulation technologies
Researchers are working on techniques for extracting data locked in obsolete formats and file systems.
One of the most difficult problems facing agencies that archive electronic records is ensuring that the devices of the future will still be able to read today's files.
Researchers are working on new techniques for extracting data locked in obsolete formats and file systems, as evidenced by a number of demonstrations at a recent National Archives and Records Administration symposium.
Robert Wilensky, a professor of computer science at the University of California, Berkeley, showed off software that could be the basis of a universal document viewer, one that could display or even modify any document, regardless of its format. Researchers at Berkeley produced the software, called Multivalent, as part of a $1 million National Science Foundation Digital Libraries Initiative Phase II award. Development took one year's worth of programmer time, according to Wilensky.
The program itself is an empty shell that holds modules, each of which reads a particular format. As new formats emerge, developers can write adapters that let the platform read documents in those formats. The program can now read HTML pages, Portable Document Format files, plain text documents and documents encoded in the TeX mathematical typesetting format. Readers can also annotate documents by using another add-on module.
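The shell-plus-adapters design described above can be illustrated with a minimal sketch. This is not Multivalent's code, which is written in Java; the registry, the `register` decorator and the `open_document` function are all hypothetical names chosen for illustration, and the HTML adapter only pulls out text rather than rendering a page.

```python
from html.parser import HTMLParser
from typing import Callable, Dict

# Hypothetical registry: format name -> adapter that turns raw bytes into text.
ADAPTERS: Dict[str, Callable[[bytes], str]] = {}


def register(fmt: str):
    """Decorator that installs an adapter for one format in the empty shell."""
    def wrap(adapter: Callable[[bytes], str]) -> Callable[[bytes], str]:
        ADAPTERS[fmt] = adapter
        return adapter
    return wrap


@register("txt")
def read_plain_text(raw: bytes) -> str:
    # Plain text needs no interpretation beyond decoding.
    return raw.decode("utf-8")


class _TextExtractor(HTMLParser):
    """Collects the text between HTML tags and ignores the tags themselves."""
    def __init__(self) -> None:
        super().__init__()
        self.parts = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


@register("html")
def read_html(raw: bytes) -> str:
    parser = _TextExtractor()
    parser.feed(raw.decode("utf-8"))
    parser.close()
    return "".join(parser.parts)


def open_document(fmt: str, raw: bytes) -> str:
    """The shell itself: it knows no formats, only how to find an adapter."""
    try:
        adapter = ADAPTERS[fmt]
    except KeyError:
        raise ValueError(f"no adapter installed for format {fmt!r}")
    return adapter(raw)
```

In this sketch, supporting a new format means writing one more function and registering it; `open_document` itself never changes.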
To ensure that Multivalent would run on most platforms, developers wrote the program in Java. Because it runs on top of the Java Virtual Machine, it can work without modification on any operating system that supports Java. Wilensky said the team chose Java because of its current popularity with programmers as a cross-platform language.
James Myers, an Energy Department chief scientist at the Pacific Northwest National Laboratory, demonstrated a specification he is helping to write called the Data Format Description Language, or DFDL. Formatted in Extensible Markup Language, a DFDL description could spell out how to extract data from a binary file without assistance from the program that created the file.
A DFDL description could be attached to each archived document. The description would contain instructions on how to interpret the raw bits to retrieve the information the document holds. A software parser can read the DFDL description and then extract the appropriate information from the file. The description could cover all the information within a document, or only certain fields.
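The idea of a description that drives a generic parser can be sketched in a few lines. The XML vocabulary below is invented for illustration and is not actual DFDL syntax; the field names, the `byteOrder` attribute and the `extract` function are likewise assumptions. The sketch handles only a flat sequence of fixed-size numeric fields.

```python
import struct
import xml.etree.ElementTree as ET

# Hypothetical description of a binary record (NOT real DFDL syntax).
DESCRIPTION = """
<format byteOrder="big">
  <field name="station_id" type="uint16"/>
  <field name="reading_count" type="uint32"/>
  <field name="temperature" type="float32"/>
</format>
"""

# Maps the illustrative type names to struct format codes.
_TYPE_CODES = {"uint16": "H", "uint32": "I", "float32": "f"}


def extract(description_xml, raw, wanted=None):
    """Read fields from raw bytes as directed by the description.

    If `wanted` is given, only those field names are returned; the other
    fields are skipped over but still advance the read position.
    """
    root = ET.fromstring(description_xml)
    prefix = ">" if root.get("byteOrder") == "big" else "<"
    offset = 0
    result = {}
    for field in root.findall("field"):
        code = prefix + _TYPE_CODES[field.get("type")]
        name = field.get("name")
        if wanted is None or name in wanted:
            (value,) = struct.unpack_from(code, raw, offset)
            result[name] = value
        offset += struct.calcsize(code)
    return result
```

Calling `extract(DESCRIPTION, raw)` returns every field, while `extract(DESCRIPTION, raw, wanted={"temperature"})` returns only that one, mirroring the choice between extracting all of a document's information or only certain fields. The parser needs no knowledge of the program that wrote the bytes, only the description.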
Myers said this approach could reduce the worry of picking the correct standards for long-term archiving.
"There are a lot of meetings to try to figure out a standard to use. But who knows what the data will be used for?" he said.
One symposium attendee noted that this approach might work well for mathematical data, such as numerical arrays produced by scientists. But it may not work as well for images or motion pictures, because it would be harder to pinpoint which bits in the binary file hold discrete elements such as faces or objects.