Connecting state and local government leaders
Alternatives to traditional databases can deliver speed and simplicity.
SCRATCH THE SURFACE of any agency computer network and you will find a relational database, whether it's Oracle 11g, Microsoft SQL Server, IBM DB/2, Sun MySQL or any number of others. Databases, a mature technology, do what they do very well.
However, researchers are finding that for certain tasks, alterations of the basic relational database management system can lead to big improvements.
For example, the project team behind the General Services Administration's GSA Advantage Spend Analysis program found it could speed the response time to queries fivefold by looking beyond the realm of standard relational databases.
Instead of using a standard commercial database that arranged data in rows, the agency went with a design that compiled data into columns, said Keith Machen, operations lead for the GSA Advantage data warehouse, which offers account summaries and other reports to customers. With GSA Advantage's user-base at about 350,000, the warehouse had 20 million rows of data ' an enormous amount.
'With all the different columns and the associations with product attributes, a table scan on this could kill anything,' Machen said.
The column-oriented approach speeds queries because it cuts out a lot of extraneous information that does not need to be loaded into memory, said Richard Pledereder, vice president of engineering at Sybase.
GSA used Sybase IQ 12.7 for the task.
When the Spend Analysis project team tested a standard online transaction processing connection to fulfill requests, the connection worked well enough. But when the team tested the same dataset on a column-oriented database ' a variation of one tweaked for reporting ' the team got faster responses times. And reports scheduled to run at night, to avoid bogging down the system, could be completed almost instantly.
Database basics ' and beyond
A database differs from a plain-text file or spreadsheet in two primary respects, said Stef Damianakis, chief executive officer at Netrics, which offers database query tools for finding misspelled or badly formatted data.
As its name suggests, a relational database gets its power from the implicit and explicit relationships among tables of data, Damianakis said. A relational database assumes that a data element has uniform relationships to all elements in the same row and with its neighbors up and down the column where it resides. If, in a single row, we see a person's last name and a ZIP code, we can assume the two have a relationship to one another ' the person lives in that ZIP code. We can also assume that because that ZIP code is in a particular column, its neighbors up and down the column are also ZIP codes. Such uniformity provides the basis for powerful query capabilities.
Another defining characteristic of databases, Damianakis said, is what is called atomicity with managing the data. Each transaction made to a database is carried out step by step, so that if a transaction fails halfway through, the database system can retrace its steps and roll back to its earlier, uncorrupted and state. However, some organizations have found that altering that basic relational approach can be better suited to some customers' needs. Some approaches involve selecting only portions of the data to summarize. Others get rid of relational tables altogether to speed performance. All trade some functionality for speed, although in certain circumstances, the increase in speed is significant.
For example, consider InterSystems' Cach' database. In October 2007, the Veterans Affairs Department awarded Points Technology a contract to outfit Cach' for the agency's electronic health record system.
Cach' is a hybrid database. It responds to Structured Query Language queries, as does any other relational database. But it is also an object-oriented database, meaning data can be fetched through system calls to object classes, said Paul Grabscheid, InterSystems' vice president of strategic planning.
There are a number of reasons an object- oriented database might be prefer-able for this type of work, Grabscheid said. For example, VA's medical records are numerous and complicated, bulging with many fields of patient information.
Most programmers are more familiar with objects than tables, so the idea of pulling data through an object call would be more intuitive. And the call itself can be more efficient, because the query is not filtered through SQL calls.
Although Cach' might not be ideally suited to heavily transactional processing, its approach can help power business inquiries, said Bob Guajardo, an Apptis senior technical director working on the data warehouse. 'It is a very good option for very large databases with a high number of reads to writes,' he said.
Like object-oriented databases, columncentric databases, such as Sybase IQ, are particularly useful for data warehouse and business reporting duties, where the number of reads easily outstrip the writes, Pledereder said. With most databases, each entry is given its own row. In many cases, though, each entry could have dozens of fields. Loading a dataset with all these fields into memory to extract one bit of information is resource- and time-intensive. A database that only loads the relevant columns into memory can consume less space and be easier to query.
Scalable databases
If your dataset is getting large, you might want to mimic one of the world's largest holders of information ' Google. With hundreds of thousands of servers that deliver Web search results, Google's engineers have learned a thing or two about scalable databases. In fact, the company built a set of technologies to make sense of the billions of Web pages it indexes.
At the recent Open Source Convention (OSCON), Doug Judd of Zvents talked about his efforts to replicate one of those technologies, Google's BigTable, using open-source technologies. Judd and his team are developing an open-source database called Hypertable, which could allow other organizations to enjoy the same massive scaling as Google when it comes to database searching.
Judd said BigTable is not a relational database. It treats data as uninterpreted strings. For example, Google uses BigTable to store entire Web pages, with the URL providing the key. Running on the Google File System, BigTable can easily be distributed across many servers, with additional servers added as the demand or the data grows.
Now available as a pre-beta alpha release, Hypertable has a single 2-D table in which each row could be extended indefinitely. The approach is highly scalable and can offer a fast rate of random inserts, updates and deletions, Judd said. To increase throughput, the database can be extended across a cluster of servers. In a performance test of more than 7 million records, comprising more than a 1T of storage, a Hypertable setup could insert more than 1 million new entries a minute.
Storing documents
If talk of massive scalability and throughput seems intimidating, you might want to look in another direction ' simplicity. Some developers are building simple nonrelational databases for simply storing documents. At OSCON, open-source software consultant Jan Lehnardt presented CouchDB, an open-source nonrelational database for storing documents, one especially suited for supporting high-traffic Web sites.
Lehnardt summarized CouchDB ' funded by IBM and still in alpha release ' as a database for administrators who don't use databases. CouchDB has no fixed schema. Instead, each document has a unique ID number. Previous versions of the document have different numbers. The documents are stored as strings in the JavaScript Object Notation message exchange format, so they can be called up easily and quickly. Documents can even be fetched through simple HTTP commands directly from the browser. To speed searches, CouchDB uses another Google technology called MapReduce, a technique that involves numerically summarizing each document so similar documents can be easily grouped.
A lot of 'today's databases were hacked into placed because they weren't designed to do what they [now] do,' Lehnardt said in his presentation. As a result, administrators must spend time wrangling relational databases into doing tasks they weren't suited for. But with companies such as Sybase and InterSystems and researchers like Lehnardt, agencies could have more options when it comes to storing data.
NEXT STORY: Oracle Fusion mired in delays