The database of databases
The Internet and IP-based networks have gone a long way toward removing barriers between agencies. To further a spirit of cooperation, agencies are also beginning to share information from disparate databases by sending some of that data to common data warehouses, where it can be merged, queried and analyzed.
The Transportation Department's Bureau
of Transportation Statistics, for example,
has created the TranStats portal
(www.transtats.bts.gov).
It lets researchers, analysts or citizens view and search data, run reports and generate graphics from 400 transportation-related tables in more than 100 databases (one of them with more than 250 million rows) kept by different departments and agencies.
'The Bureau of Transportation Statistics
has three primary goals for
TranStats,' said TranStats data team
manager Cheryl Young. 'The first is to assist
public users in obtaining data easily
and efficiently, the second is to support
public and internal users who perform
online analysis and the third is to offer
data to internal users to perform advanced
statistical analysis using third-party tools.'
But setting up a data warehouse is more
involved than building a traditional database.
'We learned that data collection and
presentation involves much more than
technology,' Young said. 'It is difficult to
change the concept of data stewardship.
We would like to have focused much earlier
on this issue so that the technical
progress can follow.'
Common characteristics
Although the two share many characteristics, a data warehouse differs in design and function from a production database.
The primary difference is whether the
system will be used for online transaction
processing or online analytic processing.
Most production databases are primarily
for OLTP and require high-speed access to
individual data records. A motor vehicles
department, for example, would need an
OLTP database to let clerks quickly access
the record of an individual car, determine if
a registration is current and update data.
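To make the distinction concrete, here is a minimal sketch using Python's built-in sqlite3 module and a hypothetical registration table (not any agency's actual schema): the OLTP workload is a keyed lookup and update of one record, while the OLAP workload scans and aggregates many rows.

```python
# Minimal sketch: contrast an OLTP-style point lookup with an OLAP-style
# aggregate scan. Table and column names are hypothetical illustrations.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE registrations (
        plate TEXT PRIMARY KEY,     -- indexed for fast single-record access
        owner TEXT,
        county TEXT,
        expires TEXT                -- ISO date, e.g. '2006-07-31'
    )
""")
conn.executemany(
    "INSERT INTO registrations VALUES (?, ?, ?, ?)",
    [("ABC123", "J. Smith", "Sonoma", "2006-07-31"),
     ("XYZ789", "M. Jones", "Marin",  "2005-01-15")],
)

# OLTP: a clerk pulls up one car's record and updates it -- fast keyed access.
row = conn.execute(
    "SELECT owner, expires FROM registrations WHERE plate = ?", ("ABC123",)
).fetchone()
conn.execute(
    "UPDATE registrations SET expires = ? WHERE plate = ?", ("2007-07-31", "ABC123")
)

# OLAP: an analyst scans and aggregates many rows -- throughput over latency.
report = conn.execute(
    "SELECT county, COUNT(*) AS expired "
    "FROM registrations WHERE expires < '2006-01-01' GROUP BY county"
).fetchall()
print(row, report)
```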
The TranStats data warehouse, on the
other hand, requires a database management
system designed for OLAP. This affects
the hardware needed to support the
data warehouse, as well as the DBMS itself.
TranStats runs on four Sun Solaris
servers and a 2.5TB Sybase IQ Database
Management System from Sybase Inc.
Young said query speed, not transaction
speed, led to selecting Sybase for this application.
'Our data changes monthly or quarterly,
but our queries must be performed within
a split second,' Young said. 'Sybase IQ is
much faster than other databases and the
only one that can meet our dynamic Web
presentation criteria.'
In addition, data warehouses extract information
from a number of databases, so
they tend to be much larger than any one
of their sources.
'Scalability becomes a bigger issue with
a data warehouse, as does the way in
which the DBMS implements parallel
queries because of the heavy query nature
of the data warehouse,' said Craig Mullins,
consultant and author of Database Administration:
The Complete Guide to Practices
and Procedures (Addison-Wesley
Professional, 2002).
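That query-heavy, partitioned workload is what makes parallel query support matter. The sketch below, using only Python's standard library and made-up data, shows the scan-each-partition-then-merge pattern a parallel aggregate follows; it illustrates the pattern, not any real DBMS's parallel engine.

```python
# Minimal sketch of the partition-and-combine pattern behind parallel warehouse
# queries: each worker scans its own partition and the partial aggregates are
# merged at the end. Data and names are illustrative, not from the article.
import sqlite3
from concurrent.futures import ThreadPoolExecutor

# Pretend the fact table is split into partitions (real warehouses partition by
# date, region, etc.); here each partition is just a list of (mode, count) rows.
partitions = [
    [("air", 120), ("rail", 80)],
    [("air", 300), ("road", 50)],
    [("rail", 40), ("road", 90)],
]

def partial_aggregate(rows):
    """Scan one partition and return its partial GROUP BY result."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE trips (mode TEXT, trip_count INTEGER)")
    db.executemany("INSERT INTO trips VALUES (?, ?)", rows)
    return db.execute(
        "SELECT mode, SUM(trip_count) FROM trips GROUP BY mode"
    ).fetchall()

totals = {}
with ThreadPoolExecutor() as pool:
    for partial in pool.map(partial_aggregate, partitions):
        for mode, subtotal in partial:          # merge the partial results
            totals[mode] = totals.get(mode, 0) + subtotal

print(totals)   # {'air': 420, 'rail': 120, 'road': 140}
```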
Data mart or data warehouse?
There are two main approaches an organization
could take in setting up a data warehouse.
Bill Inmon, considered the father of
data warehousing, favors a top-down approach.
The organization decides what the central warehouse will contain
and how that data will be used. Lower-level
analytic databases, or 'data marts,' then
draw their data from the central warehouse.
This method ensures consistency of
data throughout the organization.
The other method is a bottom-up approach.
The data marts are set up first to
meet specific business needs, and these feed data into the warehouse. This allows
the organization to start realizing a return
sooner. Whichever method is chosen, the
overall data warehouse design should be
decided early in the process.
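As a rough illustration of the top-down flow Inmon favors, the sketch below (Python's sqlite3 with hypothetical tables) builds a department-level data mart as a view drawn from a central warehouse table, so the mart's figures stay consistent with the organization-wide data.

```python
# Minimal sketch of the top-down flow: a central warehouse table feeds a
# narrower, department-level data mart (here, just an aggregate view).
# Schema and data are hypothetical illustrations.
import sqlite3

db = sqlite3.connect(":memory:")

# Central warehouse: detailed, organization-wide facts.
db.execute("""
    CREATE TABLE warehouse_inspections (
        agency TEXT, site TEXT, inspect_date TEXT, result TEXT
    )
""")
db.executemany(
    "INSERT INTO warehouse_inspections VALUES (?, ?, ?, ?)",
    [("agriculture", "farm-12", "2005-03-01", "pass"),
     ("agriculture", "farm-07", "2005-03-02", "fail"),
     ("building",    "lot-88",  "2005-03-02", "pass")],
)

# Data mart: one agency's analytic view, drawn from the central warehouse so
# every department reports from the same consistent figures.
db.execute("""
    CREATE VIEW agriculture_mart AS
    SELECT inspect_date, result, COUNT(*) AS n
    FROM warehouse_inspections
    WHERE agency = 'agriculture'
    GROUP BY inspect_date, result
""")
print(db.execute("SELECT * FROM agriculture_mart").fetchall())
```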
'In setting up a data warehouse, it is important
to have the data model correct
right from the get-go,' said Nathaniel
Booth, IT specialist at the Geological Survey
Data Center in Middleton, Wis. 'You
have piles of applications that will depend
on that data going forward, so it is really
hard to change later.'
USGS hired a consultant for the initial
design of the data warehouse and early
business intelligence deployments. In addition
to settling on the data format, one
must also select the DBMS. The decision is
complicated by the fact that, in most cases,
the organization already has some sort of
DBMS in place, and any new system must
be compatible with what already exists.
The new system must also interoperate with systems at any other agency that will feed data into the data warehouse.
'Assuming you already have a database,
which almost everyone in government
has, just buy more of what you have and
do a sole-source acquisition,' advises John
Kost, analyst for Gartner Inc. of Stamford,
Conn. 'The risk in any procurement is
[that] you are going to end up buying
something you hadn't expected and don't
have the internal skill sets to support.'
That was the approach USGS took in setting
up a central data warehouse for collecting
water quality data. Water doesn't
respect legislated boundaries. A pollutant
dumped in a stream in one city will flow
quickly into another municipality's water
system without stopping at the city limits,
finally winding up in a lake or ocean.
Consequently, USGS takes data streams
from state, local and federal agencies nationwide
and aggregates them in Oracle
databases at its data center. Researchers
can then conduct browser-based queries
through a flexible interface.
The largest database contains the National
Water Quality Assessment, with
more than 11 million water and biological
test results.
'We stayed with Oracle because of its
support for a wide array of functionality
including Oracle spatial, text search and
indexing,' Booth said. 'We had to bring in
data from SQL Server and Ingres databases
as well as text data.'
The data center started with Oracle 7.3
and now runs three instances of Oracle's
10g database on Red Hat Linux.
'We try to stay on the most recent version
of the database,' Booth said, though
this sometimes introduces problems.
'Sometimes the new version doesn't support
the applications you are running,
and previous queries don't always work
right, so just dealing with the change
management through the versions of the
database is somewhat of a challenge.'
He also said ensuring all the data follows
a consistent format is an ongoing effort.
'Vendors still haven't come to us with a
standardized metadata
repository, so we have a
difficult time having to
manage the metadata
about multiple products,'
he said. 'In hindsight, it
is nice to pick ones that
play well with metadata.'
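Absent a vendor-supplied metadata repository, one stopgap is a small home-grown catalog that records where each feed comes from, what units and formats it uses, and when it last refreshed. The sketch below is purely illustrative and not the USGS data center's actual approach.

```python
# Minimal sketch of a home-grown metadata catalog: one table describing each
# incoming feed (source system, units, format, last refresh). Names and values
# are illustrative only.
import sqlite3

catalog = sqlite3.connect(":memory:")
catalog.execute("""
    CREATE TABLE feed_metadata (
        feed_name     TEXT PRIMARY KEY,
        source_system TEXT,     -- e.g. 'SQL Server', 'Ingres', 'flat text'
        units         TEXT,     -- e.g. 'mg/L' for a water-quality measure
        date_format   TEXT,
        last_refresh  TEXT
    )
""")
catalog.executemany(
    "INSERT INTO feed_metadata VALUES (?, ?, ?, ?, ?)",
    [("state_water_quality", "SQL Server", "mg/L", "YYYY-MM-DD", "2005-06-01"),
     ("county_biological", "Ingres", "count/sample", "MM/DD/YYYY", "2005-05-15")],
)

# Before loading a feed, a load job can check that the feed's declared units
# and date format match what the warehouse expects, instead of finding drift later.
row = catalog.execute(
    "SELECT units, date_format FROM feed_metadata WHERE feed_name = ?",
    ("state_water_quality",),
).fetchone()
print(row)
```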
However, when one is depending on an
array of entities with their own legacy systems
and unique data needs, managing
the data warehouse means constantly negotiating
with all interested parties.
Sonoma County, Calif., for example, set
up a geographic information system data
warehouse on IBM DB2 for use by
all agencies, including agricultural inspectors,
tax assessors, the building department
and emergency services. The
county had already been using ESRI
software for some of its applications, but
there was no standard across the county.
'We were looking for a central repository
to share information, so people didn't have
to guess which version is most up-to-date,'
said GIS manager Tim Pudoff. 'By having
a central database where people can publish
their data, we can provide users with a
one-source shopping point.'
The GIS staff has real-time access to add
or update information in the warehouse,
and participating agencies periodically
send automatic or manual feeds from their
own servers. The system is overseen by a
technical advisory committee with representatives
from 16 agencies, who provide
requirements and buy-offs on any upgrades
or new services. In such a situation, Pudoff
emphasized, it is vital to have an executive
champion to help manage the competing
interests of the different parties involved.
'You need someone who can help you
wade through the bureaucratic
trials,' he said.
'I have that from my
manager, and it made all
the difference.'
Looking ahead
Selecting the right hardware and software for a data warehouse involves more than just looking at technical prowess or initial cost.
'It is almost irrelevant what the vendor
tells you the price is,' Kost cautions. 'You
need to figure out what are the component
parts of the implementation, including
your own costs.'
These components include the hardware, the services to set up the database and create links to other databases, applications or agencies, and ongoing support.
Although an initial database purchase
might be geared to a single use, Mullins
advised breaking out the hardware, software
and services components and examining
them separately, even if the entire
contract does wind up going to a single
vendor.
'I would split out the hardware and
programming services from the DBMS
unless this particular DBMS is going to
be used only by this one project,' he said.
'A DBMS is designed for sharing data
across multiple applications, and you
don't have to always have a brand-new
DBMS for every application.'
In evaluating the DBMS, Mullins said to
look at the analytical and reporting features,
not the transaction speed.
'An RFP for a data warehouse should
focus on analytical processing instead of
transactional processing,' he said. 'The reporting
features embedded with the
DBMS become much more important, as
does the ability to manage very large
amounts of data. A DBMS with embedded
ETL [extract, transform, load] features will
also move to the head of the pack of data
warehouse RFPs, because warehousing
implementations require heavy data
movement from operational databases to
the warehouse.'
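To make the ETL acronym concrete, the toy pass below extracts rows from a stand-in operational database, transforms dates to the warehouse's conventions, and bulk-loads the result; table and column names are hypothetical.

```python
# Toy extract-transform-load pass: pull rows from an operational source,
# normalize them to the warehouse's conventions, and bulk-load the result.
# All names are illustrative.
import sqlite3
from datetime import datetime

source = sqlite3.connect(":memory:")      # stand-in for an operational database
source.execute("CREATE TABLE readings (station TEXT, sampled TEXT, nitrate_ppm REAL)")
source.executemany("INSERT INTO readings VALUES (?, ?, ?)",
                   [("ST-01", "06/01/2005", 2.4), ("ST-02", "06/02/2005", 1.1)])

warehouse = sqlite3.connect(":memory:")   # stand-in for the warehouse
warehouse.execute(
    "CREATE TABLE fact_readings (station TEXT, sample_date TEXT, nitrate_mg_l REAL)"
)

# Extract: read the operational rows.
rows = source.execute("SELECT station, sampled, nitrate_ppm FROM readings").fetchall()

# Transform: standardize the date format; for dilute water samples, ppm is
# effectively the same as mg/L, so the value carries over unchanged.
def transform(row):
    station, sampled, ppm = row
    iso_date = datetime.strptime(sampled, "%m/%d/%Y").date().isoformat()
    return station, iso_date, ppm

# Load: bulk-insert the transformed rows into the warehouse fact table.
warehouse.executemany(
    "INSERT INTO fact_readings VALUES (?, ?, ?)", [transform(r) for r in rows]
)
print(warehouse.execute("SELECT * FROM fact_readings").fetchall())
```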
One also has to consider who is going to
support it once the consultants wrap up
the project, which was one factor in USGS'
decision to stick with Oracle rather than a
product with a smaller installed base and
local support network.
'For us, it is important to have face time
with the consultants,' Booth said. 'It is
difficult, when you are not in a major metropolitan
area, to find skills in these very
specific technologies.'
Drew Robb is a freelance technology writer in Los Angeles.