How a few well-placed tweaks can greatly improve database performance.
In the world of databases, a few small tricks can add up to big performance gains, as the data warehouse managers at the Veterans Affairs Department discovered.
To better understand how the soon-to-be-released Microsoft SQL Server 2008 operates, the department had been running an advance copy of the database management software in a production environment for moving VA clinical transaction data into a data warehouse.
The management team found that one of the newer features in SQL Server 2008 could cut the time queries take to run. The feature, called partitioning, lets users divide a database table into smaller units. In VA's case, a large table of information could be divided into smaller partitions, one for each year. So when a user queries all the records for just last year, the results come back more quickly because the database engine does not have to scan every record in the table, said VA data warehouse manager Jack Bates.
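A minimal sketch of what that kind of yearly partitioning looks like in SQL Server 2008's Transact-SQL follows; the table and column names are hypothetical, not VA's actual schema.

    -- Boundary dates split the rows into one partition per year (hypothetical years).
    CREATE PARTITION FUNCTION pf_by_year (datetime)
        AS RANGE RIGHT FOR VALUES ('2006-01-01', '2007-01-01', '2008-01-01');

    -- Map every partition to a filegroup (all to PRIMARY here, for simplicity).
    CREATE PARTITION SCHEME ps_by_year
        AS PARTITION pf_by_year ALL TO ([PRIMARY]);

    -- The table is created on the partition scheme, keyed on the transaction date.
    CREATE TABLE clinical_transactions (
        transaction_id   bigint       NOT NULL,
        transaction_date datetime     NOT NULL,
        details          varchar(500) NULL
    ) ON ps_by_year (transaction_date);

    -- A query limited to one year now touches only that year's partition.
    SELECT COUNT(*)
    FROM clinical_transactions
    WHERE transaction_date >= '2007-01-01' AND transaction_date < '2008-01-01';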
It's a small tweak, but it has an impact. Whether you're running a single instance of MySQL to keep some records for a small department or overseeing a server farm humming with Oracle 11g databases, you probably could benefit from tweaking your database system.
It's a neglected art form. When managers complain that their applications are running too slowly, a complaint heard all too often, the first place administrators look is the servers. However, 'usually server tuning is not the answer,' said Sheeri Cabral, a database administrator at remote database management provider Pythian Group.
Instead, she suggested, performance gains could be had by examining and improving how database queries are structured and how the database schema was designed. Small changes can lead to improvements in speed and accuracy.
'Generally, most performance issues come back to good design,' agreed New York-based database consultant Ronald Bradford.
Be normal, or not
When getting a database in order, one of the first steps is normalization: paring duplicated data down to a minimum.
At the O'Reilly Open Source conference last month in Portland, Ore., Mike Hillyer, a senior sales engineer at e-mail system provider Message Systems, explained the concept of normalization.
Especially in small offices, databases start as large spreadsheets. In database terms, they are supertables, in which all the information is captured, one line per entry, on a single table. Having all the data on one page is handy at first, but as the data grows, trying to extract and aggregate useful information from a supertable becomes onerous.
Although normalization does not improve performance per se, it can improve the quality of data and clean up the design of the database, which, in turn, can speed searches.
With normalization, each entry or row of data has a primary key, a unique value that distinguishes that row from all the others. More important, a normalized database has little duplicative data, or items that appear more than once.
For instance, in a flat table of books and their authors, an author who has written more than one book means the same author details repeat across rows, which is called vertical redundancy. A book with more than one author forces a duplicate set of columns, which is called horizontal redundancy.
Such redundancies can cause trouble. If a name is entered more than once, it might be spelled incorrectly at one point, causing it to be missed by searches. When an address needs to be changed, it must be changed in multiple locations. Horizontal redundancies can balloon the size of the database.
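To picture both kinds of redundancy, here is a hypothetical flat books table, a small sketch rather than anything from the article's sources:

    -- A flat "supertable": author details repeat on every row for that author
    -- (vertical redundancy), and a second author needs extra columns
    -- (horizontal redundancy) that sit empty for most books.
    CREATE TABLE books_flat (
        title           varchar(200),
        author1_name    varchar(100),
        author1_address varchar(200),
        author2_name    varchar(100),
        author2_address varchar(200)
    );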
Hillyer recounted the story of working for a political campaign that kept a database of about 150,000 voters and addresses. Each time they ran surveys, they added new columns for each question into the database. 'We were afraid we'd run out of columns in the table,' he said. Moreover, most space in the database went unused. A survey might have gone to 500 people, but the column for the answers was added to all 150,000 entries. 'So 149,500 rows had columns that would never be used,' he said.
Redundancy is minimized by creating additional tables. Instead of entering a single name or address multiple times, enter it once in one table, and refer to that table whenever the information is needed. Likewise, all the authors could be given a table of their own, and the entry for each book could simply have pointers to all of its authors.
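A normalized version of that hypothetical books example might look like the sketch below: each author is stored once, and a small linking table points each book to all of its authors.

    -- Each author appears exactly once.
    CREATE TABLE authors (
        author_id int PRIMARY KEY,
        name      varchar(100),
        address   varchar(200)
    );

    -- Each book appears exactly once.
    CREATE TABLE books (
        book_id int PRIMARY KEY,
        title   varchar(200)
    );

    -- The linking table connects any book to any number of authors.
    CREATE TABLE book_authors (
        book_id   int,
        author_id int,
        PRIMARY KEY (book_id, author_id),  -- a composite key: one row per book-author pair
        FOREIGN KEY (book_id)   REFERENCES books (book_id),
        FOREIGN KEY (author_id) REFERENCES authors (author_id)
    );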
The next stage of normalization involves arranging the tables so that the items in each entry form a composite key, meaning no entries carry duplicate data. 'Every piece of data in the table must relate to the composite primary key in order to be in that table,' Hillyer said. 'Otherwise, it should be somewhere else.'
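To continue the hypothetical example: if the book_authors linking table above also carried the author's address, that column would depend on only part of the composite key and, by Hillyer's rule, would belong in the authors table instead.

    -- Breaks the rule: author_address depends only on author_id,
    -- not on the full (book_id, author_id) composite key.
    CREATE TABLE book_authors_bad (
        book_id        int,
        author_id      int,
        author_address varchar(200),
        PRIMARY KEY (book_id, author_id)
    );
    -- The fix is to drop author_address here and keep it in the authors table.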
The third form of normalization is trickier, but Hillyer said that if you complete the first two forms, the third should take care of itself. In this form, every field in a table must depend on the primary key and nothing else. For example, if you are capturing an address in a database table, you don't need the name of the state in that table. You only need the address and the ZIP code. States can be determined from ZIP codes, and the relationship between the two can be captured in another table.
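That ZIP code example, sketched with hypothetical table names, looks like this:

    -- The state is determined by the ZIP code, so it lives in a lookup table.
    CREATE TABLE zip_codes (
        zip   char(5) PRIMARY KEY,
        state char(2)
    );

    -- The address table keeps only what depends directly on its own key.
    CREATE TABLE addresses (
        address_id int PRIMARY KEY,
        street     varchar(200),
        city       varchar(100),
        zip        char(5),
        FOREIGN KEY (zip) REFERENCES zip_codes (zip)
    );

    -- The state comes back through a join whenever it is needed.
    SELECT a.street, a.city, z.state, a.zip
    FROM addresses a
    JOIN zip_codes z ON z.zip = a.zip;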
On the other hand, too much normalization can slow database response, Cabral said.
She offered the example of a bank account. By strict rules of normalization, the balance of a checking account should not appear anywhere in a database. After all, it can be calculated by adding together all the entries in the account. Such a procedure, however, would require a lot of resources and time whenever a customer needed a balance. It is easier to add an entry that keeps the running total of the account, even though doing so flouts the design laws of normalization.
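In SQL terms, the two approaches look roughly like the hypothetical queries below: the strictly normalized one recomputes the balance from every transaction, while the denormalized one simply reads a stored running total.

    -- Strictly normalized: the balance is recomputed from every transaction, every time.
    SELECT SUM(amount) AS balance
    FROM transactions
    WHERE account_id = 1001;

    -- Denormalized shortcut: a running total stored on the account row is read back
    -- directly, at the cost of keeping it in sync with the transactions.
    SELECT balance
    FROM accounts
    WHERE account_id = 1001;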
'There is a trade-off,' she said. 'The more data integrity you have, the more complexity you have. Are you willing to have some discrepancies in your data in order to be superfast?'
'Think about what your application is doing, and then start off with a normalized table structure,' Cabral said. 'Then go through that structure and denormalize those parts that will be slow.'
Query analysis
At the annual Usenix conference in Boston, Cabral led a session on database tuning. One technique she insisted on for speeding responses was query tuning.
Getting any information from a relational database requires a Structured Query Language query, a standard format for questioning a database. At its most basic, a query names the tables to be consulted, the columns to return, the criteria that matching rows must meet and, perhaps, some instructions on how to group the results.
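A minimal example of those parts, using a hypothetical table:

    SELECT state, COUNT(*) AS voters   -- the columns to return
    FROM voters                        -- the table to consult
    WHERE registered = 1               -- the criteria matching rows must meet
    GROUP BY state                     -- how to group the results
    ORDER BY voters DESC;              -- how to order them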
There are many ways to structure a query, but if you want a speedy response, experts say, write a query that creates as little work as possible for the database.
With large systems, this approach is an art form. 'I like to take a holistic approach, ask: What is this query doing?' Cabral said.
Say you want to find the names of everybody logged in to some sort of online system who lives in a particular ZIP code. 'Do you start by finding everyone who has logged in first and then find out whether or not they are near me? Or do you find everyone who is near me first and then check if they are logged in?' Cabral said. The idea is to start with the smallest pool of potential candidates and then narrow down from there. The smaller the initial draw of data, the less work for the database engine.
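A hedged sketch of that query, with hypothetical table and column names; an index on the more selective column, the ZIP code, lets the engine start from the small neighborhood pool and only then check who is logged in.

    -- An index on the ZIP code column lets the engine begin with the small pool
    -- of nearby users rather than with everyone who is logged in.
    CREATE INDEX idx_users_zip ON users (zip_code);

    SELECT name
    FROM users
    WHERE zip_code = '20001'   -- the small pool: people near me
      AND logged_in = 1;       -- narrowed to those currently logged in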
But keep in mind that you want to have the database do as much of the sorting as possible, so it doesn't have to be done by the application itself, said Tom Kyte, Oracle's vice president of the core technology group, who writes an advice column for database questions on the Oracle Web site.
This rarely gets done, he said. An application programmer, not a database administrator, builds the interface from the application to the database. As a result, programmers tend to have their applications do the sorting rather than the database, which is better suited for the task but less understood by the programmer.
Developers tend to write procedural code, that is, routines that repeat one simple task record by record. That process can take much longer than a well-crafted SQL statement, Kyte said.
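A hypothetical illustration of the difference: the procedural habit is to fetch every row and total it in application code, one record at a time, while the set-based alternative hands the whole job to the database in a single statement.

    -- Procedural style (paraphrased): the application loops over every row itself,
    --     for each row in "SELECT amount FROM orders": total = total + amount
    -- Set-based style: one statement gives the optimizer the whole problem at once.
    SELECT customer_id, SUM(amount) AS total
    FROM orders
    GROUP BY customer_id
    ORDER BY total DESC;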
As for writing queries, Kyte suggests that administrators pack as much information as possible in the SQL statement. 'The optimizer will have lots of information about what it is you are actually trying to do, and it can come up with optimal ways of doing it,' he said.
All databases now have query analyzers, or tools that examine a query and rework it to run more efficiently. Oracle offers a query analyzer as part of its Oracle Enterprise Manager Performance Diagnostic Pack. Other tools exist for administrators to get a better view of how their queries execute, such as the open-source Snowflake and MySQL Proxy.
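In MySQL, for instance, prefixing a query with EXPLAIN shows how the engine plans to run it, including which indexes it will use and roughly how many rows it expects to examine, and that report is the usual starting point for reworking a slow query (the query itself is hypothetical).

    -- EXPLAIN reports the execution plan without running the query.
    EXPLAIN
    SELECT name
    FROM users
    WHERE zip_code = '20001'
      AND logged_in = 1;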
Know thy database
A final rule of thumb is to know the idiosyncrasies of the database being used, Bradford said.
Although all relational databases implement SQL, at least with varying degrees of fidelity, each database has a different set of management tools and customs. Knowing how each works can help shave the database's response times.
'MySQL does things differently than Oracle,' Bradford said. For example, Oracle generally has one data type for all numbers. Bradford once managed a MySQL database that was originally designed by an Oracle administrator who used the generic number data type common in Oracle databases. Simply by converting the data type to a more specific MySQL data type, Bradford was able to reduce the size of the 12GB database by 85 percent.
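The change he describes is essentially a column-type conversion; a hypothetical MySQL example of that kind of fix:

    -- The column arrived with a generic, oversized numeric type carried over from
    -- an Oracle-style design, roughly Oracle's NUMBER:
    --     status_code DECIMAL(38,0)
    -- The values never exceed 255, so a one-byte MySQL type is enough.
    ALTER TABLE visits
        MODIFY COLUMN status_code TINYINT UNSIGNED NOT NULL;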
'Know your specifics. If there is an online manual, you should be reading it from cover to cover,' Bradford said.