Firstlogic's Frank Dravis

As an agency amasses data, its IT architects are likely to find problems with consistency. Some data elements are formatted one way, others formatted differently. Some information becomes outdated but is never erased. Some is wrong and never corrected. It's a headache that only grows worse as databases expand and are aggregated.

As the vice president of information quality at Firstlogic Inc. of La Crosse, Wis., Frank Dravis is something of a guru on these matters. Firstlogic sells software that analyzes and improves the quality of enterprise data. The company got its start working government contracts and still counts among its customers the Commerce, Homeland Security and Labor departments, as well as the General Services Administration, House of Representatives and Postal Service.

Dravis helps organizations work through their data quality problems. He is a member of the International Association of Information and Data Quality and writes a blog at weblogs.firstlogic.com/dravis. Dravis holds a bachelor's degree in computer science from National University in San Diego and is currently pursuing a master's in business at the University of Wisconsin. He spoke to GCN associate writer Joab Jackson by phone.

GCN: How do you define data quality?

Dravis: Data quality is fitness for use. It is how well your data supports your own business rules and operations.

GCN: How important is data formatting to sharing data?

Dravis: How can you expect an agency to share information internally or across other agencies if it doesn't meet some common formatting standards? It won't be immediately useful if it doesn't meet some standard.

The information gets thrown over the fence, and the people who catch it have to put in place their own [extract, transform and loading] system, and that is [money spent on] a lot of non-value-add. It is gumming up the whole information pipeline. The greater the formatting problems, the greater the likelihood you're not going to go back to the source and ask for that information again.

GCN: What are some common formatting errors?

Dravis: Dates have common formatting errors. There are so many ways to enter dates. Are you using dashes, slashes or periods? If you are merging data together and one data source uses slashes and another uses periods, it can be confusing to people using the data. While they may be able to decipher the dates, sooner or later it slows the whole process down.

Part numbers are inherently problematic. Again, some people want to use slashes and dashes, but maybe over time they replace them with spaces. Then later, they concatenate the fields together wherever there is a space. All of a sudden, there are nine-character part numbers where there should be 10-character part numbers. They took the dashes out, replaced them with spaces and then slammed them together. Classic stuff.

GCN: How did Firstlogic get started?

Dravis: We started with contracts with the Postal Service. We provided an address assignment technology that was loaded into multiline optical character recognition systems. That's how I got my start here; I was a ZIP-plus-four assignment engineer. I wrote address assignment algorithms and matching algorithms. As the mail pieces flew by, the little camera took a picture of each envelope and sent it to [our software, which] deciphered the characters. It looked the addresses up in our address database and then supplied the bar code to spread on the mail piece. The mail piece could then go into the automation mail stream.

Now that was a data quality application. A lot of times the address would be slightly askew, or radically askew, and it didn't match the address database. So you had to do some fuzzy matching logic to find out what was close, and once the confidence was above a certain threshold, you could say this is the real address.

That was the genesis. This was 20 years ago. I remember when my boss came up to me and said, 'Frank, we're doing address cleansing, and we need to do name cleansing. It should be a short step.' So we developed a name-cleansing, standardization and formatting algorithm.

Addresses took us to names, names took us to matching, matching took us to consolidation. Wherever our customer had a data quality problem, they dragged us into that field. And so that is why our solution works on operational data.

GCN: And how has the technology evolved?

Dravis: Early on, customers would come to us and say they need an address-cleansing solution. We'd sell them address-cleansing solutions, but it was a lot like someone going to the pharmacist and saying, 'I need high-blood-pressure medication.' 'Have you seen the doctor?' 'No.' 'How do you know you have high blood pressure?' 'I can feel it.'

Well, now Firstlogic offers data-profiling software that measures your data against your business rules. You can quantify the level of data quality against your own thresholds [of acceptable quality] to build a return-on-investment, so you can say, 'Here are our data defects. If we fix these data defects, we will gain these benefits.'

GCN: I noticed a research paper you co-authored on something called 'householding' was supported by the Naval Inventory Control Point. What is householding, and what was the Navy's interest in it?

Dravis: Householding is the act of [bundling] similar records. Let's use a retail example. You have [records on] Frank Dravis, Daniel Dravis, Kim Dravis and Drew Dravis, and they all have the same address. All four of them have the same phone number. The ages of two of them are over 40, and the ages of the other two are under 20. [This practice would] aggregate those four into a household to get a view of purchasing patterns.

The Navy is very interested in using household views to optimize their supply chain. Some of these weapon systems are kind of old. Over time, the manufacturer of a jet engine may stop supporting that engine. Maybe an aftermarket manufacturer supports that engine. Within that engine, there may be a generator, or the turbine blades, each made by a different manufacturer. So you need to get a hierarchical view of all the vendors for the engine so you can select the most cost-effective ones, the vendors closest to your re-engineering facilities, or whatever your criteria [are].

If it is the FBI, a form of household might be an associative network or actor network. Who are all the various people related to, or associate[d] with, this one person? The associations may be as tenuous as air flights. Were these two people on the same airplane, or did these two people fly to the same country at the same time?

GCN: Technically speaking, how would an agency create an enterprisewide data structure?

Dravis: This is not a simple thing you ask. It is a big project. I could give you a very short answer: It involves aggregating and integrating the various disparate data sources to a staging area, using an [extract, transform and loading] application to load all of this data into the integrated data warehouse.

From there, extract the data from the data warehouse into various contextually rich data marts. Various applications will then either feed the data marts or feed from the data marts. That is the 60-second statement of a very, very big subject.

GCN: What is the most extreme example of poor data quality that you've seen?

Dravis: I worked with a client that had five or ten records for each customer in its customer relationship management system. They didn't have any practices to guard against duplicative customer entry. Most organizations would have found that system unusable, but because individual managers used their own little subsets of the data, they understood which records were defective and should be avoided.

The downside was that the information was in the heads of the practitioners. It was not organizational information. The marketing people couldn't run a report on who the top customers were, because there were too many duplicate records.

Have you ever gotten duplicate mailings from the same vendor, with one title on one piece and another title on another piece? [That vendor] doesn't reconcile these duplicative contacts, and over time the problem just gets bigger. Sooner or later, [the organization] realizes it must implement some sort of managing and consolidation solution. In order for that solution to work, it has to do address and name cleansing.
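
For readers who want to see the idea in concrete terms, the duplicate-record problem Dravis describes, and the approach he mentions of cleansing names and addresses and then accepting a fuzzy match once confidence passes a threshold, might be sketched roughly as follows in Python. This is only an illustration, not Firstlogic's method: the abbreviation table, the 0.85 threshold, the function names and the sample addresses are all assumptions made up for the example, and the similarity score comes from the standard library's difflib rather than any postal reference database.

```python
import re
from difflib import SequenceMatcher

# Hypothetical abbreviation table; a real cleansing product would rely on
# far richer reference data (for example, a postal address database).
ABBREVIATIONS = {"street": "st", "avenue": "ave", "road": "rd", "drive": "dr"}


def cleanse(text):
    """Lowercase, drop punctuation, collapse whitespace, standardize suffixes."""
    words = re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)


def similarity(a, b):
    """Similarity ratio in [0, 1] between the cleansed forms of two strings."""
    return SequenceMatcher(None, cleanse(a), cleanse(b)).ratio()


def find_duplicates(records, threshold=0.85):
    """Return index pairs whose name AND address both score at or above threshold."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            name_score = similarity(records[i]["name"], records[j]["name"])
            addr_score = similarity(records[i]["address"], records[j]["address"])
            # Assumed rule: treat the weaker of the two scores as the confidence.
            if min(name_score, addr_score) >= threshold:
                pairs.append((i, j))
    return pairs


# Invented sample records; the addresses are not from the interview.
records = [
    {"name": "Frank Dravis", "address": "100 Main Street"},
    {"name": "Frank  Dravis.", "address": "100 Main St."},
    {"name": "Kim Dravis", "address": "100 Main St"},
]
duplicate_pairs = find_duplicates(records)
print(duplicate_pairs)
```

In this sketch the first two records collapse to the same cleansed name and address and are flagged as a duplicate pair, while the third shares the address but has a different name and is left alone. Comparing every pair of records is fine for a toy list; a production system would need to group candidate records first to avoid comparing everything against everything.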