Search and enjoy

 


Search companies try new techniques for understanding subtle distinctions within gigantic piles of data.

When the National Park Service set up the electronic clearinghouse for historic-preservation information, one of the toughest challenges was not technical but rather behavioral.

"The brain is trained on keywords, and the most difficult part of setting this up is to stop people from thinking in keywords and start getting them to think the way we used to think, in entire questions," said Constance Werner Ramirez, who is the director of the Federal Preservation Institute.

Using the search engine embedded in the Web portal (GCN.com/764), users really could type in a full sentence and get a better result. "How to clean mold from books and photographs?" will lead to many results from different agencies, universities and other organizations. "You want to give this system a lot to work with," Ramirez said.

Search engines are getting better, slowly but surely. High-end software, such as the software from Autonomy that runs the Historic Preservation portal, is making headway in offering users results they can actually use.

"Basic search has remained unchanged since the mid-1960s," noted search consultant Steve Arnold, who spoke at the Gilbane Conference on Content Technologies for Government held in Washington last year. And users are starting to notice the limits of this older technology.

"When you're doing search in the enterprise, the basic search tools make finding information pretty darn hard," he said. Make your search too wide, and you get an ocean of results with little way of finding the good stuff. Yet cast your search too narrowly, and you won't get any hits at all. And even the best basic search approaches can miss 20 percent to 35 percent of the information out there, Arnold said.

Nonetheless, search engine companies are trying to add techniques to the basic search algorithms to add coherence to search results. And some of the work they are doing looks promising.

The basics of search are rather simple.
At the most rudimentary level, the search engine returns all the documents that contain the query phrase. This approach is called term matching.

Databases allow for more flexibility in this approach, because all the material in a database is structured. Each data element is mapped to a predefined field. As a result, a query against a set of structured data can be more elaborate, allowing the user to logically hone the query using a delicate combination of fields. For example, the query "SELECT name, city FROM SampleTable ORDER BY name" asks the database system to list every name in SampleTable, along with its city, in alphabetical order by name.

Making sense of unstructured documents (the fancy name for all the word processing documents, spreadsheets and Web pages that make up the vast majority of an organization's content) is a far more difficult task. The problem? A computer program won't know the relationships among all the words in a given document, or across a given set of documents. The search engine simply has to scan and index all the terms in all the documents under its purview and then offer pointers to those documents with the query term.

Some headway has been made in the past few decades to make sense of these large indexes of words.

There are two different approaches to refining searches, said Michael Lynch, president and founder of Autonomy. One approach is linguistic, which means the software tries to look at the relationships among the words in documents to infer some meaning about the words themselves. The other approach is probabilistic, which just looks at any statistical trends that can be mined from the documents as indicators of importance.

Linguistic matching is the more ambitious of the two but, so far, not the more successful.

"Vast amounts of research [on linguistic search] has been done on this over the years, and it generally hasn't worked," Lynch said.
"You haven't seen many commercial uses of it yet."

He said the downside to this approach is that the rules of language are not absolute, and there is a lot of meaning lost in semantic ambiguity. Take a pair of sentences such as "The dog walked into the room. It was furry." Most people would assume that the dog was furry, but a computer, using strict rules of interpretation, would assume the room was furry.

The linguistic approach does work well in environments where the scope of the search is quite small, and people tend to ask similar questions. "The linguistic effort can be very good when you know what the question will be, because you can put up the standard answers," Lynch said.

Programs have had more success by disregarding semantics and focusing on mathematical techniques.

One of the oldest approaches in this latter category is looking at term frequency and inverse document frequency, noted James Melzer, an information architect at SRA International. Term frequency simply counts the number of times a term appears in a document. The more times it does, the more likely the document is about that term.

But you could also derive significance from the opposite approach: the fewer documents in a collection a term appears in, the more likely it is that the term represents what is unique about a document that does contain it. That's called inverse document frequency. Most search companies today use a mix of those two approaches, Melzer said.

A more advanced approach that builds on these basic techniques involves clustering documents that appear to be similar, based on the terms contained within the documents. Here, documents with many overlapping terms are grouped together.

This approach allows the search to break free of literal term matching, as it relates documents that are similar but do not have a complete overlap of terms.
A search for the word "computer" will return documents that may involve networking, the Internet or some other topic intimately involved with computers, even if they don't mention the word "computer." There are a variety of algorithms, such as Vector Space Model and Latent Semantic Indexing, that can execute this function, using differing approaches.

"Each approach builds a mathematical presentation of each document but also aggregates [documents] into clusters," Melzer said. In some cases, the user is not aware this is going on; the clusters just serve to shape the list of results that end up on the screen. In other cases, the clusters are presented to the users.

For example, do a search on "Bunker Hill" on the General Services Administration's USA.gov Web site, which runs on Vivisimo's clustering software via Microsoft's MSN search service. If you sort the documents by agency, you will get two major clusters, each with a different focus.

Because Bunker Hill is a national park, the National Park Service has information pertaining to the visitor and historical aspects of Bunker Hill. But Bunker Hill is also the name of an Environmental Protection Agency Superfund site, the result of toxic wastes from decades of mining. So links to the Centers for Disease Control and Prevention's Agency for Toxic Substances and Disease Registry are also presented, under a separate grouping. By clustering these two large sets of documents, USA.gov makes it easier for users to find what they are looking for.

Many search engines also use a related technique called Bayesian inference, which looks at the mathematical distribution of words in a document and compares it with other documents, Lynch noted. As its name suggests, Bayesian inference infers the major ideas behind the creation of a document, using the words as pointers to those ideas.

For instance, people who write about the topic of dogs will tend to use the same set of words, even if some never use the word "dog" itself, Lynch noted.
"The big advantage to this technology is that it adapts to a changing world," Lynch said, adding that, as new ideas make their way through our culture, the search engine should understand them at about the same time people do.

A more recent step forward in search technology has been the PageRank algorithm from Google, Melzer said. PageRank is a method of weighing the importance of each page by the number of other pages that link to that page. Because a site such as NBC.com gets plenty of links from other sites, it may rank more highly in a search on, say, television, than the site for Bob's Television Repair Shop, which may have only a few inbound links.

One thing to keep in mind about Google, however, is that PageRank only works well on the Web, where pages routinely link to one another. In enterprise environments, which typically contain more stand-alone documents, linking is rare, so the effectiveness of this approach is minimal.

"Generally speaking, enterprise search is not a popularity contest," Lynch said.

Nevertheless, Google and other Web search engines radically changed users' ideas of what should constitute a search term. Most people think of searches as one or two words. "You put in the word 'Sears' and Google gives you 'Sears.com' every time," Melzer said. Many older search-and-retrieval engines did not handle such short queries very well, relying on people to enter Boolean strings or other advanced querying methods.

"On the Web, you have people who are typing in very short, very general search terms. I think that was the way Google revolutionized search: they weren't doing information retrieval the way everyone else was," Melzer said. Now the enterprise search companies must catch up with the perception of what search is.

Out in the marketplace, other search companies look for other approaches to bring more relevant information to users.

Sometimes, information about a document can be garnered from its mere location.
One of the ways search software from Isys Search Software categorizes its source material is to take into consideration the names of folders in which the documents reside.

"Usually, there is some sort of structure to the directories," said Derek Murphy, president of U.S. operations at Isys. The hierarchy can help in the categorization of the documents.

Metadata can also be useful in refining results. This is information that the user enters into the document about the document itself, such as who the author is and when it was written.

Isys software, for instance, can be configured to weight metadata higher or lower than data within the documents themselves. If no one in your organization is filling in the metadata in documents, then the organization can de-emphasize that in the search configuration. But if the organization has a content management system that creates a lot of metadata automatically, that data can be put to use.

Isys has a built-in search function called "espin" that focuses on the metadata fields. A search such as "Einstein espin Author" would return all those hits that carried the word Einstein, but it would put documents that listed Einstein as the author near the top, Murphy noted.

Isys also uses a technique called entity extraction. As the search engine is indexing a document, it can identify things such as names, street addresses and e-mail addresses. Isys has a list of rules the engine follows to identify these traits when the document is being indexed. Organizations also can add their own rules. When a user runs a search, a list of the various entities that appear in the results is displayed down the right-hand side of the page.

Search engine companies also are starting to draw on other areas of computational science in hopes of giving greater context to the words on the page.
In one case, business intelligence software provider Cognos has been looking at ways to export the knowledge created in BI systems to aid in search, said Paul Hulford, product marketing manager at the company.

Hulford noted that the average BI customer will spend a lot of time mapping out templates for reports. So when these reports are run within the BI system, either on a scheduled interval or when requested by the user, they draw up-to-the-minute data from the organization's databases.

Last fall, Cognos extended its Cognos 8 Business Intelligence platform so that its dynamically updated reports can be inserted into what a search engine can index. The results might not even be compiled before a user requests a copy of the report. The search engine will offer the user the ability to compile a report, though. In this way, BI is extending search into documents that haven't been created yet, Hulford said.

Text mining is another field that is benefiting search. One text-mining company applying its efforts to search is Attensity, whose software analyzes documents and extracts nuggets of information that later can be offered to users as points of information.

The software looks for who did what to whom, said Michelle de Haaff, vice president of marketing at the company. More formally, what is being extracted is something called triples, she said, which is a combination of subject, object and a predicate that defines the relationship between the two.

For instance, with a given set of literature about the Boeing 737, the software could extract when the plane was first designed, when it was built and how many are flying now, if that information is found in the source document. Search engine providers can then offer these results alongside or on top of a more standard list of relevant documents. The software can work with a wide variety of formats as well, including "Word documents, databases, PowerPoint presentations, something that has been optically scanned," de Haaff said.
"We have no restrictions."
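
The subject-predicate-object triples described above can be pictured with a small data-structure sketch. It does not show how any vendor extracts triples from text; it only illustrates, under assumed names, how extracted triples might be stored and filtered. The `Triple` type, the `find` helper and the sample facts are all hypothetical.

```python
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    """One extracted fact: who (subject) did what (predicate) to whom (obj)."""
    subject: str
    predicate: str
    obj: str


def find(triples, subject: Optional[str] = None,
         predicate: Optional[str] = None, obj: Optional[str] = None):
    """Return the triples that match every field the caller specified."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# Hypothetical facts, invented purely for illustration.
facts = [
    Triple("Agency A", "published", "Report X"),
    Triple("Agency B", "cited", "Report X"),
    Triple("Agency A", "funded", "Project Y"),
]

about_agency_a = find(facts, subject="Agency A")
who_cited = find(facts, predicate="cited", obj="Report X")
```

A search front end could show matches like `who_cited` next to the ordinary list of documents, which is the kind of pairing the article describes.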

Search strategies

How does a search engine pick the items it chooses to present to you when you type in a search? Most search engines are built with a number of basic techniques and trade-offs in mind.

Term-based ranking

Most search engines use a mixture of term frequency and inverse document frequency. Both are simple mathematical operations: Term frequency assumes that the documents that contain the most instances of a query term would be the most relevant to the user. Inverse document frequency assumes the opposite: that the terms appearing in the fewest documents are the best indicators of what a given document is about.
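
As a rough illustration of how the two measures combine, here is a minimal sketch that scores documents against a query. It assumes whitespace tokenization, a relative-frequency TF and a plain logarithmic IDF; commercial engines use more refined variants, and the function name and sample documents are made up for the example.

```python
import math
from collections import Counter


def tf_idf_scores(query, documents):
    """Score each document against the query by summing TF * IDF per term."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in counts:
                continue
            tf = counts[term] / len(tokens)            # frequent in this document
            idf = math.log(n_docs / doc_freq[term])    # rare across the collection
            score += tf * idf
        scores.append(score)
    return scores


docs = [
    "mold on books and mold on photographs",
    "cleaning historic photographs",
    "books about national parks",
]
scores = tf_idf_scores("mold photographs", docs)
```

The first document scores highest because "mold" is both frequent within it and absent from the others; the third scores zero because it contains neither query term.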

PageRank/citation analysis

Pioneered by Sergey Brin and Lawrence Page, who later built Google from this technique, PageRank judges the relevance of Web pages by how many other Web pages link to that page. Although useful on the Web, the approach has limited effectiveness in most enterprise search, where linking is not common. The Google founders cribbed the idea from citation analysis, a technique for evaluating which academic papers are most important by counting the number of citations they garnered in subsequent papers.
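
The link-counting idea can be sketched in a few lines. The version below assumes the commonly published iterative formulation, in which a link from a highly ranked page counts for more than a link from an obscure one, with a damping factor of 0.85. The damping value, the function name and the tiny link graph are assumptions for illustration, not details from the article.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively estimate page importance from a dict of page -> outbound links."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its current rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outbound links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical link graph: most pages link to "network-site".
links = {
    "network-site": ["news-site"],
    "news-site": ["network-site"],
    "blog": ["network-site", "repair-shop"],
    "forum": ["network-site"],
    "repair-shop": [],
}
ranks = pagerank(links)
```

In this toy graph the heavily linked "network-site" ends up with the largest score, while "repair-shop", with a single inbound link, stays near the bottom.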

Recall vs. precision

Most search engines strike a balance between recall and precision. Recall is the percentage of all the relevant documents in a collection that a search returns. Precision is the percentage of the documents returned that are relevant to the user. In most cases, the two pull against each other: A 100 percent recall rate could overwhelm users with marginal results, thus reducing the precision rate, while an overly high precision rate could leave potentially useful documents hidden.
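
A short worked example may make the trade-off concrete. This sketch assumes we already know which documents are relevant, which is only true in a test setting; the document IDs and the function name are made up.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for one search result list."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents in the collection are actually relevant.
relevant = {"doc1", "doc2", "doc3", "doc4"}

# A narrow search returns only good hits but misses half of them.
narrow = ["doc1", "doc2"]
# A broad search finds everything relevant, buried among irrelevant hits.
broad = ["doc1", "doc2", "doc3", "doc4", "doc5", "doc6", "doc7", "doc8"]

print(precision_recall(narrow, relevant))  # (1.0, 0.5)
print(precision_recall(broad, relevant))   # (0.5, 1.0)
```

The narrow query gets perfect precision at the cost of recall, and the broad query does the reverse.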

Mike Bentley

"The brain is trained on keywords and the most difficult part of setting this up is to start getting people to think in entire questions." - Constance Werner Ramirez, Federal Preservation Institute














