When 'Googling it' isn't enough
Federal agencies must become better informed about the drawbacks of relying on Web search engine technology internally for mission-critical applications, argues Johannes Scholtes, chief strategy officer at ZyLAB.
Johannes Scholtes is chief strategy officer at ZyLAB.
There is an inherent risk when a popular brand becomes the perceived archetype for a particular product group or task. Take Google as an example. Google is great for what it is designed to do – finding relevant Web sites with very high precision when given the right words to use in a search query. Admittedly, I am a big user myself. It is easy to use, fast and the results are often accurate and precise. Google does a phenomenal job keeping up with new information and often locates it quickly on popular sites.
The risk is that when a brand like “Google” becomes synonymous with “search,” there is a tendency to give little thought to the variety of other search technologies available.
There are many scenarios, in fact, in which appliances and Web-style enterprise search engines will not get the job done. More importantly, relying on them for certain mission-critical tasks would be a mistake.
Many Internet search engines are optimized for pre-defined, specific and precise queries. In those instances, users must know exactly what words to use; if they do, the search results will be very precise and accurate. This is “focalized” search, a technique that provides little or no ability to explore data; it assumes the user knows the exact terms to investigate. This fits a basic retrieval model very well, but if one does not know exactly what words to use, traditional search tools will not help.
For example, searching for all documents that present a threat to national security, or finding the causes of the credit crisis, requires “exploratory” search. This type of search offers techniques that can deal with imprecise specifications. More importantly, these techniques are dynamic and self-adapting to changing environments and datasets. They make use of many different search techniques, as well as search tools, text mining and content analytics, and they provide various interactive tools to help a user find the proper keywords or navigate interactively through the data.
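To make the distinction concrete, here is a minimal sketch in Python, using an entirely hypothetical three-document collection: an exact-match lookup that only works when the investigator already knows the term, next to a simple co-occurrence helper of the kind exploratory tools use to suggest related vocabulary. Real exploratory systems use far richer text mining, but the contrast is the point.

```python
from collections import Counter
import re

# Hypothetical document collection, for illustration only.
documents = [
    "suspicious wire transfer routed through shell company accounts",
    "quarterly report on shell company subsidiaries and offshore accounts",
    "routine payroll transfer for regional office staff",
]

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

# Focalized search: exact keyword match -- the user must already know the term.
def focalized_search(query_term, docs):
    return [i for i, d in enumerate(docs) if query_term in tokenize(d)]

# Exploratory aid: surface terms that co-occur with a seed term, so the user
# can discover vocabulary worth investigating further.
def cooccurring_terms(seed_term, docs, top_n=5):
    counts = Counter()
    for d in docs:
        tokens = tokenize(d)
        if seed_term in tokens:
            counts.update(t for t in tokens if t != seed_term)
    return counts.most_common(top_n)

print(focalized_search("shell", documents))   # [0, 1]
print(cooccurring_terms("shell", documents))  # e.g. [('company', 2), ('accounts', 2), ...]
```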
A closer look reveals some of the limitations of three traditional Web search techniques and the implications for users who require deeper and more thorough search:
Fast crawling and indexing. In order to crawl and index as much full-text data as possible on the Internet, traditional Web search indexing technology has to take a number of shortcuts to keep up with all the new data. There is very little time for complex calculations at crawling time, when a Web search engine visits new or changed Web sites and updates its internal search index.
Wildcard searches, fuzzy searches, hit highlighting, hit navigation, taxonomy-based searches and faceted search all have to be calculated at search time, when the search engine algorithms use the search index to find relevant Web pages or documents. Users will pay for this. It will either be impossible to use these functions or it will take a very long time before the system returns the results. This is a huge limitation if users do not know the exact words required for a search query.
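As a rough illustration of why deferring this work to search time is costly, the sketch below (a hypothetical toy corpus) builds a tiny inverted index and then answers a wildcard query by scanning the entire indexed vocabulary for matching terms before it can touch a single posting list. On a Web-scale vocabulary, that scan is exactly what makes such queries slow or unsupported.

```python
import fnmatch
from collections import defaultdict

# Toy inverted index: term -> set of document IDs (cheap to build at crawl time).
inverted_index = defaultdict(set)
corpus = {
    1: "benzene exposure limits",
    2: "benzidine contamination report",
    3: "routine site inspection",
}
for doc_id, text in corpus.items():
    for term in text.split():
        inverted_index[term].add(doc_id)

# Wildcard search deferred to query time: every term in the vocabulary must be
# checked against the pattern before posting lists can be merged.
def wildcard_search(pattern):
    matching_terms = [t for t in inverted_index if fnmatch.fnmatch(t, pattern)]
    docs = set()
    for term in matching_terms:
        docs |= inverted_index[term]
    return matching_terms, docs

print(wildcard_search("benz*"))  # (['benzene', 'benzidine'], {1, 2})
```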
In addition, not all occurrences of documents that contain particular words or combinations of words are stored in the index – there is often a cutoff after a specific number. This is very problematic if a Web search tool is used for e-discovery collections: users will find only the most popular documents, not all of them. That is hard to explain in court to opposing counsel.
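A minimal sketch of that completeness problem, with a made-up cutoff value: if the index keeps only the first few postings per term, every responsive document beyond the cutoff is silently missing from the result set, which is exactly what an e-discovery collection cannot tolerate.

```python
# Hypothetical cutoff on the posting list for a term; real engines use far
# larger limits, but the completeness problem is the same.
MAX_POSTINGS = 3

all_matching_docs = ["doc-%04d" % i for i in range(1, 11)]  # 10 truly responsive documents
indexed_postings = all_matching_docs[:MAX_POSTINGS]         # what the capped index keeps

missed = sorted(set(all_matching_docs) - set(indexed_postings))
print(f"returned {len(indexed_postings)} of {len(all_matching_docs)} responsive documents")
print(f"silently missing: {missed}")
```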
Another consideration: Although most appliances and enterprise search engines do suggest alternatives to a user’s query, these suggestions are based on frequently used queries, not on similar content in the documents. As a result, a user will miss deliberate misspellings and other low-frequency, unexpected spelling errors. Again, this poses a major risk to intelligence, security, law enforcement and early case assessments.
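The difference can be sketched with Python's standard difflib (the query log and vocabulary below are invented): a suggestion drawn from popular queries can only ever propose frequent terms, while a suggestion drawn from the terms actually present in the documents also surfaces low-frequency or deliberately misspelled variants.

```python
from difflib import get_close_matches

# Terms actually present in the indexed documents, including a misspelled
# variant that would never show up in a popular query log.
document_vocabulary = ["acetone", "acetonitrile", "aceton", "benzene", "payment"]

# Hypothetical query log of frequent searches.
popular_queries = ["acetone", "benzene", "weather", "news"]

user_query = "acetonne"

# Query-log suggestion: proposes only frequent queries.
log_based = get_close_matches(user_query, popular_queries, n=3, cutoff=0.7)

# Content-based suggestion: proposes terms that occur in the collection itself,
# including the low-frequency misspelling "aceton".
content_based = get_close_matches(user_query, document_vocabulary, n=3, cutoff=0.7)

print(log_based)      # ['acetone']
print(content_based)  # ['acetone', 'aceton', ...]
```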
Relevance ranking based on popularity. With popular Web search engines, everybody wants to be at the top of the list – some even pay money to get there. But criminals and terrorists do not want to be found; they try to hide what they are doing. Web search engines’ relevance ranking does not overcome these circumstances. Additionally, it is often impossible to use a relevance ranking scheme other than popularity ranking, which is based on the number of incoming links. The ranking results are often unclear and vary with the time and location from which the search is executed.
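For illustration only, here is a simplified link-based ranking iteration in the spirit of PageRank, run over an invented four-page link graph: the page that nobody links to ends up at the bottom no matter how relevant its content is to the investigator.

```python
# Simplified link-based popularity ranking over a hypothetical link graph.
links = {
    "news-site":   ["portal", "blog"],
    "portal":      ["news-site"],
    "blog":        ["news-site", "portal"],
    "hidden-page": [],  # relevant content, but no one links to it
}
pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}
damping = 0.85

for _ in range(20):  # power iteration
    new_rank = {p: (1 - damping) / len(pages) for p in pages}
    for page, outlinks in links.items():
        if not outlinks:
            continue  # simplification: dangling pages just lose their share
        share = damping * rank[page] / len(outlinks)
        for target in outlinks:
            new_rank[target] += share
    rank = new_rank

for page, score in sorted(rank.items(), key=lambda kv: -kv[1]):
    print(f"{page:12s} {score:.3f}")  # 'hidden-page' ranks last
```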
Avoiding search spam. Web search engines deliberately keep their relevance ranking dynamic and not 100 percent transparent. The basic principles are published, but the details of the algorithm change all the time based on many different parameters, such as location, time, relevance of a site and the keywords used. If they did not follow this practice, most Web search engines would soon fall victim to search spam, as AltaVista did. Search spam steers searches for particular words to sites whose actual content is completely different from what the searcher is looking for. It goes without saying that this type of behavior is unacceptable in a legal or intelligence environment.
The following two examples illustrate how these limitations would impact government users:
The Environmental Protection Agency disclosed a large collection of highly relevant EPA reports on the Internet that consisted of many large scanned documents (often 500 pages or more).
The low quality of some of the document scans resulted in many optical character recognition (OCR) errors. That called for a search solution other than a traditional Web search engine, one that offers fuzzy search and sub-second hit navigation. This allows a user to overcome spelling variations of names caused by scanning errors (such as the names of toxins and other chemicals) and to jump immediately, for instance, to a hit on page 400 of a 500-page document without having to review the rest of the document.
With a typical Google-style Web engine, it is not possible to find documents that contain such errors. It is also difficult to review retrieved documents because it will take an extremely long time to navigate through a lengthy document to the page with a relevant hit (if the hits are displayed at all). The search technology being used by EPA also provides advanced relevance ranking on any key field in the result list and there is no limitation to how users can sort the documents. “Popularity” does not play any role in the current implementation because it is irrelevant.
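A minimal sketch of the kind of fuzzy matching and hit navigation described above, using Python's standard SequenceMatcher and an invented 500-page OCR'd report; this illustrates the general technique, not the specific product involved.

```python
from difflib import SequenceMatcher

# Hypothetical scanned report as a list of OCR'd pages; page 400 contains a
# garbled occurrence of the chemical name the user is searching for.
pages = ["routine boilerplate text"] * 500
pages[399] = "elevated levels of benzcne detected near the outfall"  # OCR error

def fuzzy_hits(query, pages, threshold=0.8):
    """Return (page number, matched word) pairs where a word on the page is
    close enough to the query, tolerating OCR character errors."""
    hits = []
    for page_no, text in enumerate(pages, start=1):
        for word in text.split():
            if SequenceMatcher(None, query, word).ratio() >= threshold:
                hits.append((page_no, word))
    return hits

print(fuzzy_hits("benzene", pages))  # [(400, 'benzcne')] -- jump straight to page 400
```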
Another example: Federal agencies often find themselves under subpoena from Congress and must disclose all relevant e-mails and other electronic documents that contain certain keywords from certain custodians within a certain time period. Limiting searches to only meta information or only the first 20,000 e-mails would be incomplete and unacceptable, yet that is often what occurs when using a Web search engine for this type of scenario.
As you can see, for advanced search requirements, “Googling it” won’t suffice. Federal government agencies that rely solely on enterprise search appliances should understand and evaluate these limitations. If they use Google or other Web search engines for mission-critical applications such as e-discovery, intelligence, security and law enforcement investigations, it should be clear that there are serious limitations that will affect the quality and defensibility of their work.