When 'Googling it' isn't enough

 

Federal agencies must become better informed about the drawbacks of relying on Web search engine technology internally for mission-critical applications, argues Johannes Scholtes, chief strategy officer at ZyLAB.

Johannes Scholtes is chief strategy officer at ZyLAB.

There is an inherent risk when a popular brand becomes the perceived archetype for a particular product group or task. Take Google as an example. Google is great for what it is designed to do – finding relevant Web sites with very high precision when given the right words to use in a search query. Admittedly, I am a big user myself. It is easy to use, fast and the results are often accurate and precise. Google does a phenomenal job keeping up with new information and often locates it quickly on popular sites.

The risk is that when a brand like “Google” becomes synonymous with “search,” there’s a tendency to not give much thought to a variety of other search technologies available.

There are many scenarios, in fact, in which appliances and Web-style enterprise search engines will not get the job done. More importantly, relying on them for certain mission-critical tasks would be a mistake.


Many Internet search engines are optimized for pre-defined, precise query specifications. The user must know exactly which words to use; given those words, the results will be very precise and accurate. This is “focalized” search, a technique that provides little to no ability to explore data because it assumes the user already knows the exact terms to investigate. It fits a basic retrieval model very well, but if the user does not know exactly which words to search for, traditional search tools will not help.

For example, finding all documents that indicate a threat to national security, or finding the causes of the credit crisis, requires “exploratory” search. This type of search offers techniques that can deal with imprecise specifications. More importantly, those techniques are dynamic and self-adapting to changing environments and datasets. Exploratory tools combine many different search techniques with text mining and content analytics, and they provide interactive tools that help a user find proper keywords or navigate through the data.

A closer look reveals some of the limitations of three traditional Web search techniques and the implications for users who require deeper and more thorough search:

Fast crawling and indexing. In order to crawl and index as much full-text data as possible on the Internet, the traditional Web search index technology has to use optimizations and take a number of shortcuts to keep up with all the new data. There is very little time to implement complex calculations at crawling time, when a Web search engine visits new or changed Web sites and updates the internal search index.

Wildcard searches, fuzzy searches, hit highlighting, hit navigation, taxonomy-based searches and faceted search all have to be calculated at search time, when the search engine algorithms use the search index to find relevant Web pages or documents. Users will pay for this. It will either be impossible to use these functions or it will take a very long time before the system returns the results. This is a huge limitation if users do not know the exact words required for a search query.

In addition, not every document that contains a particular word or combination of words is stored in the index; there is often a cutoff after a specific number of entries. This is very problematic if a Web search tool is used for e-discovery collections. Users will then find only the most popular documents, not all of them. That is hard to explain in court to opposing counsel.
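The effect of such a cutoff can be illustrated with a toy inverted index. This is only a sketch under stated assumptions: the `build_index` function and its `max_postings` parameter are hypothetical, and the cap here keeps the first documents seen, whereas a real Web engine would more likely keep the most popular ones.

```python
from collections import defaultdict

def build_index(docs, max_postings=None):
    """Map each word to the list of document ids that contain it.

    max_postings is a hypothetical cap that mimics an index which stops
    recording documents for a word after a fixed number of entries.
    """
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for word in set(text.lower().split()):
            if max_postings is None or len(index[word]) < max_postings:
                index[word].append(doc_id)
    return index

docs = {
    1: "contract renewal memo",
    2: "contract dispute letter",
    3: "draft contract with vendor",
    4: "lunch schedule",
}

full = build_index(docs)                    # no cutoff: complete recall
capped = build_index(docs, max_postings=2)  # cutoff after two documents

print(sorted(full["contract"]))    # all three contract documents
print(sorted(capped["contract"]))  # one responsive document is silently missing
```

Nothing in the capped result tells the user that a third document exists, which is exactly the problem in an e-discovery setting.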

Another consideration: Although most appliances and enterprise search engines do suggest alternatives to a user’s query, these suggestions are based on frequently used queries and not on similar content in the documents. Therefore, a user will miss deliberate misspellings and other low-frequency or unexpected spelling errors. Again, this poses a major risk to intelligence, security and law enforcement work, and to early case assessments.

Relevance ranking based on popularity. With popular Web search engines, everybody wants to be at the top of the list, and some even pay money to get there. But criminals and terrorists don’t want to be found; they try to hide what they are doing. Popularity-based relevance ranking does not account for this. Additionally, it is often impossible to use a relevance ranking scheme other than the popularity ranking, which is based on the number of incoming links. The ranking is often opaque, and results can differ depending on when and where the search is executed.


Avoiding search spam. Web search engines deliberately make their relevance ranking dynamic and not 100 percent clear. The basic principles are published, but the details of the algorithm change all the time based on many different parameters, such as location, time, relevance of a site, keywords used, etc. If they didn’t follow this practice, most Web search engines would soon fall victim to search spam, as Alta Vista did. Search spam directs searches for particular words to sites that have completely different content, but that may be relevant to the searcher. It goes without saying that this type of behavior is unacceptable in a legal or intelligence environment.

The following two examples illustrate how these limitations would impact government users:

The Environmental Protection Agency disclosed a large collection of highly relevant EPA reports on the Internet that consisted of many large scanned documents (often 500 pages or more).

The low quality of some of the document scans resulted in many optical character recognition (OCR) errors. That called for a search solution other than a traditional Web search engine, one that offers fuzzy search and sub-second hit navigation. These features allow a user to overcome spelling variations in names caused by scanning mistakes (such as the names of toxins and other chemicals) and to navigate immediately, for instance, to a hit on page 400 of a 500-page document without having to review the rest of the document.
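To make the idea concrete, here is a minimal sketch of fuzzy matching with page-level hit navigation. It is not the technology EPA uses: the function names, the sample pages and the edit-distance threshold are all assumptions chosen for illustration, and a production system would precompute this work at index time rather than scan every page at query time.

```python
import re

def edit_distance(a, b):
    """Classic Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution
        prev = curr
    return prev[-1]

def fuzzy_hits(pages, term, max_distance=1):
    """Return (page_number, matched_word) pairs within max_distance edits of term."""
    term = term.lower()
    hits = []
    for page_number, text in enumerate(pages, start=1):
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            if edit_distance(word, term) <= max_distance:
                hits.append((page_number, word))
    return hits

# Hypothetical scanned pages; page 2 carries an OCR error ('e' read as 'c').
pages = [
    "Summary of site inspection.",
    "Samples contained benzcne at elevated levels.",
    "Benzene was also detected in groundwater.",
]

hits = fuzzy_hits(pages, "benzene")
print(hits)  # page numbers let a viewer jump straight to each hit
```

An exact-match search for "benzene" would return only page 3; the one-edit tolerance also recovers the garbled spelling on page 2.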

With a typical Google-style Web engine, it is not possible to find documents that contain such errors. It is also difficult to review retrieved documents because it will take an extremely long time to navigate through a lengthy document to the page with a relevant hit (if the hits are displayed at all). The search technology being used by EPA also provides advanced relevance ranking on any key field in the result list and there is no limitation to how users can sort the documents. “Popularity” does not play any role in the current implementation because it is irrelevant.

Another example: Federal agencies often find themselves under subpoena from Congress and must disclose all relevant e-mails and other electronic documents that contain certain keywords from certain custodians within a certain time period. Limiting searches to only meta information or only the first 20,000 e-mails would be incomplete and unacceptable, yet that is often what occurs when using a Web search engine for this type of scenario.
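A subpoena-style request of this kind amounts to an exhaustive filter over the whole collection. The sketch below assumes a hypothetical `Email` record and a `responsive` helper; it scans every message, searches subject and body rather than metadata alone, and applies no result cap.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Email:
    custodian: str
    sent: date
    subject: str
    body: str

def responsive(emails, keywords, custodians, start, end):
    """Return every e-mail that matches; no result cap, body text included."""
    keywords = [k.lower() for k in keywords]
    matches = []
    for mail in emails:
        if mail.custodian not in custodians:
            continue                      # not a named custodian
        if not (start <= mail.sent <= end):
            continue                      # outside the requested period
        text = f"{mail.subject} {mail.body}".lower()
        if any(k in text for k in keywords):
            matches.append(mail)
    return matches

# Hypothetical sample data for illustration only.
emails = [
    Email("alice", date(2009, 3, 1), "Budget", "See the audit findings attached."),
    Email("bob", date(2009, 3, 5), "Audit follow-up", "Call me."),
    Email("alice", date(2008, 1, 9), "Audit", "Old thread."),
    Email("carol", date(2009, 3, 7), "Audit", "Not a named custodian."),
]

found = responsive(emails, ["audit"], {"alice", "bob"},
                   date(2009, 1, 1), date(2009, 12, 31))
print([(m.custodian, m.sent.isoformat()) for m in found])
```

Note that the first message matches only in its body; a search limited to metadata such as the subject line would have missed it.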

As you can see, for advanced search requirements, “Googling it” won’t suffice. Federal government agencies that rely solely on enterprise search appliances should understand and evaluate these limitations. If they use Google or other Web search engines for mission-critical applications such as e-discovery, intelligence, security and law enforcement investigations, it should be clear that there are serious limitations that will affect the quality and defensibility of their work.
