Let Web bots do the grunt work
If you want to know the World Wide Web today, talk to a robot.
So-called bots are a cheap source of labor if you'd like to stay up-to-date on various
kinds of information but don't have the time to search hundreds of Web home pages every
day.
Also known as Web-walkers, spiders or wanderers, these software agents fall into two
types: personal bots set up and operated by end users, and bots offered by Internet
service providers.
You can tweak personal bots to act as your own executive information system, culling
information from Web sites and Usenet newsgroups. Caution: These bots are notoriously
prodigal with network resources, processor time and storage.
Bots operated by service providers are more common. These bots--or at least the
databases they feed--can be used for free by clients located anywhere on the Internet. The
clients usually must connect through a forms-driven interface on a Web browser. Unless a
bot is sponsored by the government or another organization, the cost of its service is
recouped from access charges or advertising.
To call a Web bot a wanderer is a misnomer, because most bots stay put on a single
Internet server and merely send out queries to other Web servers. The servers being
queried can't tell whether they've been contacted by a bot or by a person with a Web
browser; they respond the same way, sending back a copy of the home page file.
Unless a bot is set to search for artwork, it will simply ignore any graphics files
associated with a home page or signal the server to stop sending the graphics.
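There is little magic in the exchange. Here is a minimal sketch in Python, with a
placeholder address standing in for the server being queried, of a bot fetching a page
exactly as a browser would and skipping anything that isn't HTML:

    # Minimal sketch of a bot's page request. The URL is a placeholder.
    import urllib.request

    url = "http://www.example.com/"
    with urllib.request.urlopen(url) as reply:
        # The server answers a bot exactly as it would answer a browser.
        if reply.headers.get_content_type() == "text/html":
            page = reply.read().decode("latin-1", errors="replace")
            print(page[:300])  # first few lines of the home page file
        # Anything else (a GIF, say) a text-only bot simply declines to read.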
A bot actually is a series of programs working together: one sends network queries,
another compares the culled material against filters, and others parse specific requests
and present the information in the desired format.
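In rough outline, and with all function names purely illustrative, that division of
labor might be sketched like this:

    # Division of labor inside a bot, sketched as three cooperating stages.
    # All function names and the address below are hypothetical.
    import urllib.request

    def fetch(url):
        """One program sends the network query and returns the page text."""
        with urllib.request.urlopen(url) as reply:
            return reply.read().decode("latin-1", errors="replace")

    def passes_filter(page, terms):
        """Another compares the culled material against the user's filters."""
        text = page.lower()
        return any(term.lower() in text for term in terms)

    def present(url, page):
        """Others parse the result and present it in the desired format."""
        print(url, "--", page[:80].replace("\n", " "))

    for url in ["http://www.example.com/"]:
        page = fetch(url)
        if passes_filter(page, ["procurement", "budget"]):
            present(url, page)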
You can assign a bot to download and inspect large numbers of Web pages automatically,
then take action based on what is found. That usually means searching for further
references via Hypertext Markup Language (HTML) pointers and downloading files from those
pages to repeat the search process. Or it can mean adding specific references to your
database or indexing words and pointers for future full-text searches.
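A bare-bones version of that loop, again in Python with a placeholder starting address,
downloads a page, pulls out its HTML pointers and repeats until it hits a self-imposed
limit:

    # Bare-bones crawl loop: download pages, follow HTML pointers, repeat.
    # The starting address is a placeholder; the page limit keeps the bot
    # from getting away from you.
    import re
    import urllib.request
    from urllib.parse import urljoin

    HREF = re.compile(r'href="([^"]+)"', re.IGNORECASE)

    def crawl(start, max_pages=20):
        queue, seen, index = [start], set(), {}
        while queue and len(seen) < max_pages:
            url = queue.pop(0)
            if url in seen:
                continue
            seen.add(url)
            try:
                with urllib.request.urlopen(url) as reply:
                    if reply.headers.get_content_type() != "text/html":
                        continue
                    page = reply.read().decode("latin-1", errors="replace")
            except OSError:
                continue
            index[url] = page[:100]          # or index words for full-text search
            for link in HREF.findall(page):  # further references via HTML pointers
                target = urljoin(url, link)
                if target.startswith("http"):
                    queue.append(target)
        return index

    # crawl("http://www.example.com/")  # placeholder starting point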
The Lycos "Catalog of the Internet" maintained by Lycos Inc. and Carnegie
Mellon University is one of the best bot-fed reference tools available today. Similar
tools are Architext Software Inc.'s Excite database and Brian Pinkerton's WebCrawler,
sponsored by America Online Inc.
Pointers to these and other search services can be found on the Internet Search page
maintained by Netscape Communications Corp. at http://home.netscape.com/internet-search.html.
If you want to operate your own bot, it's easiest to work with a provider who can
customize sets of bots for your needs. For example, BBN Corp. of Cambridge, Mass.,
formerly Bolt Beranek and Newman Inc., has software for creating a personal information
newspaper, called PINpaper, on your Web browser using Internet content as well as in-house
resources. You can display and share this discovered information on public pages.
The Treasury Department has built some test PINpaper pages, according to BBN officials.
BBN customizes each bot system, so pricing information is sketchy, but basic modules
run about $10,000. A full system for multiple users, including server hardware, could cost
$60,000.
The problem with running your own bots is that they can get away from you. Other server
owners, for example, might resent your bot tying up their systems with extensive requests
for documents. You might run up charges of unknown magnitude if your bot hits a commercial
on-line service. And your search could even be stymied by nonsense pages that display
hundreds of popular reference words just to see how many connections they get.
Because bots are such busybodies, they pig out on resources. It's good to set them to
run at night when systems generally are underutilized.
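On a Unix host, a crontab entry is the usual way to enforce that habit; the script path
here is hypothetical:

    # Run the bot at 2 a.m. daily, when the network is quiet.
    # The script path is hypothetical.
    0 2 * * * /usr/local/bin/run-mybot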
If you're just interested in resource discovery, leave the searching to the commercial
bots like Lycos or Architext's Excite database. Lycos probably is the most extensive of
the advertiser-supported search services, but Excite allows more liberal search terms.
If you prefer to try running your own small bot, experiment with Harvest, Boulder or
WebWatch. With these tools, you can establish an organized bookmark system that will
extract data or update pointers by itself.
A good approach for a government office with a limited budget is to install Harvest on
a Unix server and then set up custom pages that autolink to services like that of Lycos
for extended searches.
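The custom pages themselves need nothing fancier than a form that hands the visitor's
query to the outside service. The sketch below is generic HTML, and the action address
is a placeholder, not any real service's query URL:

    <!-- Custom search page that autolinks to an outside service.
         The action address is a placeholder. -->
    <HTML><BODY>
    <H2>Agency search page</H2>
    <FORM METHOD="GET" ACTION="http://search.example.com/cgi-bin/query">
    Search terms: <INPUT NAME="q" SIZE="40">
    <INPUT TYPE="SUBMIT" VALUE="Search">
    </FORM>
    </BODY></HTML>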
If you're concerned about ill-behaved bots visiting your server, look up information
about bot exclusion at http://info.webcrawler.com/mak/projects/robots/norobots.html.
You can create and store a file called ROBOTS.TXT that most visiting bots will
retrieve first. Commands in that file define your bot access policy.
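A minimal exclusion file, with example directory names, looks like this:

    # Sample ROBOTS.TXT exclusion file. Directory names are examples only.
    # A well-behaved bot retrieves this file before anything else.
    User-agent: *
    Disallow: /private/
    Disallow: /cgi-bin/

On the bot side, Python's standard urllib.robotparser module can read such a file and
report whether a given path is fair game.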
There are other robot services on the Internet, including bots that log onto Internet
Relay Chat servers, keep a channel open and respond to input from other users by sending
text strings or server commands. The Eggdrop bot created by Robey Pointer is one popular
variant. Information is available at http://www.gobills.com/eggdrop.README.txt.
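The skeleton of such a chat bot is surprisingly short. This Python sketch, with
placeholder server, nickname and channel names, registers with an IRC server, keeps a
channel open, answers the server's PING checks and replies to a keyword:

    # Skeleton IRC bot: join a channel, stay connected, answer a keyword.
    # Server, nickname and channel names are placeholders.
    import socket

    HOST, PORT, NICK, CHANNEL = "irc.example.net", 6667, "demobot", "#demo"

    sock = socket.create_connection((HOST, PORT))
    sock.sendall(f"NICK {NICK}\r\nUSER {NICK} 0 * :demo bot\r\n".encode())
    sock.sendall(f"JOIN {CHANNEL}\r\n".encode())

    buffer = b""
    while True:
        data = sock.recv(4096)
        if not data:
            break
        buffer += data
        while b"\r\n" in buffer:
            line, buffer = buffer.split(b"\r\n", 1)
            text = line.decode("latin-1")
            if text.startswith("PING"):           # keep the connection alive
                sock.sendall(("PONG" + text[4:] + "\r\n").encode())
            elif "!hello" in text:                # respond to input from users
                sock.sendall(f"PRIVMSG {CHANNEL} :Hello from a bot.\r\n".encode())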