1.4 Worms, Spiders, and Knowbots
Web worms, spiders, robots, and knowbots are automated tools that crawl around the Web looking for information and reporting their findings. Many of the so-called Internet Starting Points use robots to scour the Web for new information. These automatons can be used either to search for information about a particular topic of interest or to build up databases for subsequent searching by others. (See Section 1.9, Internet Starting Points, for information on searching the net.)
A worm is a program that moves from one site to another. The generic term "worm" has nothing to do with the Web; it simply refers to a program that seeks to replicate itself on multiple hosts. Worms are not necessarily good. The "Internet Worm" of 1988 caused a massive breakdown of thousands of systems on the Internet. But that's another story.(8)
A knowbot is a program or agent that, like a worm, travels from site to site. However, it has a flavor of artificial intelligence in that it usually follows knowledge-based rules. Another term for a knowbot might be an autonomous agent. Clear distinctions between these terms are currently not meaningful.(9) In the context of this section, finding information, we'll look below at one particular knowbot and one worm. First, however, we'll mention spiders.
Spiders, as their name implies, crawl around the Web, doing things. They can find information to build large textual databases; the WebCrawler does this. They can also maintain large Webs or collections of Webs; this is the function of the MOMspider.
Following is Brian Pinkerton's description of one Web worm, the WebCrawler:
The WebCrawler is a web robot, and is the first product of an experiment in information discovery on the Web. I wrote it because I could never find information when I wanted it, and because I don't have time to follow endless links.
The WebCrawler has three different functions:
It builds indices for documents it finds on the Web. The broad, content-based index is available for searching.
It acts as an agent, searching for documents of particular interest to the user. In doing so, it draws upon the knowledge accumulated in its index, and some simple strategies to bias the search toward interesting material. In this sense, it is a lot like the Fish search, although it operates network-wide.
It is a testbed for experimenting with Web search strategies. It's easy to plug in a new search strategy, or ask queries from afar, using a special protocol.
In addition, the WebCrawler can answer some fun queries. Because it models the world using a flexible, OO (Ed. Object Oriented) approach, the actual graph structure of the Web is available for queries. This allows you, for instance, to find out which sites reference a particular page. It also lets me construct the Web Top 25 List, the list of the most frequently referenced documents that the WebCrawler has found.
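(Ed. The two graph queries mentioned above can be illustrated with a minimal sketch. This is not the WebCrawler's code, which is written in C and Objective-C; the link data, the example URLs, and the function names below are all hypothetical and chosen only for illustration.)

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit

# Hypothetical link data: each visited document maps to its outbound links.
LINKS = {
    "http://a.example/index.html": ["http://b.example/home.html",
                                    "http://c.example/faq.html"],
    "http://b.example/home.html": ["http://c.example/faq.html"],
    "http://d.example/links.html": ["http://c.example/faq.html",
                                    "http://b.example/home.html"],
}

def build_backlinks(links):
    """Invert the link graph: target document -> set of documents citing it."""
    backlinks = defaultdict(set)
    for source, targets in links.items():
        for target in targets:
            backlinks[target].add(source)
    return backlinks

def referring_sites(backlinks, page):
    """Return the sorted host names of all documents that link to `page`."""
    return sorted({urlsplit(source).netloc for source in backlinks.get(page, ())})

def most_referenced(backlinks, n=25):
    """Return up to `n` (document, citation count) pairs, most cited first."""
    counts = Counter({page: len(sources) for page, sources in backlinks.items()})
    return counts.most_common(n)

backlinks = build_backlinks(LINKS)
print(referring_sites(backlinks, "http://c.example/faq.html"))
print(most_referenced(backlinks, n=2))
```

Counting distinct citing documents, rather than raw link occurrences, is one possible reading of "most frequently referenced"; the source does not say which measure the Top 25 List uses.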
How it Works
The WebCrawler works by starting with a known set of documents (even if it is just one), identifying new places to explore by looking at the outbound links from that document, and then visiting those links.
It is composed of three essential pieces:
The search engine directs the search. In a breadth-first search, it is responsible for identifying new places to visit by looking at the oldest unvisited links from documents in the database. In the directed, find-me-what-I-want strategy, the search engine directs the search by finding the most relevant places to visit next.
The database contains a list of all documents, both visited and unvisited, and an index on the content of visited documents. Each document points to a particular host, and, if visited, contains a list of pointers to other documents (links).
"Agents" retrieve documents. They use CERN's WWW library to retrieve a specific URL, then return that document to the database for indexing and storage. The WebCrawler typically runs with 5-10 agents at once.
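(Ed. The breadth-first strategy described above can be sketched in a few lines. This is an illustration only, not the WebCrawler's implementation: the toy in-memory "web", the fetch_links stand-in for an agent, and the max_documents limit are all assumptions made so the example runs without network access.)

```python
from collections import deque

# Hypothetical toy web: each URL maps to the outbound links in that document.
TOY_WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/", "http://d.example/"],
    "http://c.example/": ["http://a.example/"],
    "http://d.example/": [],
}

def fetch_links(url):
    """Stand-in for an agent: return the outbound links of one document."""
    return TOY_WEB.get(url, [])

def breadth_first_crawl(seeds, max_documents=100):
    """Visit documents oldest-unvisited-first, starting from a known set."""
    seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    database = {}                       # visited URL -> its outbound links
    frontier = deque(seeds)             # unvisited links, oldest first
    queued = set(seeds)                 # everything ever placed on the frontier
    order = []                          # the sequence in which documents were visited
    while frontier and len(order) < max_documents:
        url = frontier.popleft()
        links = fetch_links(url)
        database[url] = links
        order.append(url)
        for link in links:
            if link not in queued:      # never queue the same document twice
                queued.add(link)
                frontier.append(link)
    return order, database

order, database = breadth_first_crawl(["http://a.example/"])
print(order)
```

Replacing the first-in, first-out queue with a priority queue ordered by some relevance score would give the flavor of the directed, find-me-what-I-want strategy, although the source does not describe how relevance is computed.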
Being a Good Citizen
The WebCrawler tries hard to be a good citizen. Its main approach involves the order in which it searches the Web. Some web robots have been known to operate in a depth-first fashion, retrieving file after file from a single site. This kind of traversal is bad. The WebCrawler searches the Web in a breadth-first fashion. When building its index of the Web, the WebCrawler will access a site at most a few times a day.
When the WebCrawler is searching for something more specific, its search may narrow to a relevant set of documents at a particular site. When this happens, the WebCrawler limits its search speed to one document per minute and sets a ceiling on the number of documents that can be retrieved from the host before query results are reported to the user. The WebCrawler also adopts several of the techniques mentioned in the Guidelines for Robot Writers.
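(Ed. A per-host politeness rule of the kind described above might be sketched as follows. The one-document-per-minute interval comes from the description; the ceiling of 20 documents, the class name, and the method names are assumptions for illustration, since the source gives no figure for the ceiling and no details of the mechanism.)

```python
from urllib.parse import urlsplit

class PolitenessPolicy:
    """Track fetches per host and decide when the next fetch is acceptable."""

    def __init__(self, min_interval=60.0, max_per_host=20):
        self.min_interval = min_interval  # seconds between fetches from one host
        self.max_per_host = max_per_host  # assumed ceiling; not given in the source
        self._last_fetch = {}             # host -> time of the most recent fetch
        self._fetched = {}                # host -> number of documents fetched

    def seconds_until_allowed(self, url, now):
        """Return the wait in seconds (0.0 means fetch now), or None if the
        host's document ceiling has been reached."""
        host = urlsplit(url).netloc
        if self._fetched.get(host, 0) >= self.max_per_host:
            return None
        last = self._last_fetch.get(host)
        if last is None:
            return 0.0
        return max(0.0, self.min_interval - (now - last))

    def record_fetch(self, url, now):
        """Note that a document was retrieved from the URL's host at time `now`."""
        host = urlsplit(url).netloc
        self._last_fetch[host] = now
        self._fetched[host] = self._fetched.get(host, 0) + 1

# Times are passed in explicitly so the behavior is easy to follow and test;
# a real crawler would supply time.monotonic() instead.
policy = PolitenessPolicy(min_interval=60.0, max_per_host=2)
policy.record_fetch("http://a.example/x", now=0.0)
print(policy.seconds_until_allowed("http://a.example/y", now=15.0))  # 45.0
print(policy.seconds_until_allowed("http://b.example/", now=15.0))   # 0.0
```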
Implementation Status
The WebCrawler is written in C and Objective-C for NEXTSTEP. It uses the WWW library from CERN, with several changes to make automation easier. Whenever I feel comfortable about unleashing the WebCrawler, I'll make the source code available!
bp@cs.washington.edu
Brian Pinkerton
MOMspider, available for free from the University of California, Irvine, is used to help maintain Webs. It is written in Perl and runs on most UNIX systems. MOMspider was written by Roy T. Fielding, and his paper, "Maintaining Distributed Hypertext Infostructures: Welcome to MOMspider's Web,"(10) was presented at the WWW94 conference in Geneva. From Fielding's paper:
MOMspider gets its instructions by reading a text file that contains a list of options and tasks to be performed (an example instruction file is provided in Appendix A). Each task is intended to describe a specific infostructure so that it can be encompassed by the traversal process. A task instruction includes the traversal type, an infostructure name (for later reference), the "Top" URL at which to start traversing, the location for placing the indexed output, an e-mail address that corresponds to the owner of that infostructure, and a set of options that determine what identified maintenance issues justify sending an e-mail message.
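(Ed. To make the shape of a task instruction concrete, here is a minimal reader for a file of this kind. It is not MOMspider's own parser, which is written in Perl. It assumes one directive per line, "#" for comments, and a task that opens with a "<Type" line and closes with a ">" line; the function name is hypothetical, and the sample is an abbreviated excerpt of the Appendix A file.)

```python
SAMPLE = """\
# MOMspider-0.1a Instruction File
SitesCheck 7
<Tree
    Name MOMspider-WWW94
    TopURL http://www.ics.uci.edu/WebSoft/MOMspider/WWW94/paper.html
    EmailAddress fielding@ics.uci.edu
    EmailBroken
>
"""

def parse_instructions(text):
    """Split an instruction file into global options and a list of tasks."""
    options = {}    # directives that appear outside any task
    tasks = []      # one dict per task, in file order
    current = None  # the task being read, or None when between tasks
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                             # skip blanks and comments
        if line.startswith("<"):
            current = {"Type": line[1:].strip()} # traversal type: Site, Tree, Owner
        elif line == ">":
            if current is not None:
                tasks.append(current)
            current = None
        else:
            key, _, value = line.partition(" ")  # directive name, then its argument
            target = current if current is not None else options
            target[key] = value.strip()          # flag-style directives get ""
    return options, tasks

options, tasks = parse_instructions(SAMPLE)
print(options)
print(tasks[0]["Type"], tasks[0]["TopURL"])
```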
Appendix A

# MOMspider-0.1a Instruction File
SystemAvoid /usr/local/httpd/admin/avoid.mom
SystemSites /usr/local/httpd/admin/sites.mom
AvoidFile /usr/grads/fielding/test/.momspider-avoid
SitesFile /usr/grads/fielding/test/.momspider-sites
SitesCheck 7
<Site
    Name ICS
    TopURL http://www.ics.uci.edu/ICShome.html
    IndexURL http://www.ics.uci.edu/Admin/ICS.html
    IndexFile /usr/local/httpd/documentroot/MOM/ICS.html
    IndexTitle MOMspider Index for All of ICS
    EmailAddress www@ics.uci.edu
    EmailBroken
    EmailExpired 2
>
<Tree
    Name MOMspider-WWW94
    TopURL http://www.ics.uci.edu/WebSoft/MOMspider/WWW94/paper.html
    IndexURL http://www.ics.uci.edu/Admin/MOMspider-WWW94.html
    IndexFile /usr/local/httpd/documentroot/Admin/MOMspider-WWW94.html
    IndexTitle MOMspider Index for Roy's WWW94 Paper
    EmailAddress fielding@ics.uci.edu
    EmailBroken
>
<Owner
    Name RTF
    TopURL http://www.ics.uci.edu/~fielding/hotlist.html
    IndexURL http://www.ics.uci.edu/~fielding/MOM/RTF.html
    IndexFile /usr/grads/fielding/public_html/MOM/RTF.html
    EmailAddress fielding@ics.uci.edu
    EmailBroken
    EmailChanged 3
    EmailExpired 7
>

Finally, rest assured that not all bots and spiders must be run from expensive workstations. A really cool product called Surfbot(11) from Surflogic LLC runs just fine on Win95 PCs. Surfbot lets you configure your own private "agents" to traverse either known Internet Starting Points or your own set of bookmarks. Its wizard-style setup configures the agents and can produce a variety of reports, and it is simple to use.
Surfbot control screen for configuring one particular agent.
One particularly nice feature is its ability to schedule the times for searching. It makes the modem connection for you and hangs up when done. You can make your agents fire up in the middle of the night so the results will be waiting for you in the morning!
| © Prentice-Hall, Inc. A Simon & Schuster Company Upper Saddle River, New Jersey 07458 |