It is extremely easy to make Wget wander aimlessly around a web site, sucking all the available data in the process. `wget -r site', and you're set. Great? Not for the server admin.
While Wget is retrieving static pages, there's not much of a problem. But for Wget, there is no real difference between a static page and the most demanding CGI. For instance, a site I know has a section handled by an, uh, bitchin' CGI script that converts all the Info files to HTML. The script can and does bring the machine to its knees without providing anything useful to the downloader.
For cases such as this, various robot exclusion schemes have been devised as a means for server administrators and document authors to protect chosen portions of their sites from the wandering of robots.
The more popular mechanism is the Robots Exclusion Standard, or RES, written by Martijn Koster et al. in 1994. It specifies the format of a text file containing directives that instruct the robots which URL paths to avoid. To be found by the robots, the specifications must be placed in `/robots.txt' in the server root, which the robots are supposed to download and parse.
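For illustration, a `/robots.txt' file might look like this (the paths shown are purely hypothetical):

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/

The `User-agent' line names the robot the rules apply to (`*' meaning all robots), and each `Disallow' line gives a URL path prefix that compliant robots are asked not to retrieve.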
Wget supports RES when downloading recursively. So, when you issue:
wget -r http://www.server.com/
First, the index of `www.server.com' will be downloaded. If Wget finds that it wants to download more documents from that server, it will request `http://www.server.com/robots.txt' and, if found, use it to restrict further downloads. `robots.txt' is loaded only once per server.
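If you are curious which paths a server excludes, you can fetch its `robots.txt' yourself and print it to standard output, for example (the host name is just a placeholder):

wget -q -O - http://www.server.com/robots.txt

Here `-q' suppresses Wget's progress output and `-O -' writes the retrieved document to standard output instead of a file.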
Until version 1.8, Wget supported the first version of the standard, written by Martijn Koster in 1994 and available at http://www.robotstxt.org/wc/norobots.html. As of version 1.8, Wget has supported the additional directives specified in the internet draft `<draft-koster-robots-00.txt>' titled "A Method for Web Robots Control". The draft, which as far as I know never made it to an RFC, is available at http://www.robotstxt.org/wc/norobots-rfc.txt.
This manual no longer includes the text of the Robot Exclusion Standard.
The second, less known mechanism enables the author of an individual document to specify whether they want the links from the file to be followed by a robot. This is achieved using the META tag, like this:
<meta name="robots" content="nofollow">
This is explained in some detail at http://www.robotstxt.org/wc/meta-user.html. Wget supports this method of robot exclusion in addition to the usual `/robots.txt' exclusion.
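As a sketch of where the tag belongs, a document that asks robots neither to index it nor to follow its links might contain something like the following (the title is only an example):

<html>
<head>
<title>Example page</title>
<meta name="robots" content="noindex, nofollow">
</head>
...
</html>

The `content' attribute can combine several directives, such as `noindex' and `nofollow', separated by commas.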