When retrieving recursively, you do not want to fetch loads of unnecessary data. Most of the time users know exactly what they want to download, and want Wget to follow only specific links.
For example, if you wish to download the music archive from `fly.srk.fer.hr', you will not want to download all the home pages that happen to be referenced by an obscure part of the archive.
Wget possesses several mechanisms that allow you to fine-tune which links it will follow.
Wget's recursive retrieval normally refuses to visit hosts other than the one you specified on the command line. This is a reasonable default; without it, every retrieval would have the potential to turn your Wget into a small version of Google.
However, visiting different hosts, or host spanning, is sometimes a useful option. Maybe the images are served from a different server. Maybe you're mirroring a site that consists of pages interlinked between three servers. Maybe the server has two equivalent names, and the HTML pages refer to both interchangeably.
wget -rH -Dserver.com http://www.server.com/
You can specify more than one address by separating them with a comma, e.g. `-Ddomain1.com,domain2.com'.
wget -rH -Dfoo.edu --exclude-domains sunsite.foo.edu \
    http://www.foo.edu/
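The host checks implied by the two commands above can be sketched in plain shell. This is only an illustration, not Wget's actual code: the function names `in_domain_list' and `host_allowed' are hypothetical, and the sketch assumes that a host belongs to a domain when it equals the domain or ends in a dot followed by the domain.

```shell
# Hypothetical helper: succeed if host $1 belongs to any domain in the
# comma-separated list $2 (the form accepted by -D and --exclude-domains).
# Assumption: "belongs" means equal to the domain or ending in ".domain".
in_domain_list() {
    host=$1
    old_ifs=$IFS
    IFS=,
    for domain in $2; do
        case "$host" in
            "$domain"|*."$domain")
                IFS=$old_ifs
                return 0
                ;;
        esac
    done
    IFS=$old_ifs
    return 1
}

# Hypothetical helper: $1 = host, $2 = accepted domains, $3 = excluded domains.
# A host is visited only if it is accepted and not excluded.
host_allowed() {
    in_domain_list "$1" "$2" || return 1
    if in_domain_list "$1" "$3"; then
        return 1
    fi
    return 0
}
```

With these definitions, `host_allowed www.foo.edu foo.edu sunsite.foo.edu' succeeds, while the same call for `sunsite.foo.edu' fails because the exclusion wins.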
When downloading material from the web, you will often want to restrict the retrieval to only certain file types. For example, if you are interested in downloading GIFs, you will not be overjoyed to get loads of PostScript documents, and vice versa.
Wget offers two options to deal with this problem. Each option description lists a short name, a long name, and the equivalent command in `.wgetrc'.
The `-A' and `-R' options may be combined to achieve even better fine-tuning of which files to retrieve. E.g. `wget -A "*zelazny*" -R .ps' will download all the files having `zelazny' as a part of their name, but not the PostScript files.
Note that these two options do not affect the downloading of HTML files; Wget must load all the HTML documents to know where to go at all--recursive retrieval would make no sense otherwise.
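How the accept list, the reject list, and the HTML exception fit together in the `zelazny' example can be sketched as a small shell function. It is an illustration of the rules described above rather than Wget's implementation; the name `would_retrieve' is hypothetical, the patterns are hard-coded from the example, and treating `.html' and `.htm' as the HTML suffixes is an assumption.

```shell
# Hypothetical sketch of: wget -A "*zelazny*" -R .ps
# Succeeds if a file with the given name would be retrieved.
would_retrieve() {
    name=$1
    # HTML files are always loaded so that their links can be followed
    # (assumption: recognised here by the .html/.htm suffix only).
    case "$name" in
        *.html|*.htm) return 0 ;;
    esac
    # Reject list: -R .ps excludes files ending in .ps
    case "$name" in
        *.ps) return 1 ;;
    esac
    # Accept list: -A "*zelazny*" keeps names containing zelazny
    case "$name" in
        *zelazny*) return 0 ;;
    esac
    # Anything not matched by the accept list is skipped.
    return 1
}
```

For instance, `would_retrieve zelazny-amber.txt' succeeds, whereas `would_retrieve zelazny-amber.ps' fails.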
Regardless of other link-following facilities, it is often useful to restrict which files are retrieved based on the directories they reside in. There can be many reasons for this--the home pages may be organized in a reasonable directory structure; or some directories may contain useless information, e.g. `/cgi-bin' or `/dev' directories.
Wget offers three different options to deal with this requirement. Each option description lists a short name, a long name, and the equivalent command in `.wgetrc'.
wget -I /people,/cgi-bin http://host/people/bozo/
wget -r --no-parent http://somehost/~luzer/my-archive/
You may rest assured that none of the references to `/~his-girls-homepage/' or `/~luzer/all-my-mpegs/' will be followed. Only the archive you are interested in will be downloaded. Essentially, `--no-parent' is similar to `-I/~luzer/my-archive', only it handles redirections in a more intelligent fashion.
When `-L' is turned on, only relative links are followed. Relative links are here defined as those that do not refer to the web server root. For example, these links are relative:
<a href="foo.gif">
<a href="foo/bar.gif">
<a href="../foo/bar.gif">
These links are not relative:
<a href="/foo.gif">
<a href="/foo/bar.gif">
<a href="http://www.server.com/foo/bar.gif">
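The distinction drawn by these two groups of examples can be expressed as a short shell test. This is a minimal sketch under the definition given above, not the check Wget itself performs; the function name `is_relative_link' is hypothetical, and detecting absolute URLs by looking for `://' is a simplification.

```shell
# Hypothetical helper: succeed if the href value in $1 is "relative" in the
# sense used here, i.e. it does not refer to the web server root.
is_relative_link() {
    case "$1" in
        /*)     return 1 ;;   # starts at the server root
        *://*)  return 1 ;;   # contains a scheme separator (simplified test)
        *)      return 0 ;;   # everything else is treated as relative
    esac
}
```

Called on the examples above, `is_relative_link ../foo/bar.gif' succeeds and `is_relative_link /foo/bar.gif' fails.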
Using this option guarantees that recursive retrieval will not span hosts, even without `-H'. In simple cases it also allows downloads to "just work" without having to convert links.
This option is probably not very useful and might be removed in a future release.
FTP links are subject to somewhat special rules, and necessarily so. FTP links in HTML documents are often included for purposes of reference, and it is often inconvenient to download them by default.
To have FTP links followed from HTML documents, you need to specify the `--follow-ftp' option. Having done that, FTP links will span hosts regardless of the `-H' setting. This is logical, as FTP links rarely point to the same host where the HTTP server resides. For similar reasons, the `-L' option has no effect on such downloads. On the other hand, domain acceptance (`-D') and suffix rules (`-A' and `-R') apply normally.
Also note that FTP directories reached by following such links will not themselves be retrieved recursively.
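As an illustration of the options discussed above, the following sketch wraps a recursive retrieval that also follows FTP links, while still limiting hosts with `-D'. The function name `fetch_with_ftp_links' and its two parameters are hypothetical; only the Wget options themselves come from this chapter.

```shell
# Hypothetical wrapper: recursive retrieval that also follows FTP links
# found in HTML documents, restricted to hosts in one domain.
#   $1 = domain passed to -D (placeholder chosen for this sketch)
#   $2 = start URL
fetch_with_ftp_links() {
    wget -r --follow-ftp -D"$1" "$2"
}
```

It could be invoked as `fetch_with_ftp_links server.com http://www.server.com/'. Because `-D' still applies to FTP links, FTP hosts outside `server.com' would be skipped.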