GNU Wget Manual - Host Checking


Host Checking

The drawback of following only relative links is that people often mix them with absolute links to the very same host, and even to the very same page. In this mode (which is the default mode for following links), all URLs that refer to the same host will be retrieved.

The problem with this option is host and domain aliases. By looking at the names alone, Wget has no way of knowing that `regoc.srce.hr' and `www.srce.hr' are the same host, or that `fly.cc.fer.hr' is the same as `fly.cc.etf.hr'. Therefore, whenever an absolute link is encountered, the host is looked up in DNS with gethostbyname to check whether it is in fact the same host. Although the results of gethostbyname are cached, the lookups still cause a great slowdown, e.g. when dealing with large indices of home pages on different hosts (because each of the hosts must be DNS-resolved to see whether it might be an alias of the starting host).
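The alias check described above can be illustrated with a minimal Python sketch. It is not Wget's actual code (Wget is written in C); the function names `resolve' and `same_host' are hypothetical, and the rule "two names are the same host if they share at least one address" is an assumption made for illustration.

```python
import socket
from functools import lru_cache


@lru_cache(maxsize=None)
def resolve(host):
    """Return the IPv4 addresses of host as a frozenset (cached per name).

    An empty set is returned when the lookup fails.
    """
    try:
        _name, _aliases, addresses = socket.gethostbyname_ex(host)
    except OSError:
        return frozenset()
    return frozenset(addresses)


def same_host(a, b, resolver=resolve):
    """Treat two hostnames as one host if they match or share an address."""
    if a.lower() == b.lower():
        return True  # identical names need no DNS lookup
    return bool(resolver(a) & resolver(b))
```

Even with the cache, every previously unseen hostname costs one DNS lookup, which is where the slowdown on large link indices comes from. The `resolver' parameter exists only so that the comparison can be tried with a lookup table instead of live DNS.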

To avoid the overhead you may use `-nh', which turns off DNS resolution and makes Wget compare hosts literally. This makes things run much faster, but also much less reliably (e.g. `www.srce.hr' and `regoc.srce.hr' will be flagged as different hosts).
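A literal comparison of this kind might look like the following Python sketch. The function name `same_host_literal' is hypothetical and the code only illustrates the idea; it does not reproduce Wget's implementation.

```python
from urllib.parse import urlsplit


def same_host_literal(url_a, url_b):
    """Compare the hostnames of two URLs as strings, with no DNS lookup.

    urlsplit() lowercases the hostname, so the comparison ignores case.
    """
    host_a = urlsplit(url_a).hostname
    host_b = urlsplit(url_b).hostname
    return host_a is not None and host_a == host_b
```

With this rule, `http://www.srce.hr/' and `http://regoc.srce.hr/' compare as different hosts even though they are aliases, which is exactly the loss of reliability mentioned above.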

Note that modern HTTP servers allow one IP address to host several virtual servers, each having its own directory hierarchy. Such "servers" are distinguished by their hostnames (all of which point to the same IP address); for this to work, a client must send a Host header, which is what Wget does. However, in that case Wget must not try to divine a host's "real" address, nor try to use the same hostname for each access, i.e. `-nh' must be turned on.
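To see why the hostname matters here, consider the following Python sketch, which builds the text of a request by hand. The helper `build_get_request' and the names `a.example' and `b.example' are hypothetical; the sketch assumes both names resolve to the same IP address and does not show the request Wget actually sends.

```python
def build_get_request(hostname, path="/"):
    """Build a minimal HTTP GET request that names the virtual server."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {hostname}",  # selects the virtual server on a shared IP
        "Connection: close",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


# Both requests would go to the same IP address, yet they differ in the
# Host header, so the server can answer from different directory trees.
request_a = build_get_request("a.example")
request_b = build_get_request("b.example")
```

If a client replaced `b.example' with `a.example' because both resolve to the same address, it would send the first request twice and never reach the second virtual server.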

In other words, the `-nh' option must be used to enable retrieval from virtual servers distinguished by their hostnames. As the number of such server setups grows, the behavior of `-nh' may become the default in the future.
