The drawback of following only relative links is that people often mix them with absolute links to the very same host, and even to the very same page. In this mode (which is the default mode for following links), all URLs that refer to the same host will be retrieved.
The problem with this option is that hosts and domains can have aliases.
Without further checks, Wget has no way to know that `regoc.srce.hr' and
`www.srce.hr' are the same host, or that `fly.cc.fer.hr' is the same as
`fly.cc.etf.hr'. Whenever an absolute link is encountered, the host is
looked up in DNS with gethostbyname to check whether it is in fact the
same host. Although the results of gethostbyname are cached, the lookups
still cause a great slowdown, e.g. when dealing with large indices of
home pages on different hosts (because each of the hosts must be
DNS-resolved to see whether it might be an alias of the starting host).
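The alias check described above can be illustrated with a minimal Python sketch. This is not Wget's implementation (Wget is written in C and calls gethostbyname directly); the function names `resolve' and `same_host' are hypothetical, and treating two hosts as aliases when their address sets overlap is an assumption made for the example.

```python
import socket
from functools import lru_cache


@lru_cache(maxsize=None)
def resolve(host):
    """Return the IPv4 addresses of host as a frozenset (cached per host).

    An empty set is returned when the lookup fails.
    """
    try:
        _name, _aliases, addresses = socket.gethostbyname_ex(host)
    except OSError:  # covers socket.gaierror and socket.herror
        return frozenset()
    return frozenset(addresses)


def same_host(a, b, resolver=resolve):
    """Guess whether hostnames a and b refer to the same machine.

    Assumption for this sketch: two names are aliases if they are
    textually equal (ignoring case) or share at least one address.
    """
    if a.lower() == b.lower():
        return True
    return bool(resolver(a) & resolver(b))
```

Even with the cache, every previously unseen hostname costs one DNS lookup, which is where the slowdown on large link indices comes from. The `resolver' parameter exists only so the comparison logic can be exercised without network access.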
To avoid the overhead you may use `-nh', which will turn off DNS-resolving and make Wget compare hosts literally. This will make things run much faster, but also much less reliably (e.g. `www.srce.hr' and `regoc.srce.hr' will be flagged as different hosts).
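A literal comparison of the kind `-nh' selects might look like the following Python sketch. The function name `same_host_literal' is hypothetical, and lowercasing the hostnames before comparing is an assumption of the example (hostnames are case-insensitive), not something the text states about Wget.

```python
from urllib.parse import urlsplit


def same_host_literal(url_a, url_b):
    """Compare the hostnames of two URLs as strings, with no DNS lookup.

    urlsplit(...).hostname is already lowercased, or None if the URL
    carries no host part.
    """
    host_a = urlsplit(url_a).hostname
    host_b = urlsplit(url_b).hostname
    return host_a is not None and host_a == host_b
```

With this comparison, `http://www.srce.hr/' and `http://regoc.srce.hr/' count as different hosts even though they may name the same machine, which is the loss of reliability mentioned above.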
Note that modern HTTP servers allow one IP address to host several
virtual servers, each having its own directory hierarchy. Such
"servers" are distinguished by their hostnames (all of which point to
the same IP address); for this to work, a client must send a Host
header, which is what Wget does. However, in that case Wget must
not try to divine a host's "real" address, nor try to use the same
hostname for each access, i.e. `-nh' must be turned on.
In other words, the `-nh' option must be used to enable retrieval from virtual servers distinguished by their hostnames. As the number of such server setups grows, the behavior of `-nh' may become the default in the future.
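To see why the hostname must be preserved, consider how the Host header selects a virtual server. The Python sketch below builds a bare request by hand; the helper name `build_request' and the choice of an HTTP/1.0 request line are assumptions for illustration, not a description of the exact bytes Wget sends.

```python
def build_request(host, path="/"):
    """Build a minimal GET request that names the virtual server.

    The Host header carries the hostname exactly as it appeared in
    the URL; the server uses it to pick a directory hierarchy.
    """
    lines = [
        f"GET {path} HTTP/1.0",
        f"Host: {host}",
        "",  # blank line terminates the header block
        "",
    ]
    return "\r\n".join(lines).encode("ascii")
```

Two hostnames that resolve to the same IP address yield different requests here, so replacing one name with the other (as alias detection would do) could fetch pages from the wrong virtual server.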