

Time-Stamping

One of the most important aspects of mirroring information from the Internet is updating your archives.

Downloading the whole archive again and again just to replace a few changed files is expensive, in wasted bandwidth and money as well as in the time needed for the update. This is why all the mirroring tools offer the option of incremental updating.

Such an updating mechanism means that the remote server is scanned in search of new files. Only those new files will be downloaded in the place of the old ones.

A file is considered new if one of these two conditions is met:

  1. A file of that name does not already exist locally.
  2. A file of that name does exist, but the remote file was modified more recently than the local file.

To implement this, the program needs to be aware of the time of last modification of both local and remote files. We call this information the time-stamp of a file.

Time-stamping in GNU Wget is turned on with the `--timestamping' (`-N') option, or through the timestamping = on directive in `.wgetrc'. With this option, for each file it intends to download, Wget will check whether a local file of the same name exists. If it does, and the remote file is not newer, Wget will not download it.

If the local file does not exist, or the sizes of the files do not match, Wget will download the remote file no matter what the time-stamps say.
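The decision rule described above can be sketched in a few lines of Python. This is only an illustration of the rule, not Wget's actual code; the function name `should_download` and its parameters are hypothetical, and the remote time-stamp and size are assumed to have been obtained already.

```python
import os
from typing import Optional


def should_download(local_path: str,
                    remote_mtime: float,
                    remote_size: Optional[int] = None) -> bool:
    """Return True if the remote file should be fetched (hypothetical helper)."""
    # Condition 1: a file of that name does not exist locally.
    if not os.path.exists(local_path):
        return True
    st = os.stat(local_path)
    # A size mismatch forces a download no matter what the time-stamps say.
    if remote_size is not None and st.st_size != remote_size:
        return True
    # Condition 2: the remote file was modified more recently than the local one.
    return remote_mtime > st.st_mtime
```

Both time-stamps are assumed to be seconds since the epoch, so a plain numeric comparison is enough.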

Time-Stamping Usage

The usage of time-stamping is simple. Say you would like to download a file so that it keeps its date of modification.

wget -S http://www.gnu.ai.mit.edu/

A simple ls -l shows that the time stamp on the local file equals the value of the Last-Modified header, as returned by the server. As you can see, the time-stamping info is preserved locally, even without `-N' (at least for HTTP).

Several days later, you would like Wget to check if the remote file has changed, and download it if it has.

wget -N http://www.gnu.ai.mit.edu/

Wget will ask the server for the last-modified date. If the local file has the same timestamp as the server, or a newer one, the remote file will not be re-fetched. However, if the remote file is more recent, Wget will proceed to fetch it.

The same goes for FTP. For example:

wget "ftp://ftp.ifi.uio.no/pub/emacs/gnus/*"

(The quotes around that URL are to prevent the shell from trying to interpret the `*'.)

After download, a local directory listing will show that the timestamps match those on the remote server. Reissuing the command with `-N' will make Wget re-fetch only the files that have been modified since the last download.

To mirror the GNU archive every week, you would run a command like the following once a week:

wget --timestamping -r ftp://ftp.gnu.org/pub/gnu/

Note that time-stamping will only work for files for which the server gives a timestamp. For HTTP, this depends on getting a Last-Modified header. For FTP, this depends on getting a directory listing with dates in a format that Wget can parse (see section FTP Time-Stamping Internals).

HTTP Time-Stamping Internals

Time-stamping in HTTP is implemented by checking the Last-Modified header. If you wish to retrieve the file `foo.html' through HTTP, Wget will check whether `foo.html' exists locally. If it doesn't, `foo.html' will be retrieved unconditionally.

If the file does exist locally, Wget will first check its local time-stamp (similar to the way ls -l checks it), and then send a HEAD request to the remote server, demanding the information on the remote file.

The Last-Modified header is examined to find which file was modified more recently (which makes it "newer"). If the remote file is newer, it will be downloaded; if it is older, Wget will give up.(2)
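As a rough illustration of this comparison (not Wget's own implementation), the following Python sketch parses a Last-Modified value and compares it with the local file's modification time. The function name `remote_is_newer` is hypothetical, and the header value is assumed to have been read from the server's response already.

```python
import os
from email.utils import parsedate_to_datetime


def remote_is_newer(local_path: str, last_modified: str) -> bool:
    """Compare a Last-Modified header value with the local time-stamp."""
    # HTTP dates such as "Tue, 15 Nov 1994 12:45:26 GMT" parse to an
    # aware datetime; timestamp() converts it to seconds since the epoch.
    remote_ts = parsedate_to_datetime(last_modified).timestamp()
    # os.path.getmtime reports the same information that ls -l displays.
    return remote_ts > os.path.getmtime(local_path)
```

An equal time-stamp counts as "not newer", so the file would not be fetched again.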

When `--backup-converted' (`-K') is specified in conjunction with `-N', server file `X' is compared to local file `X.orig', if extant, rather than being compared to local file `X', which will always differ if it's been converted by `--convert-links' (`-k').
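The choice of which local file to compare against can be pictured as follows. This is a minimal sketch of the rule stated above; `comparison_file` and its `backup_converted` parameter are hypothetical names standing in for the effect of `-K'.

```python
import os


def comparison_file(local_path: str, backup_converted: bool) -> str:
    """Pick the local file whose time-stamp is compared with the server's."""
    orig = local_path + ".orig"
    # With -K in effect, the unconverted backup is the meaningful
    # counterpart of the server file, provided it exists.
    if backup_converted and os.path.exists(orig):
        return orig
    return local_path
```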

Arguably, HTTP time-stamping should be implemented using the If-Modified-Since request header.
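To show what such a conditional request might look like, here is a sketch using Python's standard urllib rather than anything from Wget itself. The helper name `conditional_request` is hypothetical; it only builds the request and sends nothing over the network.

```python
import os
import urllib.request
from email.utils import formatdate


def conditional_request(url: str, local_path: str) -> urllib.request.Request:
    """Build a GET request that asks for the file only if it has changed."""
    req = urllib.request.Request(url)
    if os.path.exists(local_path):
        # Format the local time-stamp as an HTTP date in GMT.
        stamp = formatdate(os.path.getmtime(local_path), usegmt=True)
        req.add_header("If-Modified-Since", stamp)
    return req
```

A server that honors the header answers 304 Not Modified when the file has not changed since that date, so the separate HEAD request becomes unnecessary.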

FTP Time-Stamping Internals

In theory, FTP time-stamping works much the same as HTTP, only FTP has no headers; time-stamps must be ferreted out of directory listings.

If an FTP download is recursive or uses globbing, Wget will use the FTP LIST command to get a file listing for the directory containing the desired file(s). It will try to analyze the listing, treating it like Unix ls -l output, extracting the time-stamps. The rest is exactly the same as for HTTP. Note that when retrieving individual files from an FTP server without using globbing or recursion, listing files will not be downloaded (and thus files will not be time-stamped) unless `-N' is specified.
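The following Python sketch shows one way a line of Unix-style ls -l output could be turned into a file name and a time-stamp. It is not Wget's parser: the function name `parse_ls_line` is hypothetical, the nine-column layout is an assumption, and the server's time zone is ignored. Recent files are listed with a time instead of a year, so the year has to be inferred from the current date.

```python
from datetime import datetime
from typing import Optional, Tuple

MONTHS = {name: number for number, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}


def parse_ls_line(line: str, now: datetime) -> Optional[Tuple[str, datetime]]:
    """Extract (name, time-stamp) from one ls -l style line, or None."""
    # Assumed columns: mode, links, owner, group, size, month, day,
    # year-or-time, name (the name may contain spaces).
    fields = line.strip().split(None, 8)
    if len(fields) < 9 or fields[5] not in MONTHS:
        return None
    month, year_or_time, name = MONTHS[fields[5]], fields[7], fields[8]
    try:
        day = int(fields[6])
        if ":" in year_or_time:
            hour, minute = (int(part) for part in year_or_time.split(":"))
            stamp = datetime(now.year, month, day, hour, minute)
            # A date in the future must belong to the previous year.
            if stamp > now:
                stamp = stamp.replace(year=now.year - 1)
        else:
            stamp = datetime(int(year_or_time), month, day)
    except ValueError:
        return None
    return name, stamp
```

Lines that do not fit the layout, such as the leading "total" line, yield None and would simply be skipped.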

The assumption that every directory listing is a Unix-style listing may sound extremely constraining, but in practice it is not, as many non-Unix FTP servers use the Unixoid listing format because most (all?) of the clients understand it. Bear in mind that RFC 959 defines no standard way to get a file list, let alone the time-stamps. We can only hope that a future standard will define this.

Another non-standard solution is the use of the MDTM command, supported by some FTP servers (including the popular wu-ftpd), which returns the exact modification time of the specified file. Wget may support this command in the future.
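For servers that do support MDTM, interpreting the reply is straightforward. The sketch below assumes a reply of the form "213 YYYYMMDDHHMMSS" with the time given in UTC, which is the form later described in RFC 3659; the function name `parse_mdtm_reply` is hypothetical, and this is Python, not anything Wget does.

```python
from datetime import datetime, timezone


def parse_mdtm_reply(reply: str) -> datetime:
    """Turn an MDTM reply such as '213 19980615100045' into a datetime."""
    code, _, value = reply.strip().partition(" ")
    if code != "213":
        raise ValueError("unexpected MDTM reply: " + reply)
    # Some servers append fractional seconds; they are dropped here.
    value = value.split(".")[0]
    parsed = datetime.strptime(value, "%Y%m%d%H%M%S")
    # The time is assumed to be UTC, so mark it as such.
    return parsed.replace(tzinfo=timezone.utc)
```

With Python's ftplib, the reply string could be obtained from an open connection with `sendcmd("MDTM " + path)`.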

