Quick and dirty copies of websites with wget.

16 January 2018

Let's say there's a website that you want to make a local mirror of.  This means that you can refer to it offline, and you can make offline backups of it for archival.  Let's further state that you have access to some server someplace with enough disk space to hold the copy, and that you can start a task, disconnect, and let it run to completion some time later, with GNU screen for example.  Let's further state that you want the local copy of the site to not be broken when you load it in a browser; all the links should work, all the images should load, and so forth.  One of the quickest and easiest ways to do this is with the wget utility.

Here's how I do it:

drwho@leandra:(23) ~/archives $ wget --recursive --page-requisites \
    --convert-links --no-parent -e robots=off --random-wait -w 20

I left the shell prompt in place because the "(23)" there means "This is the twenty-fourth shell managed by GNU Screen in this session." (Screen starts counting at zero, like all sane things.)

wget is pretty self-explanatory: it's the name of the command.  The URL of the site you want to mirror goes at the very end of the command line.  The rest of the arguments could use a little explication, though.

  • --recursive - Download everything at the URL recursively.  By default, stop five (5) levels down.
  • --page-requisites - Download every file necessary to properly render and display the page.  If it's referenced in the HTML code, it'll be downloaded.  This includes JavaScript, CSS files, and images.
  • --convert-links - After an HTML page is fully downloaded, rewrite the links and references in the page to point to the right locations in the site download.
  • --no-parent - When downloading, don't climb above the URL specified.  This keeps things from getting out of control.
  • -e robots=off - Ignore the site's robots.txt file.
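  • --random-wait - Don't pause for exactly the same length of time between requests.  Instead, wait a random amount between 0.5 and 1.5 times the value given with -w (here, between 10 and 30 seconds).
  • -w 20 - Wait 20 seconds between HTTP requests.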
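
If you do this more than once, it can help to wrap the command in a small shell function.  What follows is only a sketch of one way to do that, not a polished tool.  The function name mirror_site, the MIRROR_RUN variable, and the example.com URL are all made up for illustration; the wget flags are the same ones listed above.  By default it only prints the command it would run, so you can check it before letting it loose on a site.

```shell
#!/bin/sh
# mirror_site URL [WAIT_SECONDS]
# Hypothetical wrapper around the wget command described above.
# Prints the command by default; set MIRROR_RUN=1 to actually run it.
mirror_site() {
    url=$1
    wait_seconds=${2:-20}    # same default as the -w 20 used above

    # Rebuild the positional parameters as the wget argument list so
    # that quoting survives when they are passed along as "$@".
    set -- --recursive --page-requisites --convert-links --no-parent \
        -e robots=off --random-wait -w "$wait_seconds" "$url"

    if [ "${MIRROR_RUN:-0}" = "1" ]; then
        wget "$@"
    else
        echo "wget $*"
    fi
}

# Dry run against a placeholder URL (not a real target).
mirror_site "https://example.com/docs/"
```

Running it as MIRROR_RUN=1 with a real URL in place of the placeholder starts the actual download.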

After starting wget, you'll probably want to detach from the Screen session (control-a, then d) and disconnect from the server so you're not tying up the network any more than you already are, then go do something else.  This can take a very long time to finish, so there's no sense in waiting around unnecessarily.
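
To get a feel for "a very long time," here is a back-of-the-envelope sketch.  It assumes the random waits average out to roughly the -w value, so the pauses alone cost about (number of requests × wait) seconds; actual transfer time comes on top of that.  The function names, the session name "mirror," and the figure of 5,000 requests are all made up for illustration.  The second function shows an alternative to detaching by hand: Screen's -d -m options start a session that is already detached.

```shell
#!/bin/sh
# estimate_wait_hours REQUESTS WAIT_SECONDS
# Whole hours (rounded down) spent pausing between requests, assuming
# the random wait averages out to about WAIT_SECONDS per request.
estimate_wait_hours() {
    requests=$1
    wait_seconds=$2
    echo $(( $requests * $wait_seconds / 3600 ))
}

# start_detached_mirror URL
# Start wget inside a new, already-detached Screen session.
# The session name "mirror" is an arbitrary choice.
start_detached_mirror() {
    screen -dmS mirror wget --recursive --page-requisites --convert-links \
        --no-parent -e robots=off --random-wait -w 20 "$1"
}

# A hypothetical site needing 5,000 requests at -w 20:
estimate_wait_hours 5000 20    # prints 27
```

Later on, screen -ls lists the running sessions and screen -r mirror reattaches to that one so you can check on progress.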