Let's say there's a website that you want to make a local mirror of, so that you can refer to it offline and make offline backups of it for archival. Let's further say that you have access to a server someplace with enough disk space to hold the copy, and that you can start a task, disconnect, and let it run to completion some time later (with GNU Screen, for example). Finally, let's say that you want the local copy of the site to not be broken when you load it in a browser: all the links should work, all the images should load, and so forth. One of the quickest and easiest ways to do this is with the wget utility.
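To make that a little more concrete, here's a minimal sketch of what such a wget run might look like. The helper name mirror_site and the URL are made up for illustration, and the one-second wait is just an assumed courtesy setting; the flags themselves are standard GNU wget options, but the right mix depends on the site you're copying.

```shell
# Sketch of a browsable site mirror with GNU wget.
# mirror_site is a hypothetical helper name, not part of wget.
# Flags: --mirror recurses with timestamping, --convert-links rewrites links
# to point at the local copy, --page-requisites pulls in images and CSS,
# --adjust-extension adds .html where needed, --no-parent stays below the
# starting directory, and --wait=1 pauses a second between requests.
# Set WGET=echo to preview the command line without downloading anything.
mirror_site() {
    "${WGET:-wget}" \
        --mirror \
        --convert-links \
        --page-requisites \
        --adjust-extension \
        --no-parent \
        --wait=1 \
        "$1"
}
```

You would call it as `mirror_site https://example.com/` (a placeholder URL). To let it run after you disconnect, one option is to start a named session with `screen -S mirror`, run the command inside it, detach with Ctrl-a d, and come back later with `screen -r mirror`.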
UPDATE - 20170302 - Added Firefox plugin for the Internet Archive.
UPDATE - 20170205 - Added Chrome plugin for the Internet Archive.
Note: This article is aimed at people all across the spectrum of levels of experience with computers. You might see a lot of stuff you already know; then again, you might learn one or two things that hadn't shown up on your radar yet. Be patient.
In George Orwell's novel 1984, one of the plot points was something called the Memory Hole. Memory holes were slots all over the building in which Winston Smith worked, into which documents that the Party considered seditious or merely inconvenient were deposited for incineration. Anything that the Ministry of Truth decided had to go because it posed a threat to the party line was destroyed. This meant that if anyone wanted to go back and double check what history might have been, the only things they could get hold of were "officially sanctioned" documents written to reflect the revised Party policy. Human memory's funny: If you don't have any static representation of something to refer back to periodically, eventually you come to think that whatever people have been telling you is the real deal, regardless of what you just lived through. No mind tricks are necessary, just repetition.
The Net's a lot like that. There are literally piles and piles of information everywhere you look, but most of it resides on systems that aren't yours. This blog is running on somebody else's server, and it wouldn't take much to wipe it off the face of the Net. All it would take is a DMCA takedown notice with no evidence (historically speaking, this is usually the case). This has happened in the past a number of times, including to an archive maintained by Project Gutenberg and to documents explicitly placed into the public domain, so that somebody could try to make a buck off of them. This is a common enough thing that the IETF has standardized an HTTP status code to reflect it: 451 - Unavailable For Legal Reasons (RFC 7725).
A couple of weeks ago my webhosting provider sent me a polite e-mail to inform me that I was using too much disk space. A cursory examination of their e-mail showed that they were getting upset about the daily backups of my site that I was stashing in a hidden directory, and that they really prefer that all files in your home directory be accessible. I ran a quick check and, sure enough, about twenty gigabytes times two weeks of daily backups adds up to a fair amount of disk space (on the order of 280 gigabytes). So, the question is, how do I keep backing up all my stuff and not bother the admins any more than I have to?
Thankfully, that's a fairly straightforward operation. Beneath the cut is how I did it.