Let's say there's a website that you want to make a local mirror of. This means that you can refer to it offline, and you can make offline backups of it for archival. Let's further state that you have access to some server someplace with enough disk space to hold the copy, and that you can start a task, disconnect, and let it run to completion some time later, with GNU Screen for example. Let's further state that you want the local copy of the site to not be broken when you load it in a browser; all the links should work, all the images should load, and so forth. One of the quickest and easiest ways to do this is with the wget utility.
Let's say that you want to mirror a website chock full of data before it gets 451'd - say it's epadatadump.com. You've got a boatload of disk space free on your Linux box (maybe a terabyte or so) and a relatively stable network connection. How do you do it?
wget. You use wget. Here's how you do it:
[user@guerilla-archival:(9) ~]$ wget --mirror --continue \ -e robots=off --wait 30 --random-wait http://epadatadump.com/
Let's break this down:
- wget - Self explanatory.
- --mirror - Mirror the site.
- --continue - If you have to re-run the command, pick up where you left off (including the exact location in a file).
- -e robots=off - Ignore robots.txt because it will be in your way otherwise. Many archive owners use this file to prevent web crawlers (and wget) from riffling through their data. Assuming this is sufficiently important, this is what you want to use.
- --wait 30 - Wait 30 seconds between downloads.
- --random-wait - Actually wait for 0.5 * (value of --wait) to 1.5 * (value of --wait) seconds in between requests to evade rate limiters.
- http://epadatadump.com/ - The URL of the website or archive you're copying.
If the archive you're copying requires a username and password to get in, you'll want to add the --user=<your username> and --password=<your password> to the above command line.
Happy mirroring. Make sure you have enough disk space.