Guerilla archival using wget.

Feb 10, 2017

Let's say that you want to mirror a website chock full of data before it gets 451'd - say it's epadatadump.com.  You've got a boatload of disk space free on your Linux box (maybe a terabyte or so) and a relatively stable network connection.  How do you do it?

wget.  You use wget.  Here's how you do it:

[user@guerilla-archival:(9) ~]$ wget --mirror --continue \
    -e robots=off --wait 30 --random-wait http://epadatadump.com/

Let's break this down:

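  • wget - Self-explanatory.
  • --mirror - Turn on the options suited to mirroring: recursive download with unlimited depth, plus timestamp checks so that a re-run only fetches files that have changed.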
  • --continue - If you have to re-run the command, pick up where you left off (including the exact location in a file).
  • -e robots=off - Ignore robots.txt because it will be in your way otherwise.  Many archive owners use this file to keep web crawlers (and wget) from rifling through their data.  If the data is important enough to save, this is the option you want.
  • --wait 30 - Wait 30 seconds between downloads.
  • --random-wait - Actually wait for 0.5 * (value of --wait) to 1.5 * (value of --wait) seconds in between requests to evade rate limiters.
  • http://epadatadump.com/ - The URL of the website or archive you're copying.
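
If you want to see what --wait and --random-wait work out to together, here's a rough sketch of the arithmetic.  The variable names are mine, not anything wget defines, and it assumes awk is installed:

```shell
# Value passed to --wait, in seconds.
wait_seconds=30

# --random-wait varies the delay between 0.5x and 1.5x of that value.
min_wait=$(awk -v w="$wait_seconds" 'BEGIN { printf "%.1f", 0.5 * w }')
max_wait=$(awk -v w="$wait_seconds" 'BEGIN { printf "%.1f", 1.5 * w }')

echo "Delay between requests: ${min_wait}s to ${max_wait}s"
```

For --wait 30 that's anywhere from 15 to 45 seconds per request, so a site with a lot of files is going to take a while.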

If the archive you're copying requires a username and password to get in, you'll want to add the --user=<your username> and --password=<your password> options to the above command line.
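
Here's one way that might look, wrapped in a small shell function so you don't have to retype the whole thing.  The function name and the WGET override are my own inventions for illustration (setting WGET=echo lets you preview the command without hitting the network), not part of wget:

```shell
# Hypothetical helper: mirror a password-protected archive.
#   $1 = username, $2 = password, $3 = URL to mirror
mirror_with_login() {
    # Runs wget unless the WGET variable names another command.
    "${WGET:-wget}" --mirror --continue -e robots=off \
        --wait 30 --random-wait \
        --user="$1" --password="$2" "$3"
}

# Example call (commented out so nothing is downloaded by accident):
# mirror_with_login myname mypassword http://epadatadump.com/
```

Keep in mind that a password given on the command line is visible to other users on the same box while wget is running.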

Happy mirroring.  Make sure you have enough disk space.
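
A quick way to check before you kick things off, as a sketch: it looks at the filesystem holding the current directory, and the 100 GB threshold is an arbitrary number I picked, so adjust it to the size of whatever you're copying.

```shell
# Free space, in kilobytes, on the filesystem holding the current directory.
# -P forces one line per filesystem; column 4 is the available space.
free_kb=$(df -Pk . | awk 'NR == 2 { print $4 }')

# Hypothetical threshold: 100 GB, expressed in kilobytes.
needed_kb=$((100 * 1024 * 1024))

if [ "$free_kb" -lt "$needed_kb" ]; then
    echo "Only $((free_kb / 1024 / 1024)) GB free; clear some space first."
else
    echo "$((free_kb / 1024 / 1024)) GB free."
fi
```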