Guerilla archival using wget.

Feb 10 2017

Let's say that you want to mirror a website chock full of data before it gets 451'd.  You've got a boatload of disk space free on your Linux box (maybe a terabyte or so) and a relatively stable network connection.  How do you do it?

wget.  You use wget.  Here's how you do it:

[user@guerilla-archival:(9) ~]$ wget --mirror --continue \
    -e robots=off --wait 30 --random-wait <URL>

Let's break this down:

  • wget - Self-explanatory.
  • --mirror - Mirror the site.  This is shorthand for recursive retrieval with timestamping and unlimited recursion depth.
  • --continue - If you have to re-run the command, pick up where you left off, including the exact byte offset within a partially downloaded file.
  • -e robots=off - Ignore robots.txt, which will otherwise get in your way.  Many archive owners use this file to keep web crawlers (and wget) from riffling through their data.  If the mirror matters enough, this is the option you want.
  • --wait 30 - Wait 30 seconds between downloads.
  • --random-wait - Actually wait between 0.5 and 1.5 times the --wait value between requests, to evade rate limiters.
  • <URL> - The URL of the website or archive you're copying.
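Taken together, --wait 30 and --random-wait mean every request is delayed somewhere between 15 and 45 seconds, so plan for the clock as well as the disk.  A back-of-the-envelope sketch (the file count here is a made-up example, not from any real site):

```shell
# With --wait 30 and --random-wait, each delay falls in [0.5*30, 1.5*30]
# seconds, averaging roughly 30s.  For a hypothetical 1000-file site,
# the waiting alone adds up fast:
N=1000                             # hypothetical number of files
WAIT=30
MIN_DELAY=$((WAIT / 2))            # 15 second lower bound
MAX_DELAY=$((WAIT * 3 / 2))        # 45 second upper bound
TOTAL_HOURS=$((N * WAIT / 3600))   # ~8 hours of pure waiting
echo "delay range: ${MIN_DELAY}-${MAX_DELAY}s; ~${TOTAL_HOURS}h of waiting for ${N} files"
```

That's about eight hours of deliberate idling for a thousand files, before a single byte of transfer time - slow, but far less likely to get you banned mid-mirror.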

If the archive you're copying requires a username and password to get in, you'll want to add the --user=<your username> and --password=<your password> options to the above command line.
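For instance - the username, password, and URL below are placeholders, not anything from a real archive - the authenticated invocation would look like this.  The sketch assembles and prints the command rather than running it, so nothing hits the network by accident:

```shell
# Sketch only: substitute your own credentials and target.
URL="https://example.com/archive/"   # placeholder URL
CMD="wget --mirror --continue -e robots=off --wait 30 --random-wait"
CMD="$CMD --user=archivist --password=hunter2 $URL"
echo "$CMD"
```

Note that the password ends up in your shell history and in the process list this way; that's usually an acceptable trade-off on a box only you control.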

Happy mirroring.  Make sure you have enough disk space.
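On that note, a quick way to watch both sides of the ledger while the mirror runs (run it from the directory wget is downloading into):

```shell
# Free space on the filesystem you're mirroring onto: df -Pk gives
# POSIX-format output, with available kilobytes in column 4 of line 2.
FREE_KB=$(df -Pk . | awk 'NR==2 {print $4}')
echo "Free: ${FREE_KB} KB"
# And how big the mirror has grown so far:
du -sh . 2>/dev/null
```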

#datarefuge in the Bay Area - 11 February 2017.

Jan 28 2017

UPDATE: 20170131 - The Eventbrite page for this event has gone live!  Sign up!

I haven't had time to write about #datarefuge yet, in part because people a lot closer to the matter have been doing so, and much better than I could at the moment.  An entire movement has arisen around scientific data being 451'd because it's politically inconvenient; not many of us know whether it's being erased or just shut down, and we don't know for certain that it's being copied elsewhere for safekeeping, so we're doing it ourselves.  To do my part, I've been communicating with some of the organizers and having Leandra suck down data as fast as my home link will permit, to store it on her RAID array.  But, the important thing:

On 11 February 2017, the Datarescue SF Bay event will be held at the Berkeley Institute for Data Science from 0900 PST until 1500 PST.  That day, everybody at the event will identify data sets at risk of vanishing, work out how best to mirror them, and download them as fast as possible so they can be archived elsewhere.  Bring your drives, bring your boxen, and get ready to burn up bandwidth.