Technomancer Tools: Creating a local web archive with Chrome and PageArchiver.

Sep 24 2017

Some time ago I wrote an article of suggestions for archiving web content offline, at the very least to have local copies in the event that connectivity was unavailable.  I also expressed some frustration that there didn't seem to be any workable options for the Chromium web browser because I'd been having trouble getting the viable options working.  After my attempt at fixing up Firefox fell far short of my goal (it worked for all of a day, if that) I realized that I needed to come up with something that would let me do what I needed to do.  I installed Chromium on Windbringer (I'm not a fan of Chrome because Google puts a great deal of tracking and monitoring crap into the browser and I'm not okay with that) and set to work.  Here's how I did it:

First I spent some time configuring Chromium with my usual preferences.  That always takes a while, and involved importing my bookmarks from Firefox, an automated process that took several hours to run.  I also exported everything I had cached in Scrapbook, which wound up taking all night.  I then installed the SingleFile Core plugin for Chrome/Chromium, which does the actual work of turning web pages open in browser tabs into a cacheable single file.  I restarted Chromium, which I probably didn't need to do, but I really wanted a working solution so I opted for caution.  Then I installed PageArchiver from the Chrome store and restarted Chromium again.  This added the little "open file folder" icon to the Chromium menu bar.  The order the add-ons are installed in seems to matter: add SingleFile Core first if you do nothing else.

Now get ready for me to feel stupid: If you want to store something using PageArchiver, click on the file folder icon to open the PageArchiver pop-up, click "Tabs" to show a list of tabs you have open in Chromium/Chrome, click the checkboxes for the ones you want to save, and then hit the save button.  For systems like Windbringer which have extremely high resolution screens, that save button may not be visible.  You can, however, scroll both horizontally and vertically in the PageArchiver pop-up panel to expose that button.  I didn't realize that before so I never found that button.  That's all it took.

Here's what didn't work:

I can't import my Scrapbook archives because they're sitting in a folder on Windbringer's desktop as a couple of thousand separate subdirectories, each of them containing all of the web content for a single web page.  I need to figure out what to do there.  It may consist of writing a utility that turns directories full of HTML into SQL commands to inject them into PageArchiver's SQLite database which, by default, resides in the directory $HOME/.config/chromium/Default/databases/chrome-extension_ihkkeoeinpbomhnpkmmkpggkaefincbn_0 (the directory name is constant; the jumble of letters at the end is the same as the one in the Chrome Store URL) and has the filename 2 (yes, just the number 2).  You can open it up with the SQLite browser of your choice if you wish and go poking around.  Somebody may have come up with a technique for it that I just haven't found yet; I don't know.  I may not be able to add them in any reasonable way at all, and may have to resort to running an ad-hoc local web server with Python or something if I want to access them, like this:

[drwho@windbringer ~]$ python2 -m SimpleHTTPServer 8000
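
If you'd rather poke around in PageArchiver's database from a script than from a SQLite browser, something like the following rough Python 3 sketch might be a starting point.  It only lists the tables and their columns, because I don't know what the schema actually looks like; the function name list_tables is made up for this example, and the path is the default one mentioned above, which may differ on your system.  It opens the file read-only, but it's probably still wise to work on a copy, and Chromium may have the file locked while it's running.

```python
import sqlite3
from pathlib import Path

# Default location described in the article; an assumption for any other system.
DB_PATH = (Path.home() / ".config/chromium/Default/databases"
           / "chrome-extension_ihkkeoeinpbomhnpkmmkpggkaefincbn_0" / "2")


def list_tables(db_path):
    """Return {table_name: [column names]} for every table in the database.

    Hypothetical helper for illustration; it makes no assumptions about
    what PageArchiver actually stores.
    """
    # Open read-only so the browser's data can't be modified by accident.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        names = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
        schema = {}
        for name in names:
            # PRAGMA table_info yields one row per column; index 1 is its name.
            quoted = name.replace('"', '""')
            cols = conn.execute('PRAGMA table_info("{}")'.format(quoted))
            schema[name] = [col[1] for col in cols]
        return schema
    finally:
        conn.close()


if __name__ == "__main__":
    if DB_PATH.exists():
        for table, columns in list_tables(DB_PATH).items():
            print(table, columns)
    else:
        print("No database found at", DB_PATH)
```

Once you know which tables and columns exist, you'd at least have something concrete to aim an import utility at.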

Saving stuff before it vanishes down the memory hole.

Jan 26 2017

UPDATE - 20170302 - Added Firefox plugin for the Internet Archive.

UPDATE - 20170205 - Added Chrome plugin for the Internet Archive.

Note: This article is aimed at people all across the spectrum of levels of experience with computers.  You might see a lot of stuff you already know; then again, you might learn one or two things that hadn't shown up on your radar yet.  Be patient.

In George Orwell's novel 1984, one of the plot points was something called the Memory Hole.  Memory holes were slots all over the building in which Winston Smith worked, into which documents that the Party considered seditious or merely inconvenient were deposited for incineration.  Anything that the Ministry of Truth decided had to go because it posed a threat to the party line was destroyed.  This meant that if anyone wanted to go back and double check to see what history might have been, the only things they could get hold of were "officially sanctioned" documents written to reflect the revised Party policy.  Human memory's funny: If you don't have any static representation of something to refer back to periodically, eventually you come to think that whatever people have been telling you is the real deal, regardless of what you just lived through.  No mind tricks are necessary, just repetition.

The Net's a lot like that.  There are literally piles and piles of information everywhere you look, but most of it resides on systems that aren't yours.  This blog is running on somebody else's server, and it wouldn't take much to wipe it off the face of the Net.  All it would take is a DMCA takedown notice with no evidence (historically speaking, this is usually the case).  This has happened in the past a number of times, including to an archive maintained by Project Gutenberg and to documents explicitly placed into the public domain, so that somebody could try to make a buck off of them.  This is a common enough thing that the IETF has standardized an HTTP status code to reflect it: 451, Unavailable For Legal Reasons.

So, how would you make local copies of information that you think might be pulled down because somebody thought it was inconvenient?  For example, climatological data archives?