Feb 10 2017
Let's say that you want to mirror a website chock full of data before it gets 451'd - say it's epadatadump.com. You've got a boatload of disk space free on your Linux box (maybe a terabyte or so) and a relatively stable network connection. How do you do it?
wget. You use wget. Here's how you do it:
[user@guerilla-archival:(9) ~]$ wget --mirror --continue \
-e robots=off --wait 30 --random-wait http://epadatadump.com/
Let's break this down:
- wget - Self-explanatory.
- --mirror - Mirror the site.
- --continue - If you have to re-run the command, pick up where you left off (including the exact location in a file).
- -e robots=off - Ignore robots.txt because it will be in your way otherwise. Many archive owners use this file to prevent web crawlers (and wget) from riffling through their data. Assuming this is sufficiently important, this is what you want to use.
- --wait 30 - Wait 30 seconds between downloads.
- --random-wait - Actually wait for 0.5 * (value of --wait) to 1.5 * (value of --wait) seconds in between requests to evade rate limiters.
- http://epadatadump.com/ - The URL of the website or archive you're copying.
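To make the --random-wait arithmetic concrete, here's a small sketch that works out the shortest and longest pause for a given --wait value, using the 0.5x and 1.5x bounds described above. The wait_range function is something I made up for illustration; it isn't part of wget.

```shell
# wait_range is a hypothetical helper, not a wget feature.
# Prints the shortest and longest delay (in whole seconds, rounded down)
# implied by the 0.5x and 1.5x bounds for a given --wait value.
wait_range() {
    wait_secs=$1
    echo "$(( wait_secs / 2 )) $(( wait_secs * 3 / 2 ))"
}

wait_range 30   # prints "15 45"
```

So with --wait 30, each pause should land somewhere between roughly 15 and 45 seconds.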
If the archive you're copying requires a username and password to get in, you'll want to add the --user=<your username> and --password=<your password> options to the above command line.
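As a rough sketch, here's what the mirroring command looks like with credentials added. The variable names and placeholder values are my own assumptions, and the mirror_command helper only prints the command line so you can look it over before running anything. Keep in mind that a password given on the command line can end up in your shell history.

```shell
# Placeholder credentials; substitute the ones for the archive you're copying.
MIRROR_USER="your-username"
MIRROR_PASS="your-password"

# mirror_command is a hypothetical helper. It does not download anything;
# it only prints the wget command line for the URL you pass in.
mirror_command() {
    echo "wget --mirror --continue -e robots=off --wait 30 --random-wait --user=$MIRROR_USER --password=$MIRROR_PASS $1"
}

mirror_command http://epadatadump.com/
```

Once the printed command looks right, paste it at the prompt to start the mirror.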
Happy mirroring. Make sure you have enough disk space.
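If you want to check the disk space part before you start, something like the following should do it. This is a minimal sketch: free_kb is a made-up helper name, and it assumes the filesystem name reported by df contains no spaces.

```shell
# free_kb is a hypothetical helper: prints the free space, in kilobytes,
# on the filesystem that holds the given directory.
free_kb() {
    # -P forces one line per filesystem; field 4 is the "Available" column.
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

echo "Free space here: $(free_kb .) KB"
```

Run it from the directory you plan to mirror into.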
Jan 26 2017
UPDATE - 20170302 - Added Firefox plugin for the Internet Archive.
UPDATE - 20170205 - Added Chrome plugin for the Internet Archive.
Note: This article is aimed at people all across the spectrum of levels of experience with computers. You might see a lot of stuff you already know; then again, you might learn one or two things that hadn't shown up on your radar yet. Be patient.
In George Orwell's novel 1984, one of the plot points was something called the Memory Hole. Memory holes were slots all over the building in which Winston Smith worked, into which documents that the Party considered seditious or merely inconvenient were deposited for incineration. Anything that the Ministry of Truth decided had to go because it posed a threat to the party line was destroyed. This meant that if anyone wanted to go back and double check what history might have been, the only things they could get hold of were "officially sanctioned" documents written to reflect the revised Party policy. Human memory's funny: If you don't have any static representation of something to refer back to periodically, eventually you come to think that whatever people have been telling you is the real deal, regardless of what you just lived through. No mind tricks are necessary, just repetition.
The Net's a lot like that. There are literally piles and piles of information everywhere you look, but most of it resides on systems that aren't yours. This blog is running on somebody else's server, and it wouldn't take much to wipe it off the face of the Net. All it would take is a DMCA takedown notice with no evidence (historically speaking, this is usually the case). This has happened a number of times in the past, including to an archive maintained by Project Gutenberg and to documents explicitly placed into the public domain, all so somebody could try to make a buck off of them. This is a common enough thing that the IETF has standardized an HTTP status code for it: 451, Unavailable For Legal Reasons.
So, how would you make local copies of information that you think might be pulled down because somebody thought it was inconvenient? For example, climatological data archives?
Jan 20 2017
Not too long ago, when the USB key I'd built a set-top media machine on died from overuse, I decided to rebuild it using Arch Linux with Kodi as the media player. The trick, I keep finding every time, lies in getting Kodi to start up whenever the machine starts up. I think I've re-figured that out six or seven times by now, and each time after it works I forget all about it. So, I guess I'd better write it down for once so that I've got a snapshot of what I did in case I need to do it again later.
The instructions in the Arch Linux wiki work, but you need to pick the right ones to follow. The short-and-sweet ones with the automagickal AUR package don't work. Forget it.
Install LightDM from the Arch package repository (sudo pacman -S lightdm). Then follow the instructions I linked to above to the letter. That means carrying out the following tasks:
Create the file /etc/X11/Xwrapper.config. The file should contain only the following text (without the double quotes): "needs_root_rights = yes"
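One way to create that file from the shell, as a sketch. The write_xwrapper_config helper is a name I made up, and it takes the path as an argument so you can try it on a scratch file before touching /etc/X11.

```shell
# write_xwrapper_config is a hypothetical helper. It writes the single
# line the file needs to whatever path you give it.
write_xwrapper_config() {
    printf '%s\n' 'needs_root_rights = yes' > "$1"
}

# Dry run against a scratch file.
scratch="${TMPDIR:-/tmp}/Xwrapper.config.test"
write_xwrapper_config "$scratch"
cat "$scratch"    # prints: needs_root_rights = yes

# The real file needs root, for example:
#   printf '%s\n' 'needs_root_rights = yes' | sudo tee /etc/X11/Xwrapper.config
```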
Follow the LightDM "Enabling autologin" and "Enabling interactive passwordless login" instructions. Create a user named "kodiuser" (you don't need to set a password) and give it access to the system groups necessary to access resources in the system. I used the following command to do this: sudo useradd -c "Kodi Service Account" -G dbus,network,video,audio,optical,storage,users -m kodiuser
Create two additional groups which LightDM needs to enable autologin:
- sudo groupadd -r autologin
- sudo groupadd -r nopasswdlogin
Add kodiuser to those groups:
- sudo gpasswd -a kodiuser autologin
- sudo gpasswd -a kodiuser nopasswdlogin
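To double check that the group changes took, a quick sketch like this one can help. The in_group helper is hypothetical; it just wraps the standard id command.

```shell
# in_group is a hypothetical helper: succeeds if the given user belongs
# to the given group, fails otherwise (including if the user doesn't exist).
in_group() {
    id -nG "$1" 2>/dev/null | tr ' ' '\n' | grep -qxF "$2"
}

for group in autologin nopasswdlogin; do
    if in_group kodiuser "$group"; then
        echo "kodiuser is in $group"
    else
        echo "kodiuser is NOT in $group"
    fi
done
```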
Jan 15 2017
EDIT: 20170123 - My reviewers have suggested some edits to the article, many of which I've applied.
It's been a while since I wrote a Huginn tutorial, so let's start with a basic one to get you comfortable with the idea of building an agent network. This agent network will run every half hour, poll a REST API endpoint, and e-mail you what it gets. You'll have to have access to an already running Huginn instance that can send outbound e-mail. This post is going to be kind of lengthy, but that's because I'm laying out some fundamentals. Once you understand those you can skip past the explanations and move on to the good stuff.
First, a little background - what's a REST API? If you already know just skip down past the cut and move on, but if you don't know what I'm talking about I'll try to explain. I'm going to assume that you've been able to install Huginn using my instructions or someone else's, or you've got access to a running instance. I'm also going to assume that you're not a hardcore coder, you're someone who's trying to apply a useful tool to your everyday life.
At its simplest, an API (Application Programming Interface) is a way to interact with a system or part of a system. It's (hopefully) designed to be regular, which means that once you understand the basics you can apply that knowledge to figure out the more complex parts with a little messing around because the basics continue to apply. Let's say that I've written a library called myLib, which implements a bunch of really annoying stuff (like opening and closing files and sorting elements of data) so you don't have to. My library has a bunch of functions that carry out those tasks (openStupidFile(), readAllOfFilesContents(), sortIntegers(), sortFloatingPointValues(), searchThisCrapForAString()) when you call them in your own code. Those functions are part of my library's API. In the documentation are instructions for calling each function, which include the arguments you need to pass to each function (e.g., openStupidFile() takes two arguments, a full path to a file and 'r' for read-only or 'rw' for read-write, and it returns a handle to the file that you can pass to another function or NULL if it failed). The data type each function returns (the file handle or NULL value) is part of the API, as are the arguments each function takes (path to the file and 'r' or 'rw').
The same principle has been applied to the Web in several different ways. What concerns us right now is something called the RESTful API (REpresentational State Transfer), which basically means interacting with a web service using HTTP verbs (GET, PUT, POST, and so forth) and referencing URLs instead of functions in a library. Like HTTP, requests are stateless, which means that you make a request, the server responds, and there's no further context beyond that. You can think of RESTful APIs as fire-and-forget. The general idea is that there is a web server of some kind, which could be a traditional one like Apache or a specialized one running inside a web app built around a server like web.py which responds to those URLs in some way. If you make a GET request to a URL, it'll send you some data. If you make a PUT request you replace something on the server at that URL with something you send it. If you make a POST request you create a new something on the server. If you make a DELETE request that something on the server gets erased. All of this depends on the HTTP verbs the server supports (not all REST APIs need to support all of them), your access permissions (not every account can do everything), whether or not you've authenticated to the server (it is sometimes the case that read-only access doesn't require an account but read-write access does require an account or an API token or something else along those lines), or who owns a particular resource (Alice's resources are read-only for every other account on the server, but read-write for her alone), of course. REST makes life easier but it's not carte blanche to run hog wild. Additionally, many REST API services enforce access limits - you get so many requests per minute, hour, or day and after that it returns errors. For example, Twitter's API will return an Error 420 (enhance your calm) if you trip their rate limiter.
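To show how the verbs map onto requests, here's a sketch using curl. Everything about the endpoint is an assumption on my part: api.example.com, the /widgets path, and the JSON bodies are placeholders, not a real service. The rest_command helper is also made up, and it only prints each curl command line rather than sending anything.

```shell
# Hypothetical endpoint; substitute the base URL of a real REST service.
API_BASE="https://api.example.com"

# rest_command is a hypothetical helper. Given an HTTP verb, a path, and an
# optional JSON body, it prints the matching curl command line.
rest_command() {
    method=$1
    path=$2
    body=${3:-}
    if [ -n "$body" ]; then
        echo "curl -sS -X $method -H 'Content-Type: application/json' -d '$body' $API_BASE$path"
    else
        echo "curl -sS -X $method $API_BASE$path"
    fi
}

rest_command GET    /widgets/42                          # read a resource
rest_command POST   /widgets     '{"name": "new"}'       # create a new one
rest_command PUT    /widgets/42  '{"name": "renamed"}'   # replace an existing one
rest_command DELETE /widgets/42                          # erase it
```

Each request stands alone, which is the stateless part: nothing in the DELETE depends on the server remembering the earlier GET.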
Jan 02 2017
Since PivotX went out of support I've been running the Bolt CMS for my website at Dreamhost (referral link). A couple of weeks back you may have noticed my site having some trouble, because I ran into significant difficulty upgrading from the v2.x release series to the v3.x release series. Some stuff went sideways, and I had to restore from backup at least once before I managed to get the upgrade procedure straightened out with the help of some of the developers in the Bolt IRC channel on Freenode. If it weren't for help from rossriley it would have taken significantly longer to un-fuck my website.
Here's the procedure that I used to get my site upgraded to the latest release of Bolt.
Jan 02 2017
20170107: It's not "group name" it's "Group ID." I don't know how to find that yet.
The communications program Signal by Open Whisper Systems is unique in several respects. First, its barrier to entry is minimal. You can search for it in the Google Play online store or Apple iOS appstore and it's waiting there for you at no cost. Second, it's designed for security by default, i.e., you don't have to mess around with it to make it work, and it does the right thing automatically and enforces strong encryption by default (unlike a lot of personal security software). It interoperates seamlessly with people who don't use Signal but you have the option to invite them to install it with a single tap. Its protocol is an open standard that multiple companies have implemented, so theoretically anyone can write their own implementation of the client (Android, iOS) or server, or compile it for themselves. It's an SMS/MMS application, so you can use it as your default text messaging client on your mobile, plus it can do text message conferencing with multiple people automatically (it's a great way to keep in touch with friends if you're at the same con). There's even a desktop Signal client that runs inside of Google Chrome or Chromium (source code for the interested and curious).
So why, exactly, am I posting about Signal?
There is a little-known command-line implementation of Signal that I've been experimenting with because I eventually plan on writing a bot for my exocortex. In playing around with it, I've come to realize that it's not particularly friendly to use at all, and I might have to break down and use the dbus interface to do anything useful with it. I don't look forward to that, but that's not the point. The point is, I've compiled some notes about how to use the command line version of Signal and I wanted to put them online in case somebody finds them helpful.