Saving stuff before it vanishes down the memory hole.

Jan 26 2017

UPDATE - 20170302 - Added Firefox plugin for the Internet Archive.

UPDATE - 20170205 - Added Chrome plugin for the Internet Archive.

Note: This article is aimed at readers across the whole spectrum of computer experience.  You might see a lot of stuff you already know; then again, you might learn a thing or two that hadn't shown up on your radar yet.  Be patient.

In George Orwell's novel 1984, one of the plot devices was the memory hole: slots all over the building in which Winston Smith worked, into which documents the Party considered seditious or merely inconvenient were deposited for incineration.  Anything the Ministry of Truth decided had to go because it posed a threat to the party line was destroyed.  This meant that if anyone wanted to go back and double-check what history might actually have been, the only things they could get hold of were "officially sanctioned" documents written to reflect the revised Party policy.  Human memory's funny: if you don't have a static representation of something to refer back to periodically, eventually you come to believe whatever people have been telling you, regardless of what you just lived through.  No mind tricks are necessary, just repetition.

The Net's a lot like that.  There are piles and piles of information everywhere you look, but most of it resides on systems that aren't yours.  This blog runs on somebody else's server, and it wouldn't take much to wipe it off the face of the Net: a single DMCA takedown notice, which historically speaking usually comes with no evidence behind it.  This has happened a number of times, including to an archive maintained by Project Gutenberg and to documents explicitly placed into the public domain, taken down so somebody could try to make a buck off of them.  It's common enough that the IETF standardized an HTTP status code for it: 451 - Unavailable For Legal Reasons.
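
As a quick illustration (mine, not part of the original standard's text), here's how you might spot that particular status code in Python using the requests library; the URL is just a placeholder:

    import requests

    response = requests.get("https://example.com/some-document.html")
    if response.status_code == 451:
        # The server is telling you the content was removed for legal
        # reasons - a good signal that a mirror or archive is needed.
        print("Gone down the memory hole: " + response.url)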

So, how would you make local copies of information that you think might be pulled down because somebody thought it was inconvenient?  For example, climatological data archives?
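
One approach, sketched below on the assumption that the Wayback Machine's public /save/ endpoint still behaves the way it does at the time of writing, is to ask the Internet Archive for a snapshot while also pulling down a copy of your own:

    import requests

    def save_a_copy(url):
        # Ask the Internet Archive's Wayback Machine to snapshot the page.
        # This is the same endpoint the browser plugins mentioned above use.
        requests.get("https://web.archive.org/save/" + url)

        # Also keep a local copy, because archives can vanish, too.
        local_filename = url.split("/")[-1] or "index.html"
        with open(local_filename, "wb") as file:
            file.write(requests.get(url).content)

    save_a_copy("https://example.com/climate-data.html")

For entire sites, a recursive mirroring tool such as wget's --mirror mode is the heavier-duty option.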

The 2016 election and weird patterns on Twitter.

Dec 18 2016

You've already read my opinion of the 2016 election's outcome so I'll not subject you to it again. However, I would like to talk about some weird stuff I (we, really) kept noticing on Twitter in the days and weeks leading up to Election Day.

As I've often mentioned in the past, a nontrivial portion of my Exocortex is tasked with monitoring global activity on Twitter by hooking into the back-end API service and pulling raw data out to analyze. Those agents fire on a staggered schedule, anywhere from every 30 minutes to every two hours; a couple of dozen follow specific accounts, while others use the public streaming API and grab large samples of every tweet that hits Twitter around the world.
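
This isn't the code my agents actually run, but a minimal sketch of the streaming side using the tweepy library (version 3.x, current as of this writing) looks something like this; the credentials are placeholders:

    import tweepy

    class SampleListener(tweepy.StreamListener):
        def on_status(self, status):
            # Each status is one tweet from the worldwide public sample.
            print(status.user.screen_name + ": " + status.text)

    # Placeholder credentials - you get real ones from apps.twitter.com.
    auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
    auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")

    stream = tweepy.Stream(auth=auth, listener=SampleListener())
    stream.sample()  # The public streaming API's random sample of all tweets.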

If you want to look at a simplified version of that agent network to see how it works, I've made it available on Github. As you can see, the output of that particular agent network is batched into e-mails of arbitrary size by the Email Digest Agent and sent to one of my e-mail addresses as a single message. The reason for this is twofold: it's easier to scan through one large e-mail and look for patterns visually than it is to scan through several dozen to several hundred separate messages in sequence, and it uses fewer system resources on my e-mail provider's end to store and present that output.
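
For the curious, a Huginn agent of that type is configured with a small JSON options block. The following is an illustrative sketch rather than my production configuration, and the exact option names may vary between Huginn versions:

    {
      "expected_receive_period_in_days": "2",
      "subject": "Lifeline: Twitter digest"
    }

Events accumulate in the agent until its schedule fires, at which point everything queued gets rolled up into one message.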

Six or seven weeks before Election Day, Lifeline (the recognition code for the agent network which carries out these sorts of tasks for me) started sending me gigantic e-mail digests every hour or so, each containing several hundred tweets at a time (the biggest held nearly a thousand, as I recall). Scanning through those e-mails showed that most of the tweets were largely identical, save for the @username that sent them. Tweets about CNN and the Washington Post being GRU and SVR disinformation projects, or on-the-ground reporting tagged with #fakenews. Links pointing to Infowars articles (the tweets consisted of the titles of posts, links, and the same sets of hashtags; if you ran the Twitter-compressed URLs through a URL unshortener, they all pointed to the same posts). Anti-Bernie and anti-Hillary tweets that all had the same content and the same hashtags. Trump-as-the-second-coming messages and calls to action. Rivers of bile directed at political commentators and reporters. Links to fake Wikileaks Podesta e-mails that went to Pastebin or other post-and-forget sites (there wasn't even enough data in the fakes to attempt to validate them; by the bye, the method linked to is really easy to automate). I saw the same phenomenon with #pizzagate tweets, only those came in shorter, more irregular bursts. It went on and on, day and night for weeks, hundreds upon hundreds of copies of the same text from hundreds of different accounts. I had to throw more CPUs at Exocortex to keep up with the flood.

All of these posts, when taken together as groups or families, consisted of exactly the same text each and every time, though the t.co URLs were different (a brief digression: Twitter's URL shortening service seems to generate a different output for the same input URL, to support statistics gathering and user tracking as part of its business strategy). Additionally, all of those posts went up more or less within the same minute. The Twitter API doesn't let you pull the IP addresses tweets were sent from, but the timestamps are available to the second. If you looked at the source field of each tweet (you'll need to scroll down a bit), they were all largely the same, usually empty (""), with a few minor exceptions here and there. The activity pattern strongly suggests that bots were used to strafe circles of human-controlled accounts on Twitter which roughly correspond to memetic communities. Figuring that somebody had already done some kind of visualization analysis (which I suck at), I had Argus (one of my web search bots) do some digging, and he found a bunch of pages like this study, which seem to back up my observations.
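
To make the pattern concrete, here's a rough sketch (mine, not taken from the study linked above) of how you might flag such families programmatically, assuming you've already pulled the tweets down as dictionaries with text, created_at, and username fields:

    from collections import defaultdict
    from datetime import timedelta

    def find_bot_families(tweets):
        # Group tweets whose text is identical once the per-tweet t.co
        # links are stripped out.
        families = defaultdict(list)
        for tweet in tweets:
            words = tweet["text"].split()
            key = " ".join(w for w in words if not w.startswith("https://t.co/"))
            families[key].append(tweet)

        # A family is suspicious if several accounts posted the same text
        # within roughly the same minute.
        suspicious = []
        for text, family in families.items():
            if len(family) < 3:
                continue
            stamps = sorted(t["created_at"] for t in family)
            if stamps[-1] - stamps[0] <= timedelta(minutes=1):
                suspicious.append((text, [t["username"] for t in family]))
        return suspicious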

The sort of horsepower needed to create such an army of bots would be very easy to assemble: Buy a bunch of virtual machines on Amazon's EC2. Write a couple of bots using Ruby or Python. Sign up for a bunch of Twitter accounts or just buy them in bulk. Make a Docker image that'll effectively turn one EC2 instance into as many as you can reasonably run without crashing the VM. Deploy lots and lots of copies of your bots into those Docker containers. Use an orchestration mechanism like Ansible to configure the bots with API keys and command them en masse; if you're in a time crunch you could even use something like pssh to fire them all up with a single command. Turn them loose. If you've been in IT for a year, this is a Saturday afternoon project that won't cost you a whole lot, but could make you a lot of money.
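
To underline how low the bar is, here's a toy version of such a bot in Python using tweepy; the credentials and messages are placeholders, and a real operation would run hundreds of these, one per purchased account:

    import random
    import time

    import tweepy

    # Placeholder credentials, one set per account the bot controls.
    auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
    auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
    api = tweepy.API(auth)

    CANNED_TWEETS = [
        "Placeholder talking point #1 #hashtag",
        "Placeholder talking point #2 #hashtag",
    ]

    while True:
        api.update_status(random.choice(CANNED_TWEETS))
        # A little jitter so the accounts don't all post in lockstep.
        time.sleep(random.randint(60, 600))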

"Well, yeah, there was an army of bots advertising on Twitter. What else is new?" you're probably saying.

What I am saying is simply this: this post describes a little bit of how this sort of media strategy works, what the patterns look like from the 50,000 foot view, and my (well, our) observations. I don't think I did anything really ground-breaking here; my AI systems simply stumbled across what was going on by accident. It was the hardcore data scientists who did the real academic work on it (though that work is a bit inaccessible unless you're a computer geek).

Memetic warfare is here, and our social networks are the battlegrounds. Armor up.

Exocortex: Halo

Mar 26 2016

In my last post on the topic of exocortices, I discussed the Huginn project: how it works, what the code for the agents actually looks like, and some of the things I use Huginn's agent networks for in my everyday life. In short, I call it my exocortex - an extension of the information processing capabilities of my brain, running in silico instead of in vivo. Now I'm going to talk about Exocortex Halo, a separate suite of bots which augment Huginn by carrying out tasks that Huginn by itself isn't designed to handle very easily, and thus extend my personal capabilities significantly.

Now, don't get me wrong, Huginn has a fantastic suite of agents built into it already, and more are being added every day. However, good design requires one to recognize when an existing software architecture is suited to some things and not others, and to make allowances for that. To put it another way, it was highly unlikely that I would be able to shoehorn the additional functionality I wanted into Huginn and have a hope in hell of it working. What Huginn does have is a multitude of interfaces for getting events into and out of itself, and I could make use of those interfaces to plug my own bots into it. The Website Agent is ideal for polling REST API interfaces of my own design; the Jabber Agent implements a simple XMPP client which can send events to an address on an XMPP server (assuming that it has its own login credentials); oversimplifying a bit, the Webhook Agent sets up a custom REST API endpoint that external software can use to send events into Huginn for processing; and the Data Output Agent sends events out of Huginn in the form of an RSS feed or a JSON document that can be consumed and parsed by other software.
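
As an example of what that looks like from the bot's side, here's a sketch of handing an event to a Webhook Agent with Python's requests library. The URL is hypothetical, but it follows the /users/<user id>/web_requests/<agent id>/<secret> pattern Huginn generates:

    import requests

    # Hypothetical Webhook Agent endpoint; Huginn shows you the real one
    # on the agent's summary page.
    WEBHOOK_URL = "https://huginn.example.com/users/1/web_requests/42/s3kr1t"

    payload = {
        "bot": "halo-example",
        "message": "Task complete.",
    }

    response = requests.post(WEBHOOK_URL, json=payload)
    response.raise_for_status()  # Huginn replies with the agent's configured response text.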