Building your own Google Alerts with Huginn and Searx.

Sep 30 2017

A Google feature that doesn't ordinarily get a lot of attention is Google Alerts, a service that periodically sends you links to things that match certain search terms.  Some people use it for vanity searching because they have a personal brand to maintain, some people use it to keep on top of a rare thing they're interested in (anyone remember the show Probe?), some people use it for bargain hunting, some people use it for intel collection...  However, this is all predicated on Google finding out what you're interested in, and interested enough to have it send you the latest search results on a periodic basis.  Not everybody's okay with that.

A while ago, I built my own version of Google Alerts using a couple of tools already integrated into my exocortex which I use to periodically run searches, gather information, and compile reports to read when I have a spare moment.  The advantage to this is that the only entities that know about what I'm interested in are other parts of me, and it's as flexible as I care to make it.  The disadvantage is that I have some infrastructure to maintain, but as I'll get to in a bit there are ways to mitigate the amount of effort required.  Here's how I did it...

Huginn: Writing a simple agent network.

Jan 15 2017

EDIT: 20170123 - My reviewers have suggested some edits to the article, many of which I've applied.

It's been a while since I wrote a Huginn tutorial, so let's start with a basic one to get you comfortable with the idea of building an agent network.  This agent network will run every half hour, poll a REST API endpoint, and e-mail you what it gets.  You'll have to have access to an already running Huginn instance that can send outbound e-mail.  This post is going to be kind of lengthy, but that's because I'm laying out some fundamentals.  Once you understand those you can skip past the explanations and move on to the good stuff.
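To make the shape of that agent network concrete, here is a minimal sketch of the same three steps (wait half an hour, poll an endpoint, turn the answer into an e-mail) in plain Python.  This is not how Huginn implements it; the function names, the subject line, and the addresses are all made up for illustration, and the only assumption about the endpoint is that it returns JSON.

```python
import json
import urllib.request
from email.message import EmailMessage

# "Every half hour", expressed in seconds.
POLL_INTERVAL_SECONDS = 30 * 60


def poll(url: str) -> dict:
    """Make a GET request to a REST endpoint and decode the JSON it returns."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def build_email(payload: dict, sender: str, recipient: str) -> EmailMessage:
    """Wrap whatever the endpoint returned in an e-mail message."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = "Latest poll results"  # hypothetical subject line
    # Pretty-print the JSON so the message body is readable.
    message.set_content(json.dumps(payload, indent=2, sort_keys=True))
    return message
```

Actually sending the message would be a job for something like smtplib's send_message(), and the half-hour wait would be a scheduler or a sleep loop.  In Huginn, the agents' schedules and the instance's outbound e-mail settings take care of both.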

First, a little background - what's a REST API?  If you already know just skip down past the cut and move on, but if you don't know what I'm talking about I'll try to explain.  I'm going to assume that you've been able to install Huginn using my instructions or someone else's, or you've got access to a running instance.  I'm also going to assume that you're not a hardcore coder, you're someone who's trying to apply a useful tool to your everyday life.

At its simplest, an API (Application Programming Interface) is a way to interact with a system or part of a system.  It's (hopefully) designed to be regular, which means that once you understand the basics you can apply that knowledge to figure out the more complex parts with a little messing around because the basics continue to apply.  Let's say that I've written a library called myLib, which implements a bunch of really annoying stuff (like opening and closing files and sorting elements of data) so you don't have to.  My library has a bunch of functions that carry out those tasks (openStupidFile(), readAllOfFilesContents(), sortIntegers(), sortFloatingPointValues(), searchThisCrapForAString()) when you call them in your own code.  Those functions are part of my library's API.  In the documentation are instructions for calling each function, which include the arguments you need to pass to each function (e.g., openStupidFile() takes two arguments, a full path to a file and 'r' for read-only or 'rw' for read-write, and it returns a handle to the file that you can pass to another function, or NULL if it failed).  The data type each function returns (the file handle or NULL value) is part of the API, as are the arguments each function takes (path to the file and 'r' or 'rw').
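If it helps to see that in code, here's a rough sketch of what one of those functions might look like if myLib were written in Python.  myLib is imaginary, so everything here is hypothetical: the snake_case name open_stupid_file() stands in for openStupidFile(), and Python's None stands in for NULL.

```python
from typing import IO, Optional


def open_stupid_file(path: str, mode: str) -> Optional[IO[str]]:
    """Open a file for reading ('r') or reading and writing ('rw').

    Returns a file handle on success, or None (standing in for NULL)
    if the mode is not recognized or the file cannot be opened.
    """
    if mode not in ("r", "rw"):
        return None
    # Python spells read-write on an existing file as 'r+'.
    python_mode = "r" if mode == "r" else "r+"
    try:
        return open(path, python_mode)
    except OSError:
        return None
```

The function signature is the API: two arguments in, a handle or None out.  Code that calls it only has to check for None before passing the handle along to something else, and doesn't need to know how the file actually gets opened.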

The same principle has been applied to the Web in several different ways.  What concerns us right now is something called the RESTful API (REpresentational State Transfer), which basically means interacting with a web service using HTTP verbs (GET, PUT, POST, and so forth) and referencing URLs instead of functions in a library.  As with HTTP itself, requests are stateless, which means that you make a request, the server responds, and there's no further context beyond that.  You can think of RESTful APIs as fire-and-forget.  The general idea is that there is a web server of some kind, which could be a traditional one like Apache or a specialized one running inside a web app built around a framework like web.py, which responds to those URLs in some way.  If you make a GET request to a URL, it'll send you some data.  If you make a PUT request you replace something on the server at that URL with something you send it.  If you make a POST request you create a new something on the server.  If you make a DELETE request that something on the server gets erased.

Of course, all of this depends on the HTTP verbs the server supports (not all REST APIs need to support all of them), your access permissions (not every account can do everything), whether or not you've authenticated to the server (it is sometimes the case that read-only access doesn't require an account but read-write access does require an account or an API token or something else along those lines), and who owns a particular resource (Alice's resources are read-only for every other account on the server, but read-write for her alone).  REST makes life easier but it's not carte blanche to run hog wild.  Additionally, many REST API services enforce access limits - you get so many requests per minute, hour, or day and after that the service returns errors.  For example, Twitter's API will return an Error 420 (enhance your calm) if you trip their rate limiter; the standard HTTP status code for the same situation is 429 (Too Many Requests).
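Here's a small sketch of those four verbs in action, using nothing but Python's standard library.  To keep it self-contained it starts a toy server on localhost that keeps its "resources" in a dictionary; the /notes URLs, the JSON layout, and the helper names (ToyRestHandler, start_toy_server(), call()) are all invented for this example and don't correspond to any real service.

```python
import itertools
import json
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

STORE = {}                    # in-memory "resources", keyed by URL path
NEXT_ID = itertools.count(1)  # source of IDs for newly created resources


class ToyRestHandler(BaseHTTPRequestHandler):
    def _reply(self, status, payload):
        body = json.dumps(payload).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def _read_json(self):
        length = int(self.headers.get("Content-Length", 0))
        return json.loads(self.rfile.read(length) or b"null")

    def do_GET(self):     # GET: send back whatever lives at this URL
        if self.path in STORE:
            self._reply(200, STORE[self.path])
        else:
            self._reply(404, {"error": "not found"})

    def do_POST(self):    # POST: create a new something under this URL
        new_path = f"{self.path}/{next(NEXT_ID)}"
        STORE[new_path] = self._read_json()
        self._reply(201, {"location": new_path})

    def do_PUT(self):     # PUT: replace the something at this URL
        STORE[self.path] = self._read_json()
        self._reply(200, STORE[self.path])

    def do_DELETE(self):  # DELETE: erase the something at this URL
        STORE.pop(self.path, None)
        self._reply(200, {"deleted": self.path})

    def log_message(self, *args):
        pass              # keep the example quiet


def start_toy_server():
    """Start the toy server on a free localhost port in a background thread."""
    server = HTTPServer(("127.0.0.1", 0), ToyRestHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, f"http://127.0.0.1:{server.server_address[1]}"


def call(base_url, verb, path, payload=None):
    """Make one stateless request; return (HTTP status, decoded JSON body)."""
    data = json.dumps(payload).encode() if payload is not None else None
    request = urllib.request.Request(
        base_url + path, data=data, method=verb,
        headers={"Content-Type": "application/json"})
    # Skip any system-wide proxy settings; we're only talking to localhost.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(request, timeout=10) as response:
            return response.status, json.loads(response.read())
    except urllib.error.HTTPError as err:
        # Error statuses (404 here, 429 or 420 from a rate limiter) land here.
        return err.code, json.loads(err.read())


if __name__ == "__main__":
    server, base = start_toy_server()
    print(call(base, "POST", "/notes", {"text": "hello"}))
    print(call(base, "GET", "/notes/1"))
    print(call(base, "PUT", "/notes/1", {"text": "changed"}))
    print(call(base, "DELETE", "/notes/1"))
    print(call(base, "GET", "/notes/1"))
    server.shutdown()
```

Each call() stands entirely on its own: the server doesn't remember the previous request, only what's currently stored at each URL.  A real service would layer authentication, permissions, and rate limiting on top of this, which is where error codes like the ones mentioned above come from.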

Exocortex: Setting up Huginn

Sep 11 2016

In my last post I said that I'd describe in greater detail how to set up the software that I use as the core of my exocortex, called Huginn.

First, you need someplace for the software to live. I'll say up front that you can happily run Huginn on your laptop, desktop workstation, or server so long as it's not running Windows. Huginn is developed under Linux; it might run under one of the BSDs but I've never tried. I don't know if it'll run as expected in Mac OS X because I don't have a Mac. If you want to give Huginn a try but you run Windows, I suggest installing VirtualBox and building a quick virtual machine. I recommend sticking with the officially supported distributions and using the latest stable version of Ubuntu Server. At the risk of sounding self-serving, I also suggest using one of my open source Ubuntu hardening sets to lock down the security on your new VM all in one go. If you're feeling adventurous you can get a VPS from a hosting provider like Amazon's AWS or Linode. I run some of my stuff at Digital Ocean and I'm very pleased with their service. If you'd like to give Digital Ocean a try here's my referral link which will give you $10 US of credit, and you are not obligated to continue using their service after it's used up. If I didn't like their service (both commercial and customer) that much I wouldn't bother passing it around.

As serious web apps go, Huginn's system requirements aren't very high so you can build a very functional instance without putting a lot of effort or money toward it. You can run Huginn in about one gigabyte of RAM and one CPU, with a relatively small amount of disk space (twenty gigabytes or so, a fairly small amount for servers these days). Digital Ocean's $10 US/month droplet (one CPU, one gigabyte of RAM, and 30 gigabytes of storage) is sufficient for experimentation and light use. To really get serious usage out of Huginn you'll need about two gigabytes of RAM to fit multiple worker daemons into memory. I personally use the following specs for all of my Huginn virtual machines: at least two CPUs, 60 gigabytes of disk space, and at least four gigabytes of RAM. Chances are, any physical machine you have on your desk exceeds these requirements so don't worry too much about it (but see these special instructions if you plan on using an ultra-mini machine like the Raspberry Pi). If you build your own virtual machine, take these requirements into account.

Slides from my HOPE XI talk.

Jul 16 2016

For starters, thank you to everyone who attended my talk at HOPE XI. I know it was on Sunday afternoon when a lot of people were either getting ready to go home, spending their last bits of time with friends they don't get to see often, or fried from partying the night before. Your attending means a lot to me, and I can't thank you enough. That said, here are the slides from my talk as a single HTML page to read online and as a PDF document to read offline (both were authored in Markdown and generated with Landslide).

Once again, the source code for Huginn can be found here, and the source tree for the Halo project can be found here.

BONUS! Here are some proof-of-concept agent networks that you can load into your own Huginn instances and experiment with! Butterfly In China is the agent network that generates my daily weather reports. Shake, Rattle, and Roll monitors the USGS' seismic activity alerting system for earthquakes of a certain strength or above. Tripwire is an HTML parsing-heavy agent network that pulls FBI Most Wanted Lists and sends alerts when they change.