Last weekend I was running short of stuff to hack around on and lamented this fact on the Fediverse. I was summarily challenged to find a way to archive posts to the Fediverse in an open, easy to understand data format that was easy to index, and did not use any third party services (like IFTTT or Zapier). I thought about it a bit and came up with a reasonably simple solution that uses three Huginn agents to collect, process, and write out posts as individual JSON documents to the same box I run that part of my exocortex on. This is going to go deep geek below the cut so if it's not your cup of tea, feel free to move on to an earlier post.
A common task that people using Huginn set up as their "Hello, world!" project is getting the daily weather report because it's practical, easy, and fairly well documented. However, the existing example is somewhat obsolete because it references the Weather Underground API that no longer exists, having been sunset at the end of 2018. Recently, the Weather Underground code in the Huginn Weather Agent was taken out because it's no longer usable. But, other options exist. The US National Weather Service has a free to use API that we can use with Huginn with a little extra work. Here's what we have to do:
- Get the GPS coordinates for the place we want weather reports for.
- Use the GPS coordinates to get data out of the NWS API.
- Build a weather report message.
- E-mail it.
As happens sometimes, the admins of the NWS API have imposed an additional constraint upon users accessing their data: They ask that the user agent string of whatever software you use be unique, and ideally include an e-mail address they can contact you through in case something goes amiss. This isn't a big deal.
This tutorial assumes that you've worked with Huginn a bit in the past, but if you haven't I strongly suggest that you read my earlier posts to familiarize yourself.
Okay. Let's get started.
It seems that there is another influx of refugees from a certain social network that's turned into a never ending flood of bile, vitriol, and cortisol into what we call the Fediverse, a network of a couple of thousand websites running a number of different applications that communicate with each other over a protocol called ActivityPub. Ultimately, the Fediverse is different from Twitter and Facebook in that it's not run as a for-profit entity. There are no analytics, no suggestions of "thought leaders" you might want to follow, no automated curation of the posts you can see versus the ones you really want to see. Socially speaking, you don't find people carefully polishing their brands or trying to game hashtag trends but instead everything from somebody kicking back after work with a cup of coffee to people carefully archiving the firmware of classic computer hardware to in-jokes about pineapples. Rather than fame, you get people.
But that's not what I want to talk about. I've been asked by a couple of people to post a brief tutorial of how I interfaced my Huginn instance with mastodon.social, the Mastodon instance that I spend most of my time hanging out on.
A Google feature that doesn't ordinarily get a lot of attention is Google Alerts, which is a service that sends you links to things that match certain search terms on a periodic basis. Some people use it for vanity searching because they have a personal brand to maintain, some people use it to keep on top of a rare thing they're interested in (anyone remember the show Probe?), some people use it for bargain hunting, some people use it for intel collection... however, this is all predicated on Google finding out what you're interested in, certainly interested enough to have it send you the latest search results on a periodic basis. Not everybody's okay with that.
A while ago, I built my own version of Google Alerts using a couple of tools already integrated into my exocortex which I use to periodically run searches, gather information, and compile reports to read when I have a spare moment. The advantage to this is that the only entities that know about what I'm interested in are other parts of me, and it's as flexible as I care to make it. The disadvantage is that I have some infrastructure to maintain, but as I'll get to in a bit there are ways to mitigate the amount of effort required. Here's how I did it...
EDIT: 20170123 - My reviewers have suggested some edits to the article, many of which I've applied.
It's been a while since I wrote a Huginn tutorial, so let's start with a basic one to get you comfortable with the idea of building an agent network. This agent network will run every half hour, poll a REST API endpoint, and e-mail you what it gets. You'll have to have access to an already running Huginn instance that can send outbound e-mail. This post is going to be kind of lengthy, but that's because I'm laying out some fundamentals. Once you understand those you can skip past the explanations and move on to the good stuff.
First, a little background - what's a REST API? If you already know just skip down past the cut and move on, but if you don't know what I'm talking about I'll try to explain. I'm going to assume that you've been able to install Huginn using my instructions or someone else's, or you've got access to a running instance. I'm also going to assume that you're not a hardcore coder, you're someone who's trying to apply a useful tool to your everyday life.
At its simplest, an API (Application Program Interface) is a way to interact with a system or part of a system. It's (hopefully) designed to be regular, which means that once you understand the basics you can apply that knowledge to figure out the more complex parts with a little messing around because the basics continue to apply. Let's say that I've written a library called myLib, which implements a bunch of really annoying stuff (like opening and closing files and sorting elements of data) so you don't have to. My library has a bunch of functions that carry out those tasks (openStupidFile(), readAllOfFilesContents(), sortIntegers(), sortFloatingPointValues(), searchThisCrapForAString()) when you call them in your own code. Those functions are part of my library's API. In the documentation are instructions for calling each function, which includes the arguments you need to pass to each function (e.g., openStupidFile() takes two arguments, a full path to a file and 'r' for read-only or 'rw' for read-write, and it returns a handle to the file that you can pass to another function or NULL if it failed). The data type each function returns (the file handle or NULL value) is part of the API, as are the arguments each function takes (path to the file and 'r' or 'rw').
The same principle has been applied to the Web in several different ways. What concerns us right now is something called the RESTful API (REpresentational State Transfer), which basically means interacting with a web service using HTTP verbs (GET, PUT, POST, and so forth) and referencing URLs instead of functions in a library. Like HTTP, requests are stateless, which means that you make a request, the server responds, and there's no further context beyond that. You can think of RESTful APIs as fire-and-forget. The general idea is that there is a web server of some kind, which could be a traditional one like Apache or a specialized one running inside a web app built around a server like web.py which responds to those URLs in some way. If you make a GET request to a URL, it'll send you some data. If you make a PUT request you replace something on the server at that URL with something you send it. If you make a POST request you create a new something on the server. If you make a DELETE request that something on the server gets erased. All of this depends on the HTTP verbs the server supports (not all REST APIs need to support all of them), your access permissions (not every account can do everything), whether or not you've authenticated to the server (it is sometimes the case that read-only access doesn't require an account but read-write access does require an account or an API token or something else along those lines), or who owns a particular resource (Alice's resources are read-only for every other account on the server, but read-write for her alone), of course. REST makes life easier but it's not carte blanche to run hog wild. Additionally, many REST API services enforce access limits - you get so many requests per minute, hour, or day and after that it returns errors. For example, Twitter's API will return an Error 420 (enhance your calm) if you trip their rate limiter.
First, you need someplace for the software to live. I'll say up front that you can happily run Huginn on your laptop, desktop workstation, or server so long as it's not running Windows. Huginn is developed under Linux; it might run under one of the BSDs but I've never tried. I don't know if it'll run as expected in MacOSX because I don't have a Mac. If you want to give Huginn a try but you run Windows, I suggest installing VirtualBox and build a quick virtual machine. I recommend sticking with the officially supported distributions and use the latest stable version of Ubuntu Server. At the risk of sounding self-serving, I also suggest using one of my open source Ubuntu hardening sets to lock down the security on your new VM all in one go. If you're feeling adventurous you can get a VPS from a hosting provider like Amazon's AWS or Linode. I run some of my stuff at Digital Ocean and I'm very pleased with their service. If you'd like to give Digital Ocean a try here's my referral link which will give you $10us of credit, and you are not obligated to continue using their service after it's used up. If I didn't like their service (both commercial and customer) that much I wouldn't bother passing it around.
As serious web apps go, Huginn's system requirements aren't very high so you can build a very functional instance without putting a lot of effort or money toward it. You can run Huginn in about one gigabyte of RAM and one CPU, with a relatively small amount of disk space (twenty gigabytes or so, a fairly small amount for servers these days). Digital Ocean's $10us/month droplet (one CPU, one gigabyte of RAM, and 30 gigabytes of storage) is sufficient for experimentation and light use. To really get serious usage out of Huginn you'll need about two gigabytes of RAM to fit multiple worker daemons into memory. I personally use the following specs for all of my Huginn virtual machines: At least two CPUs, 60 gigabytes of disk space, and at least four gigabytes of RAM. Chances are, any physical machine you have on your desk exceeds these requirements so don't worry too much about it (but see these special instructions if you plan on using an ultra-mini machine like the Raspberry Pi). If you build your own virtual machine, take into account these requirements.
For starters, thank you everyone who attended my talk at HOPE XI. I know it was on Sunday afternoon when a lot of people were either getting ready to go home, spending their last bits of time with friends they don't get to see often, or fried from partying the night before. Your attending means a lot to me, and I can't thank you enough. That said, here are the slides from my talk as a single HTML page to read online and as a PDF document to read offline (both were authored in Markdown and generated with Landslide).
BONUS! Here are some proof-of-concept agent networks that you can load into your own Huginn instances and experiment with! Butterfly In China is the agent network that generates my daily weather reports. Shake, Rattle, and Roll monitors the USGS' seismic activity alerting system for earthquakes of a certain strength or above. Tripwire is an HTML parsing-heavy agent network that pulls FBI Most Wanted Lists and sends alerts when they change.