Jul 28 2019
I spend a lot of time digging around in other people's data. If I'm not hunting for anything in particular then it's a bit of a crapshoot, to be honest, if only because you never know what you're in for. You can pretty much take it to the bank that if you didn't assemble it yourself, you can't count on it being complete, well formed, or anything approximating the output of a human being (it usually came out of a database, but I think you see what I'm getting at). Sometimes, if I'm really lucky I'll just get hold of a JSON dump of the database, which to be fair is better than nothing when there isn't even an API to use. From time to time I'll make an attempt at fitting the data into a database of some kind, sometimes MySQL, sometimes SQLite, or occasionally an API layer like Sandman2. This is all well and good, but it winds up being more of an adventure than I'm looking for. I'd much rather be Indiana Jones prowling around in the temple than Rambo going through a preparation montage because Indy was actually getting stuff done.
Wow, this article went a little off the rails. I was never good at writing intros to new code... anyway.
Jan 15 2017
EDIT: 20170123 - My reviewers have suggested some edits to the article, many of which I've applied.
It's been a while since I wrote a Huginn tutorial, so let's start with a basic one to get you comfortable with the idea of building an agent network. This agent network will run every half hour, poll a REST API endpoint, and e-mail you what it gets. You'll have to have access to an already running Huginn instance that can send outbound e-mail. This post is going to be kind of lengthy, but that's because I'm laying out some fundamentals. Once you understand those you can skip past the explanations and move on to the good stuff.
First, a little background - what's a REST API? If you already know just skip down past the cut and move on, but if you don't know what I'm talking about I'll try to explain. I'm going to assume that you've been able to install Huginn using my instructions or someone else's, or you've got access to a running instance. I'm also going to assume that you're not a hardcore coder, you're someone who's trying to apply a useful tool to your everyday life.
At its simplest, an API (Application Program Interface) is a way to interact with a system or part of a system. It's (hopefully) designed to be regular, which means that once you understand the basics you can apply that knowledge to figure out the more complex parts with a little messing around because the basics continue to apply. Let's say that I've written a library called myLib, which implements a bunch of really annoying stuff (like opening and closing files and sorting elements of data) so you don't have to. My library has a bunch of functions that carry out those tasks (openStupidFile(), readAllOfFilesContents(), sortIntegers(), sortFloatingPointValues(), searchThisCrapForAString()) when you call them in your own code. Those functions are part of my library's API. In the documentation are instructions for calling each function, which includes the arguments you need to pass to each function (e.g., openStupidFile() takes two arguments, a full path to a file and 'r' for read-only or 'rw' for read-write, and it returns a handle to the file that you can pass to another function or NULL if it failed). The data type each function returns (the file handle or NULL value) is part of the API, as are the arguments each function takes (path to the file and 'r' or 'rw').
The same principle has been applied to the Web in several different ways. What concerns us right now is something called the RESTful API (REpresentational State Transfer), which basically means interacting with a web service using HTTP verbs (GET, PUT, POST, and so forth) and referencing URLs instead of functions in a library. Like HTTP, requests are stateless, which means that you make a request, the server responds, and there's no further context beyond that. You can think of RESTful APIs as fire-and-forget. The general idea is that there is a web server of some kind, which could be a traditional one like Apache or a specialized one running inside a web app built around a server like web.py which responds to those URLs in some way. If you make a GET request to a URL, it'll send you some data. If you make a PUT request you replace something on the server at that URL with something you send it. If you make a POST request you create a new something on the server. If you make a DELETE request that something on the server gets erased. All of this depends on the HTTP verbs the server supports (not all REST APIs need to support all of them), your access permissions (not every account can do everything), whether or not you've authenticated to the server (it is sometimes the case that read-only access doesn't require an account but read-write access does require an account or an API token or something else along those lines), or who owns a particular resource (Alice's resources are read-only for every other account on the server, but read-write for her alone), of course. REST makes life easier but it's not carte blanche to run hog wild. Additionally, many REST API services enforce access limits - you get so many requests per minute, hour, or day and after that it returns errors. For example, Twitter's API will return an Error 420 (enhance your calm) if you trip their rate limiter.