Challenge accepted: Archiving a Mastodon account with Huginn

Nov 17 2019

Last weekend I was running short of stuff to hack around on and lamented this fact on the Fediverse.  I was summarily challenged to find a way to archive posts to the Fediverse in an open, easy to understand data format that was easy to index and did not use any third-party services (like IFTTT or Zapier).  I thought about it a bit and came up with a reasonably simple solution that uses three Huginn agents to collect, process, and write out posts as individual JSON documents to the same box I run that part of my exocortex on.  This is going to go deep geek below the cut, so if it's not your cup of tea, feel free to move on to an earlier post.

For starters, Huginn needs ENABLE_INSECURE_AGENTS=true set in the huginn/.env config file to enable the Local File Agent, which can read from and write to files on whatever host you're running Huginn on.  It's disabled by default because letting agents touch the local filesystem carries a certain amount of risk, but if you're the only person using that instance I think it's reasonably safe.  If the setting isn't already "true", edit the huginn/.env file and restart Huginn.  I'll wait.
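
If you'd rather check the setting than eyeball it, a few lines of Python will do.  This is a minimal sketch and not anything that ships with Huginn: the function name is made up for illustration, and it assumes the file uses plain KEY=value lines with the last assignment winning.

```python
from pathlib import Path


def insecure_agents_enabled(env_text):
    """Return True if ENABLE_INSECURE_AGENTS is set to true in .env-style text."""
    enabled = False
    for raw_line in env_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() == "ENABLE_INSECURE_AGENTS":
            # Assumption: the last assignment in the file wins.
            # Optional surrounding quotes are stripped before comparing.
            enabled = value.strip().strip("'\"").lower() == "true"
    return enabled


if __name__ == "__main__":
    env_file = Path("huginn/.env")  # path from the post; adjust to your install
    if env_file.exists():
        print(insecure_agents_enabled(env_file.read_text()))
    else:
        print(f"{env_file} not found")
```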

The process has three steps:

  • Pull the account's latest posts.
  • Process each event into an easy to read JSON document.
  • Write the JSON document someplace with a logical filename.

Step one is probably the easiest, because every Mastodon account has two public data feeds (warning: orange website), an ATOM feed and an RSS feed.  If you're curious, my ATOM feed is https://hackers.town/@drwho.atom.

I didn't have to mess around with the Mastodon API this time because I could just run the ATOM feed into an RSS Agent.  Every new post in that feed is captured by Huginn as a separate event internally.  Only if there are new posts will any events be emitted.  The settings for this agent look like this:

{
  "expected_update_period_in_days": "365",
  "clean": "false",
  "url": "https://hackers.town/@drwho.atom"
}
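
To make the "only new posts emit events" behavior concrete, here's a rough Python sketch of the same idea outside of Huginn.  It is not how the RSS Agent is implemented; the function names and the in-memory set of seen IDs are assumptions for illustration, and it only picks out a handful of ATOM elements.

```python
import urllib.request
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def fetch_feed(url, timeout=30):
    """Download a feed and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def parse_atom_entries(xml_data):
    """Turn each <entry> in an ATOM feed into a small dict."""
    root = ET.fromstring(xml_data)
    entries = []
    for entry in root.iter(f"{ATOM_NS}entry"):
        entries.append({
            "id": entry.findtext(f"{ATOM_NS}id", default=""),
            "title": entry.findtext(f"{ATOM_NS}title", default=""),
            "content": entry.findtext(f"{ATOM_NS}content", default=""),
            "last_updated": entry.findtext(f"{ATOM_NS}updated", default=""),
        })
    return entries


def new_entries(entries, seen_ids):
    """Return only entries whose id has not been seen, and remember them."""
    fresh = [e for e in entries if e["id"] not in seen_ids]
    seen_ids.update(e["id"] for e in fresh)
    return fresh
```

Calling new_entries() twice with the same feed and the same set returns an empty list the second time, which is the behavior described above: no new posts, no events.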

Step two: Take each event, pick out only the stuff I care about, and stuff it into a new event.  This is the bread and butter of an Event Formatting Agent, which repacks data from each event into a data structure called "toot".  This is necessary because I wasn't able to process the raw events directly; they had to be reformatted first.  The agent in question looks like this:

{
  "instructions": {
    "toot": {
      "id": "{{ id }}",
      "url": "{{ url }}",
      "links": "{{ links }}",
      "title": "{{ title }}",
      "content": "{{ content }}",
      "authors": "{{ authors }}",
      "categories": "{{ categories }}",
      "last_updated": "{{ last_updated }}"
    }
  },
  "mode": "clean"
}
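
In plain Python terms, the repacking amounts to something like the sketch below.  The function and constant names are mine, not Huginn's, and it rests on two assumptions: that "clean" mode means the new event carries only the formatted payload, and that a template for a missing field renders as an empty string.

```python
TOOT_FIELDS = (
    "id", "url", "links", "title",
    "content", "authors", "categories", "last_updated",
)


def format_event(event):
    """Repack an incoming event into a new event holding only a "toot" key."""
    toot = {}
    for field in TOOT_FIELDS:
        value = event.get(field)
        # Assumption: templates render to text, and a missing value
        # becomes an empty string rather than an error.
        toot[field] = "" if value is None else str(value)
    # Nothing from the original event survives except the picked fields.
    return {"toot": toot}
```

Anything in the incoming event that isn't listed in TOOT_FIELDS is simply dropped, which keeps the archived documents small and predictable.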

Step three: Write each post out as a separate file.  This is where the Local File Agent comes in, and it wound up being the trickiest part because I'd never worked with it before.  It took some trial and error before I got it right, so here's the agent configuration, which I'll explain in a moment:

{
  "mode": "write",
  "watch": "false",
  "path": "/home/drwho/hackers.town-archive/{{ 'now' | date: '%s' }}.json",
  "append_radio": "false",
  "append": "false",
  "data": "{{ toot | json }}"
}

"mode" - The agent will be writing data, not reading it.

"watch" - Do not watch a file for any changes because we're writing data, not reading it.

"path" - The full path with filename on the local host to write data to.  The filespec has a Liquid template that basically says "get the current system time, convert it into time_t, and use that as the filename."  I picked this style of datestamp because they naturally sort in ascending numerical order in a directory listing, so the order in which they occurred is implicit.

"append_radio" and "append" - Do not append to the file, just write it out in toto.  This value is set by a radio button but is stored internally as a boolean.

"data" - The stuff that should be written to the file.  In this case, the data structure called "toot", which gets run through another Liquid template filter to turn it into a JSON document, which looks like this on disk:

{
    "id": "https://hackers.town/users/drwho/statuses/103155117891294839",
    "url": "https://hackers.town/@drwho/103155117891294839",
    "links": "{\"href\"=>\"https://hackers.town/users/drwho/statuses/103155117891294839\", \"rel\"=>\"alternate\", \"type\"=>\"application/activity+json\"}{\"href\"=>\"http://activityschema.org/collection/public\", \"rel\"=>\"mentioned\"}{\"href\"=>\"https://hackers.town/@drwho/103155117891294839\", \"rel\"=>\"alternate\", \"type\"=>\"text/html\"}{\"href\"=>\"https://hackers.town/users/drwho/updates/310508.atom\", \"rel\"=>\"self\", \"type\"=>\"application/atom+xml\"}",
    "title": "New status by drwho",
    "content": "<p>Has anyone running a <a href=\"https://hackers.town/tags/pwnagotchi\" class=\"mention hashtag\" rel=\"tag\">#<span>pwnagotchi</span></a> with a Waveshare v2 display encountered a situation where the display looks like it&apos;s snow crashed when it&apos;s in AUTO mode?  Trying to figure out why Dalton keeps doing that.</p>",
    "authors": "",
    "categories": "pwnagotchi",
    "last_updated": "2019-11-17T20:23:04+00:00"
}
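
For comparison, here is roughly the same write step as a Python sketch.  It is not the Local File Agent's code: the function names are hypothetical, creating the directory is my own addition, and the four-space indent is only there to resemble the sample above.

```python
import json
import time
from pathlib import Path


def archive_path(directory, now=None):
    """Build <directory>/<epoch seconds>.json, like {{ 'now' | date: '%s' }}."""
    seconds = int(time.time() if now is None else now)
    return Path(directory) / f"{seconds}.json"


def write_toot(toot, directory, now=None):
    """Write one post to its own file, replacing any file with the same name."""
    path = archive_path(directory, now)
    # Not something the agent config asks for; added so the sketch runs anywhere.
    path.parent.mkdir(parents=True, exist_ok=True)
    # Mode "w" truncates instead of appending, mirroring "append": "false".
    with path.open("w", encoding="utf-8") as handle:
        json.dump(toot, handle, indent=4)
    return path
```

One thing the sketch makes obvious: two posts written within the same second get the same filename, and because nothing is appended the second one would replace the first.  I'd expect the same to hold for the agent, though I haven't verified it.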

The three agents are hooked together in series, in the order I've written about them.  It's a very simple three-agent network.  For reference I've uploaded it to my GitHub repository of Huginn agent networks named Elephant, because it's said that they never forget (and I liked the idea of elephants being a memetic counterpart to mastodons).
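
Since the point of the challenge was an archive that's easy to index, here's a minimal Python sketch of reading it back.  Both function names are hypothetical.  It assumes every file in the directory is named with epoch seconds, and that the "links" value always looks like the Ruby-style hashes in the sample document, so treat the link parsing as a heuristic rather than a proper parser.

```python
import json
from pathlib import Path


def load_archive(directory):
    """Return archived posts, oldest first, using the timestamp in each filename."""
    files = sorted(Path(directory).glob("*.json"), key=lambda p: int(p.stem))
    return [json.loads(p.read_text(encoding="utf-8")) for p in files]


def parse_links(links):
    """Split the run-together {"href"=>"..."} hashes into a list of dicts."""
    decoder = json.JSONDecoder()
    # Heuristic: turning "=>" between quoted strings into ":" yields JSON objects.
    text = links.replace('"=>"', '":"').strip()
    result = []
    pos = 0
    while pos < len(text):
        obj, pos = decoder.raw_decode(text, pos)
        result.append(obj)
        # Skip any whitespace between consecutive objects.
        while pos < len(text) and text[pos].isspace():
            pos += 1
    return result
```

From there it's a short hop to whatever index you like: each post is a dict, and each link is a dict with "href" and "rel" keys.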

Enjoy!