Building your own Google Alerts with Huginn and Searx.

Sep 30 2017

A Google feature that doesn't ordinarily get a lot of attention is Google Alerts, which is a service that sends you links to things that match certain search terms on a periodic basis.  Some people use it for vanity searching because they have a personal brand to maintain, some people use it to keep on top of a rare thing they're interested in (anyone remember the show Probe?), some people use it for bargain hunting, some people use it for intel collection... however, this is all predicated on Google finding out what you're interested in, and interested enough to have it send you the latest search results on a periodic basis.  Not everybody's okay with that.

A while ago, I built my own version of Google Alerts using a couple of tools already integrated into my exocortex which I use to periodically run searches, gather information, and compile reports to read when I have a spare moment.  The advantage to this is that the only entities that know about what I'm interested in are other parts of me, and it's as flexible as I care to make it.  The disadvantage is that I have some infrastructure to maintain, but as I'll get to in a bit there are ways to mitigate the amount of effort required.  Here's how I did it...

The first thing I did was set up a locally running copy of the Searx meta-search engine and configure it to use a bunch of different search engines, including a node of the YaCy search network that I run on one of my servers.  This is the same meta-search engine that forms the back-end of my web search bot and is pretty standard stuff for my current infrastructure.  That it is designed with privacy in mind is a definite bonus.  If you don't want to run your own copy, or for whatever reason you're not in a position to do so, there are a large number of public Searx instances out there that you could use, all of which support the functionality you need.  For this example we're going to use the primary public Searx server run by the developers as a proof of concept.  A little known fact is that every search you run on a Searx server can have its own RSS feed associated with it, meaning that every time something requests that feed the server runs the search again and you get new results.  For this example, we'll use the search term "public searx instances".

The search URL for this example is this: https://searx.me/?q=public%20searx%20instances

Now, see that bit where it says "Download results" and has buttons for CSV, JSON, and RSS?  If you click on the RSS button it'll display the RSS feed in your browser window.  Unfortunately, clicking that button doesn't give you the URL for the RSS feed; I need to open a ticket about that.  However, all you need to do is tack &format=rss onto the end of the search URL to get what you need.  So, the URL we're using for this example really should be this: https://searx.me/?q=public%20searx%20instances&format=rss

You can change that example URL to reflect whatever search terms you want by replacing "public%20searx%20instances" with whatever string you want, just substitute "%20" for the spaces.  This is called URL encoding, incidentally.
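If you'd rather not do the URL encoding by hand, here's a minimal sketch in Python that builds the feed URL for arbitrary search terms.  The function name searx_rss_url is something I made up for illustration; the base URL is the public instance used in this post, and you'd swap in your own.

```python
from urllib.parse import quote, urlencode

SEARX_BASE = "https://searx.me/"  # the public instance used in this post


def searx_rss_url(terms: str, base: str = SEARX_BASE) -> str:
    """Return the RSS feed URL for a Searx search on `terms`."""
    # quote_via=quote encodes spaces as %20 (the default would use "+").
    query = urlencode({"q": terms, "format": "rss"}, quote_via=quote)
    return f"{base}?{query}"


print(searx_rss_url("public searx instances"))
# https://searx.me/?q=public%20searx%20instances&format=rss
```

Characters other than spaces (ampersands, quotes, and so forth) get encoded the same way, which is the part that's easy to get wrong by hand.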

Now we bounce over to your Huginn instance, log in, and open a new scenario.  Let's call it Searcherizer.  Now, before we start standing up any agents let's think about what we want to accomplish.  We want to:

  • Pull the RSS feed for some search terms.
  • Run a sentiment analysis on each hit, because the Sentiment Analysis Agent doesn't get enough love.
  • Deduplicate the events generated from that search to eliminate stuff we've seen before.
  • Build part of the search report.
  • E-mail the search report to the Huginn instance's default e-mail address.

This workflow is pretty straightforward: it's a chain of sequential steps.  In this example, every agent takes as its Source the previous agent in the chain (with the exception of the RSS Agent, which runs by itself on a schedule).  First, let's create an RSS Agent that runs the search at 4am every day:

{
  "expected_update_period_in_days": "365",
  "clean": "false",
  "url": "https://searx.me/?q=searx%20public%20instance&format=rss"
}
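To get a feel for what happens to that feed, here's a rough sketch in Python of turning RSS items into one event per search hit.  This is not Huginn's code, just the idea behind it.  The sample feed is made up, and the mapping of link to url and description to content is my assumption about how the fields line up.

```python
import xml.etree.ElementTree as ET

# A made-up two-item feed standing in for what a Searx instance returns.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Searx search: public searx instances</title>
    <item>
      <title>First hit</title>
      <link>https://example.com/one</link>
      <description>Snippet for the first hit.</description>
    </item>
    <item>
      <title>Second hit</title>
      <link>https://example.com/two</link>
      <description>Snippet for the second hit.</description>
    </item>
  </channel>
</rss>"""


def feed_to_events(feed_xml: str) -> list[dict]:
    """Build one event (a plain dict) per <item> in an RSS feed."""
    root = ET.fromstring(feed_xml)
    events = []
    for item in root.iter("item"):
        events.append({
            "title": item.findtext("title", default=""),
            "url": item.findtext("link", default=""),
            "content": item.findtext("description", default=""),
        })
    return events


for event in feed_to_events(SAMPLE_FEED):
    print(event["title"], event["url"])
```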

This agent will emit a stream of events every time it runs, one per hit/entry in the RSS feed.  We're going to direct each event into a Sentiment Analysis Agent, which is going to process the $.content field, append its findings to the event, and send it to the next agent in the chain.  Sentiment analysis is a pretty complex thing but setting up the agent itself is deceptively easy:

{
  "content": "$.content",
  "expected_receive_period_in_days": 365
}

The Sentiment Analysis Agent hangs the fields of every event it receives off of a field called $.original_event, so if you want to refer to something that agent saw but didn't process you have to prepend $.original_event. to it.  You'll see what this looks like in a moment.
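To make that nesting concrete, here's a small Python sketch of the shape of an event before and after the sentiment analysis step.  The attach_sentiment function is hypothetical and the scores are invented; it only mimics the structure, it doesn't compute anything.

```python
def attach_sentiment(event: dict, valence: float, arousal: float,
                     dominance: float) -> dict:
    """Wrap an event the way described above: the scores and the analyzed
    text sit at the top level, everything else moves under original_event."""
    return {
        "content": event.get("content", ""),
        "valence": valence,
        "arousal": arousal,
        "dominance": dominance,
        "original_event": event,
    }


# A made-up search hit and made-up scores, purely for illustration.
hit = {
    "title": "First hit",
    "url": "https://example.com/one",
    "content": "Snippet for the first hit.",
}
analyzed = attach_sentiment(hit, valence=6.2, arousal=4.1, dominance=5.5)

print("url" in analyzed)                   # False: no longer at the top level
print(analyzed["original_event"]["url"])   # https://example.com/one
```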

Let's remove the events that we've already seen.  The best way to do this is to match on the URL returned by the search hit, because that's what we're really interested in, after all.  Unfortunately the Deduplication Agent doesn't have a JSON editor so I can't copy and paste the example into this post.  So, here's a list of configuration settings:

  • Agent Type: DeDuplication Agent
  • Name: Searcherizer - Deduplicate search hits
  • Schedule: This type of Agent cannot be scheduled. (It runs every time it receives an event.)
  • Controllers: None, leave this blank.
  • Keep events: 7 days (More than sufficient.)
  • Sources: Searcherizer - Run a sentiment analysis on each search hit
  • Propagate immediately: Nah.  Leave this unchecked.
  • Receivers: Leave this blank.  This field will be retroactively filled when we create the next agent in the process.
  • Scenarios: Searcherizer
  • Property: This is the event field that we want to use to detect duplicates.  Set it to this: {{ $.original_event.url }}
  • Lookback: 100 (This is a nice, round number of previous events to compare against.  You can make it bigger if you like, but I don't recommend making it smaller; a shorter lookback makes it more likely that a hit you've already seen slips through as new.)
  • Expected update period in days: 365 (I usually default to this.)
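The Property and Lookback settings above boil down to "remember the last N values of one field and drop anything that matches."  Here's a minimal Python sketch of that idea.  The Deduplicator class is my own illustration and not how Huginn implements the agent internally.

```python
from collections import deque


class Deduplicator:
    """Remember the last `lookback` values seen and flag repeats."""

    def __init__(self, lookback: int = 100):
        # A deque with maxlen silently drops its oldest entry when full.
        self._recent = deque(maxlen=lookback)

    def is_new(self, value: str) -> bool:
        if value in self._recent:
            return False
        self._recent.append(value)
        return True


# A deliberately tiny lookback shows how an old hit can leak back through.
dedup = Deduplicator(lookback=2)
results = [dedup.is_new(url) for url in ["a", "b", "a", "c", "a"]]
print(results)  # [True, True, False, True, True]
```

The last "a" is reported as new because "c" pushed it out of the two-entry window, which is the reason not to shrink the lookback too far.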

Now we need to clean things up a little and turn each event that makes it past the deduplicator into part of the daily search report.  We do this by creating an Event Formatting Agent and using it to write one search result for each event.  It's easier to show you what one looks like than to describe it, because it's basically writing a paragraph of an e-mail (line breaks are to make it easier on mobile screens, they shouldn't be in the actual agent):

{
  "instructions": {
    "message": "{{ $.original_event.title }}<br />{{ $.original_event.url }}
        <br />Good/bad coefficient: {{ valence }}<br />
        Active/passive coefficient: {{ arousal }}<br />
        Strong/weak dominance ratio: {{ dominance }}<br />
        {{ content }}<br /><br />"
  },
  "mode": "clean"
}

In this example, the message field is the writeup of the search hit.  Lines of the paragraph are delineated with HTML <br /> tags (an allowance for sending the messages to a Gmail address).  {{ $.original_event.title }} refers to the $.title field of each search hit sent by Searx, {{ $.original_event.url }} refers to the $.url field (or link to a search hit), and {{ content }} refers to a snippet of text from the search hit if there is any.  The {{ valence }}, {{ arousal }}, and {{ dominance }} tags refer to the findings of the Sentiment Analysis Agent, and reflect whether the text tends toward positive or negative sentiment, active or passive voice, and whether the article is written to come across as being strong or weak, respectively.
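If templates aren't your thing, the formatting step can be sketched in a few lines of Python.  This is an illustration of what the message template does to one event, not Huginn's templating engine; the format_hit function and the sample event (including its scores) are made up.

```python
# Mirrors the "message" option above, minus the line breaks.
MESSAGE_TEMPLATE = (
    "{title}<br />{url}<br />"
    "Good/bad coefficient: {valence}<br />"
    "Active/passive coefficient: {arousal}<br />"
    "Strong/weak dominance ratio: {dominance}<br />"
    "{content}<br /><br />"
)


def format_hit(event: dict) -> str:
    """Render one analyzed search hit as a paragraph of the report."""
    original = event["original_event"]
    return MESSAGE_TEMPLATE.format(
        title=original["title"],
        url=original["url"],
        valence=event["valence"],
        arousal=event["arousal"],
        dominance=event["dominance"],
        content=event["content"],
    )


# A hypothetical event; the scores are invented for illustration.
sample = {
    "content": "Snippet for the first hit.",
    "valence": 6.2,
    "arousal": 4.1,
    "dominance": 5.5,
    "original_event": {
        "title": "First hit",
        "url": "https://example.com/one",
    },
}
print(format_hit(sample))
```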

Now we build the final report using an Email Digest Agent, which takes all of the events it's given, packs them into a single email (this is why we used the Event Formatting Agent to make the search hits look nice), and sends it at a scheduled time; for the sake of example, one hour after the search ran.  We're going to customize the output a little to make it look more nifty, seeing as how we're going for style points.  Create an Email Digest Agent and set it to run at 5am, an hour after the RSS Agent does.  The agent will look something like this:

{
  "subject": "This is your daily web search report.",
  "from": "searcherizer@your-huginn.example.com",
  "expected_receive_period_in_days": "365",
  "body": "message"
}

The from field should have an e-mail address of origin.  I like to set them up so that they come from the name of the scenario (Searcherizer, in this case) running on my server, but you don't have to.  The body field is the body of the email, and should reference the message field in the events output by the previous Event Formatting Agent.  This is the actual search report that will land in your inbox.  Now let's test it: Go back to the "Searcherizer - Search RSS: searx public instance" agent and click the Actions drop-down menu, then Run.  If you set it up correctly, in a few minutes you should see the first search report in your inbox.
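Conceptually, the digest step just glues the formatted hits together into one HTML e-mail.  Here's a rough Python sketch of that using the standard library; it builds the message but doesn't send it.  The build_digest function and the recipient address are placeholders I made up, and this says nothing about how Huginn does it under the hood.

```python
from email.message import EmailMessage


def build_digest(messages: list[str], sender: str, recipient: str,
                 subject: str) -> EmailMessage:
    """Pack every formatted search hit into a single HTML e-mail."""
    email = EmailMessage()
    email["Subject"] = subject
    email["From"] = sender
    email["To"] = recipient
    # The hits already contain <br /> tags, so send the body as HTML.
    email.set_content("".join(messages), subtype="html")
    return email


digest = build_digest(
    ["First hit<br />https://example.com/one<br /><br />",
     "Second hit<br />https://example.com/two<br /><br />"],
    sender="searcherizer@your-huginn.example.com",
    recipient="you@example.com",  # placeholder address
    subject="This is your daily web search report.",
)
print(digest["Subject"])
print(digest.get_content_type())
```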

You can download the Searcherizer scenario from my GitHub repository to experiment with.

To be fair, there are a couple of shortcomings.  First, if you are accustomed to Google almost reading your mind because they have a construct so well trained on your search activity, the output of Searcherizer is probably going to be somewhat jarring to you.  The tradeoff here is that you get some more privacy at the expense of "do what I mean" accuracy; to put it another way, by escaping your personal filter bubble you lose its benefits, seductive though they may be.  Personally, I find that Google Alerts aren't particularly accurate 90% of the time, so the perceived accuracy (or lack thereof) of a given Searx instance's matches is kind of irrelevant.  I tend to get better results using searcherizers but what I search for probably isn't what you're interested in.

Second, the search engines a given Searx instance is set up for can muddy the waters somewhat.  Not every search source is useful or appropriate for your use case.  I don't find the Erowid search engine terribly helpful when I'm looking for information on creating conversational interfaces because it tends to send me links to trip reports about 5-MeO-MIPT or something similar.  Not every search provider is helpful, so consider your Searx configuration carefully.

Possible improvements:

  • Add more searches to this agent network.
  • It's a little inefficient to run a sentiment analysis on each search hit before deduplication.  Modify the network to deduplicate first, so that only new hits get analyzed.
  • Send the search report to more than one e-mail address.
  • Send one e-mail per hit instead of a daily batch.