Sometimes the old ways may be best.

Feb 02 2019

A couple of weeks back, I found myself in a discussion with a couple of friends about searching on the Internet and how easy it is to get caught up in a filter bubble and not realize it.  To put not too fine a point on it, because the big search engines (Google, Bing, and so forth) profile users individually and tailor search results to analyses of their search histories (and other personal data they have access to), it's very easy to forget that there are other things out there that you don't know about for the simple reason that the search engines don't show you anything outside of the profile they've built up.  If you're a hardcore code hacker you might find it very difficult to find poetry or the name of a television show you saw once unless you take fairly drastic action.  The up-side of this profiling is that, inside of your statistical profile, search results are great.  You can find what you need, when you need it.  But outside of that?  Good luck.

The point of the discussion was that there were ways that we could escape this filter bubble through application of self-hosted software and a little cooperation.

Ironically, searching through my conversation history I can't seem to find the thread in question so I'm relying entirely upon on-board storage (as it were).  So, go ahead and laugh while I geek out.  First, a little bit of Internet history.

Way, way, way back when, before Google was a thing and people still maintained personal homepages, it was not uncommon for people to post and curate lists of links to things they liked.  This was pretty much how the web as we know it came to be, people linking to stuff that linked to other stuff, and so forth.  As time passed some folks began collaborating on what net.history refers to as web directories, which were basically lists of hyperlinks (usually on the same topic) maintained by small groups of people.  Curated web directories got more and more popular until the grand-daddy of them all, DMOZ (directory.mozilla.org) came to be in 1998.  It doesn't exist anymore because it was bought out by AOL (which was later bought out by Verizon, under their Oath brand) and shut down a few years ago, but an open-source copy still exists online at dmoz-odp.org.  For a long while it was probably the largest index of curated online resources out there.  It was for this reason that it was a popular starting point for many search engines back in the day, from Lycos and Yahoo all the way to the very earliest iteration of Google and college projects pertaining to web crawling, indexing, and search.

This was the direction the discussion I mentioned earlier went in.  All of us curate our own lists of useful (and, most importantly, still active) bookmarks online and we all have an interest in search and indexing of data.  This is what we came up with:

Take a personal bookmarking system like Shaarli, which stores personal databases of useful links and occasionally notes.  They can be private (you can't see the contents unless you're logged in), they can be public (the contents are all out there for anyone to look at), or they can be a combination of the two (some links kept private, some links public).  Shaarli offers extremely flexible RSS and ATOM feeds of new content added to its database.  It's nowhere near the size of DMOZ but your average Shaarli install is a good place to start.  You could certainly do worse.

Take a federated, open source data search and indexing system like YaCy.  Configure it so that you have a personal search engine.  Assuming for the sake of this discussion that you have enough bandwidth to participate in the global YaCy network, do not configure it for Robinson Mode (not participating in the global YaCy network), just leave it as-is.  Now, pull up the RSS feed for your Shaarli instance by adding ?do=rss to the end of the URL; for example, here's mine.  If you have any private links in there don't worry, they won't appear in the RSS or ATOM feed.  Tell YaCy to load your Shaarli RSS feed:

  • YaCy Administration Page
  • Index Export/Import
  • RSS Feed Importer
  • Paste your Shaarli RSS feed's URL here
  • Click "Show RSS items" to make sure it can load the feed
  • Change the indexing selector so that it says "Scheduled"
  • Change the schedule to "Every 1 day"
  • Click the "Add All Items to Index (full content of url)" button to tell YaCy to pull that RSS feed every week and index every item it finds.

What this does is tell YaCy to pull the RSS feed of your bookmark collection daily and index whatever new entries it finds.  This will slowly grow a search index of new websites as you bookmark them.  If you've configured YaCy to participate in the global YaCy peer-to-peer network, you are also helping the global index of the Internet it's building to grow by adding carefully curated links to things you find personally useful and/or interesting.
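
If you'd like to sanity-check what your Shaarli feed exposes before handing it to YaCy, here is a minimal Python sketch using only the standard library.  It is not part of Shaarli or YaCy.  The function names are made up for illustration, and it assumes that your Shaarli base URL is the directory the instance lives in and that the feed is ordinary RSS 2.0 with <item>, <title>, and <link> elements.

```python
import urllib.request
import xml.etree.ElementTree as ET


def shaarli_feed_url(base_url):
    """Build the RSS feed URL by adding ?do=rss to the end of the base URL.

    Assumption: base_url is the directory of the Shaarli install,
    e.g. https://example.com/shaarli (a placeholder, not a real instance).
    """
    return base_url.rstrip("/") + "/?do=rss"


def fetch_feed(url, timeout=10.0):
    """Download the feed and return the raw bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def parse_rss_items(rss_data):
    """Return a list of (title, link) pairs, one per <item> that has a link.

    rss_data may be bytes (as returned by fetch_feed) or a str.
    """
    root = ET.fromstring(rss_data)
    items = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        if link:  # skip entries with nothing for a crawler to follow
            items.append((title, link))
    return items
```

To try it, call parse_rss_items(fetch_feed(shaarli_feed_url("https://example.com/shaarli"))) with your own address substituted in, and compare the links it returns against what "Show RSS items" displays in YaCy.  Private links should not show up in either place.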

The $64kus question is, will this replace Google?  An honest answer is, no, probably not.  Google is a megacorp that throws billions of dollars at software development every year, with armies of software developers working around the clock.  The YaCy network is run by a bunch of computer enthusiasts.  What this does accomplish, however, is help construct a search engine that does not have a filter bubble (or at least, less of a filter bubble) because there are no profiling and ranking algorithms deciding what to show or not show a given user at a given time.  The contents of the YaCy search index will be more carefully curated because they were selected by people for specific reasons with specific outcomes in mind.  You will probably not get the perfectly tailored search results of Google, but you will get search results that are definitely relevant to your interests because they were added by someone with similar interests and needs to your own.  Could it be gamed?  Probably.  It would be difficult because a rogue YaCy instance would need to be hacked to index specific data in specific, bad-actor-defined ways.  It would be difficult but not impossible, and probably not worth the time and effort which could be "better" spent gaming Google, Bing, et al. with easier and better known tactics.