It should come as little surprise to anyone out there that I have a bit of a problem with hoarding data. Books, music, and of course files of all kinds that I download and read or use in a project for something. Legal briefs, research papers (arXiv is the bane of my existence), stuff people ask me to review, the odd Humble Bundle... So much so that a scant few years ago I rebuilt Leandra to better handle the volume of data in my library. However, it's taken me this long to both figure out and get around to making it easier to find anything in all that mess. If I can't find it, I can't do anything with it, or even figure out what I do or don't have. I also don't often have console access so it's not as if I can SSH in and grep for what I need. I use Nginx as a web server on Leandra so actually getting access to files when I need them is trivial.
A couple of weeks back, I found myself in a discussion with a couple of friends about searching on the Internet and how easy it is to get caught up in a filter bubble and not realize it. Not to put too fine a point on it, because the big search engines (Google, Bing, and so forth) profile users individually and tailor search results to analyses of their search histories (and other personal data they have access to), it's very easy to forget that there are other things out there that you don't know about, for the simple reason that the search engines don't show you anything outside of the profile they've built up. If you're a hardcore code hacker you might find it very difficult to find poetry or the name of a television show you saw once unless you take fairly drastic action. The up-side of this profiling is that, inside of your statistical profile, search results are great. You can find what you need, when you need it. But outside of that? Good luck.
The point of the discussion was that there were ways that we could escape this filter bubble through application of self-hosted software and a little cooperation.
Ironically, searching through my conversation history I can't seem to find the thread in question so I'm relying entirely upon on-board storage (as it were). So, go ahead and laugh while I geek out. First, a little bit of Internet history.
As the title of this post implies, I've been working on some stuff lately that's been taking up enough compute cycles that I haven't been around to post much. Some of this is due to work, because we're getting into the really busy time of year, and when I haven't been at work I've been relaxing. Some of this is due to yet another run of dental work that, while it hasn't really been worth writing about, has resulted in my going to bed and sleeping straight through until the next day. And some of it's due to my hacking on a new project that wound up being... not as hard as I'd imagined it would be, but there certainly has been a steep learning curve.
UPDATED: Added an Nginx configuration block to proxy YaCy.
If you've been squirreling away information for any length of time, chances are you tried to keep it all organized for a while and then gave up the effort when the volume reached a certain point. Everybody has their limit to how hard they'll struggle to keep things organized, and past that point there are really only two options: Give up, or bring in help. And by 'help' I mean a search engine of some kind that indexes all of your stuff and makes it searchable so you can find what you need. The idea is to let the software do the work while the user just runs queries against its database to find documents on demand. Practically every search engine parses HTML to get at the content, and some can also read PDF files, Microsoft Word documents, spreadsheets, plain text, and occasionally even RSS or Atom feeds. Since I started offloading some file downloading duties to yet another bot my ability to rename files sanely has... let's be honest... it's been gone for years. Generally speaking, if I need something I have to search for it or it's just not getting done. So here's how I fill that particular niche in my software ecosystem.
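As an aside, for anyone who hasn't looked under the hood of a search engine: the "index everything, then run queries against it" idea boils down to something like the toy sketch below. This is not the software I actually use, just an illustration of an inverted index in Python. The function names (`tokenize`, `build_index`, `search`) and the sample file names are made up for the example, and a real engine would add file parsers, ranking, and persistent storage on top of this.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in tokenize(text):
            index[word].add(name)
    return index


def search(index, query):
    """Return the sorted names of documents containing every query word."""
    words = tokenize(query)
    if not words:
        return []
    # Start with the matches for the first word, then narrow them down.
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return sorted(results)


if __name__ == "__main__":
    # Hypothetical documents standing in for already-extracted file text.
    library = {
        "brief.txt": "Legal brief on data retention",
        "paper.txt": "Research paper about search engine indexing",
        "notes.txt": "Notes on retention of research data",
    }
    idx = build_index(library)
    print(search(idx, "research data"))
```

The point of the design is that the expensive part (reading every file) happens once at indexing time, so each query is just a handful of set lookups and intersections.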
A Google feature that doesn't ordinarily get a lot of attention is Google Alerts, which is a service that sends you links to things that match certain search terms on a periodic basis. Some people use it for vanity searching because they have a personal brand to maintain, some people use it to keep on top of a rare thing they're interested in (anyone remember the show Probe?), some people use it for bargain hunting, some people use it for intel collection... however, this is all predicated on Google finding out what you're interested in, certainly interested enough to have it send you the latest search results on a periodic basis. Not everybody's okay with that.
A while ago, I built my own version of Google Alerts using a couple of tools already integrated into my exocortex which I use to periodically run searches, gather information, and compile reports to read when I have a spare moment. The advantage to this is that the only entities that know about what I'm interested in are other parts of me, and it's as flexible as I care to make it. The disadvantage is that I have some infrastructure to maintain, but as I'll get to in a bit there are ways to mitigate the amount of effort required. Here's how I did it...