It should come as little surprise to anyone out there that I have a bit of a problem with hoarding data. Books, music, and of course files of all kinds that I download and read or use in a project for something. Legal briefs, research papers (arXiv is the bane of my existence), stuff people ask me to review, the odd Humble Bundle... So much so that a scant few years ago I rebuilt Leandra to better handle the volume of data in my library. However, it's taken me this long to figure out how to make it easier to find anything in all that mess, and then to get around to actually doing it. If I can't find it, I can't do anything with it, or even figure out what I do or don't have. I also don't often have console access, so it's not as if I can SSH in and grep for what I need. I use Nginx as a web server on Leandra, so actually getting access to files when I need them is trivial.
A couple of weeks back, I found myself in a discussion with a couple of friends about searching on the Internet and how easy it is to get caught up in a filter bubble and not realize it. Not to put too fine a point on it, because the big search engines (Google, Bing, and so forth) profile users individually and tailor search results to analyses of their search histories (and other personal data they have access to), it's very easy to forget that there are other things out there that you don't know about for the simple reason that the search engines don't show you stuff outside of the profile they've built up. If you're a hardcore code hacker you might find it very difficult to find poetry or the name of a television show you saw once unless you take fairly drastic action. The up-side of this profiling is that, inside of your statistical profile, search results are great. You can find what you need, when you need it. But outside of that? Good luck.
The point of the discussion was that there were ways that we could escape this filter bubble through application of self-hosted software and a little cooperation.
Ironically, searching through my conversation history I can't seem to find the thread in question so I'm relying entirely upon on-board storage (as it were). So, go ahead and laugh while I geek out. First, a little bit of Internet history.
UPDATED: Added an Nginx configuration block to proxy YaCy.
If you've been squirreling away information for any length of time, chances are you tried to keep it all organized for a while and then gave up the effort when the volume reached a certain point. Everybody has their limit to how hard they'll struggle to keep things organized, and past that point there are really only two options: Give up, or bring in help. And by 'help' I mean a search engine of some kind that indexes all of your stuff and makes it searchable so you can find what you need. The idea is, let the software do the work while the user just runs queries against its database to find the documents on demand. Practically every search engine parses HTML to get at the content, but there are others that can read PDF files, Microsoft Word documents, spreadsheets, plain text, and occasionally even RSS or Atom feeds. Since I started offloading some file downloading duties to yet another bot, my ability to rename files sanely has... let's be honest... it's been gone for years. Generally speaking, if I need something I have to search for it or it's just not getting done. So here's how I fill that particular niche in my software ecosystem.