Organizing a data hoard with YaCy.

Feb 02 2019

It should come as little surprise to anyone out there that I have a bit of a problem with hoarding data: books, music, and of course files of all kinds that I download and read or use in a project.  Legal briefs, research papers (arXiv is the bane of my existence), stuff people ask me to review, the odd Humble Bundle... So much so that a scant few years ago I rebuilt Leandra to better handle the volume of data in my library.  However, it's taken me this long to both figure out how to make it easier to find anything in all that mess and get around to doing it.  If I can't find it, I can't do anything with it, or even figure out what I do or don't have.  I also don't often have console access so it's not as if I can SSH in and grep for what I need.  I use Nginx as a web server on Leandra so actually getting access to files when I need them is trivial.

So, I did some research, did some thinking, ran quite a few miles, and decided that it made more sense to just set up another instance of YaCy on Leandra with a slightly different configuration.  For a while I'd considered running YaCy on a separate machine on my network as a dedicated search appliance but when I ran the numbers I realized that it would be pulling terabytes of data across the network on a weekly if not daily basis.  I ran a couple of tests on a RasPi and wasn't able to keep it online due to its limited memory (I've since found a couple of tips for doing just this on the YaCy wiki now that it's back online).  That, coupled with the network traffic issue, ruled out this strategy.  I tried a couple of other information management packages but none of them really did what I wanted in the way I wanted.  Eventually I opted to take the bull by the horns and go with the most straightforward solution: Clone the source code for YaCy onto Leandra, build it following the instructions, and fire it up.  Due to how YaCy uses databases it didn't make sense to try to hammer one copy installed from an Arch Linux AUR package into doing double duty, so the most logical thing to do was deal with the fact that there were two running copies and two separate indexes and go do more interesting things.

I had to configure YaCy to listen on a different port because the default of 8090/tcp is occupied by the other instance that indexes all of the stuff I throw at it.  When it was up and running I configured it for Intranet Indexing mode (YaCy Administration -> Use Case & Account -> check Intranet Indexing) and gave the search engine a name.  Intranet mode basically tells YaCy "Figure out the local network you're on and only index things you find on that network because you're acting as an in-house search portal."  I'm not certain, but I think this also implicitly prevents YaCy from interacting with the global YaCy peer-to-peer network, because it doesn't make any sense to scatter information about private data to the four winds.  Next was kicking off an initial indexing run of my library (YaCy Administration -> Load Web Pages, Crawler -> Site: http://leandra/ -> click "Start New Crawl").  Due to how networking functions in Linux, YaCy's search traffic basically hit the network card, took a hairpin turn right back around to the webserver, and at no time did it impact the rest of my network.  While the initial indexing run was going I set the same job to run on a daily basis (YaCy Administration -> Process Scheduler -> pick the "crawl start for http://leandra/" -> Scheduler column -> change "no repetition" to "activate scheduler" -> every 1 day -> click "Execute Selected Actions") so that the index would always be relatively up to date.
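
Once the crawl has picked up a few documents, the new instance can be sanity-checked from a shell before anything else gets layered on top of it.  What follows is a minimal sketch, not part of the setup described above.  It assumes the second instance listens on port 8091 (a made-up value; substitute whatever port you picked), that curl and jq are installed, and that YaCy's yacysearch.json endpoint returns its hits under channels[0].items, which is how I understand its JSON output to be laid out.  The function and variable names are hypothetical.

```bash
#!/usr/bin/env bash

# Hypothetical address of the second YaCy instance; the only thing known for
# certain is that it is not the default 8090/tcp.
YACY_LIBRARY_URL="${YACY_LIBRARY_URL:-http://127.0.0.1:8091}"

# Print the search endpoint of the library instance (trailing slash tolerated).
yacy_search_endpoint() {
    printf '%s/yacysearch.json' "${YACY_LIBRARY_URL%/}"
}

# Query the index and print one "title<TAB>link" line per hit.
# $1 is the search term, $2 the maximum number of hits (defaults to 10).
# Assumption: hits live in .channels[0].items of the JSON response.
yacy_library_search() {
    curl --silent --fail --get \
        --data-urlencode "query=$1" \
        --data "maximumRecords=${2:-10}" \
        "$(yacy_search_endpoint)" |
        jq -r '.channels[0].items[] | "\(.title)\t\(.link)"'
}
```

After sourcing the file, something like yacy_library_search "humble bundle" 5 should list up to five matching documents if the index has anything in it.  Letting curl do the URL encoding keeps search terms with spaces or punctuation from mangling the request.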

Now, here's something weird: I tried adding this second YaCy server to the Searx configs that one of my web_search_bot/ instances uses but for some reason it didn't seem to work.  I tried a couple of variations but wasn't able to get any search results out of stuff I know I have.  Rather than waste more time fighting with it, I opted to take the path of least resistance and set up another Searx instance on Leandra on a different port.  This copy of Searx is configured to only support the YaCy search engine so it is little more than an API provider (also meaning that I don't have to write a YaCy specific bot).  I then made another copy of web_search_bot.conf called library_bot.conf on Leandra and configured it to use the second Searx API to communicate with the second copy of YaCy.  The URL to do this looks a little weird because it only makes calls to YaCy and nothing else (another weird thing I have to look into) and happens to be this: http://127.0.0.1:9999/?format=json&q=%21yy%20
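
For what it's worth, the %21yy%20 at the end of that URL is just "!yy " after URL encoding: a Searx bang that routes the query to the engine with the shortcut yy, with the search term tacked on after it.  The same request can be made by hand, which is useful for checking the Searx-to-YaCy leg without the bot in the way.  This is a rough sketch under a few assumptions: curl and jq are installed, the Searx JSON response carries its hits in a top-level results array with title and url fields, and the function names are made up for illustration.

```bash
#!/usr/bin/env bash

# Address of the library-only Searx instance, taken from the URL above.
SEARX_LIBRARY_URL="${SEARX_LIBRARY_URL:-http://127.0.0.1:9999/}"

# Prefix a search term with the !yy bang (%21yy%20 once URL-encoded) so that
# Searx hands the query to its YaCy engine only.
library_query() {
    printf '!yy %s' "$1"
}

# Run a search and print one "title<TAB>url" line per result.
# Assumption: hits live in the top-level "results" array of the JSON response.
library_search() {
    curl --silent --fail --get \
        --data "format=json" \
        --data-urlencode "q=$(library_query "$1")" \
        "$SEARX_LIBRARY_URL" |
        jq -r '.results[] | "\(.title)\t\(.url)"'
}
```

Running library_search "legal brief" after sourcing the file should print whatever the second YaCy instance has indexed for that term.  If it prints nothing while a direct query against YaCy does return hits, the problem is in the Searx configuration rather than in the index.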

I made another copy of the run.sh wrapper script for web_search_bot/ called librarybot.sh and edited it to load the library_bot.conf file, like this:

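#!/usr/bin/env bash

# Activate the bot's virtualenv by hand.
export VIRTUAL_ENV="$(pwd)/env"
export PATH="$VIRTUAL_ENV/bin:$PATH"
export PS1="(virtualenv) $PS1"
# PYTHONHOME would override the virtualenv's interpreter paths.
unset PYTHONHOME

# Replace this shell with the bot, pointed at the library config.
exec ./web_search_bot.py --config library_bot.conf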

I added a new message queue for the bot to the config file for the XMPP bridge, restarted the XMPP bridge, and started up my Librarybot variant.  Lo and behold, I can now search my files on Leandra from an XMPP client.

YaCy wound up being the right tool for the job in the long run for the simple reason that it can parse and index practically every type of file I have, from standard, boring ASCII text files to tabular data and CAD drawings.  Also, after trying a bunch of different search packages, from GNOME Tracker to Open Semantic Search to Perkeep to Ambar, YaCy just does what I need with a minimum of screwing around and writing shims to trick software into doing what I wanted.  There's no shortage of search libraries and frameworks out there, and in theory I could have written my own document search engine (and I might do something like that one day).  But ultimately it came down to a fairly simple decision: Do I want to spend time writing a search engine and deal with the pain in the ass of not being able to find stuff for much longer, or do I want to go with a solution that I know works, will do what I want, and can get me on the road?