Technomancer Tools: YaCy

Oct 28 2017

If you've been squirreling away information for any length of time, chances are you tried to keep it all organized for a while and then gave up the effort when the volume reached a certain point.  Everybody has their limit to how hard they'll struggle to keep things organized, and past that point there are really only two options: Give up, or bring in help.  And by 'help' I mean a search engine of some kind that indexes all of your stuff and makes it searchable so you can find what you need.  The idea is to let the software do the work while the user just runs queries against its database to find documents on demand.  Practically every search engine parses HTML to get at the content, but there are others that can also read PDF files, Microsoft Word documents, spreadsheets, plain text, and occasionally even RSS or Atom feeds.  Since I started offloading some file downloading duties to yet another bot my ability to rename files sanely has... let's be honest... it's been gone for years.  Generally speaking, if I need something I have to search for it or it's just not getting done.  So here's how I fill that particular niche in my software ecosystem.

First a little bit about how search engines work.  There are four basic modules, and you're usually only aware of one of them because it's the only one you interact with.  The first is a spider which, when given a URL to start from, downloads documents and parses them to find links to other documents to download.  The second is an indexer, which takes the documents from the spider, analyzes them, builds an index of their contents, and stores it in a database.  The third is a search module that queries the index for matching documents and builds pages of links to the results.  The fourth and most visible is the user interface, which is usually a web page loaded in a browser but there are other ways of using one.  Some search engines index files on the same host the search engine is running on (so-called personal search engines) while most index files and documents on web servers.  Finding one which does both concurrently seems to be a bit problematic, as I'll talk about later.
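
To make the indexer and search modules a little more concrete, here is a minimal sketch in Python of a toy inverted index.  This is not how YaCy or Solr are implemented; the function names and the sample documents are made up for illustration, and the "spider" is replaced by a hard-coded dictionary of already-downloaded text.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Indexer: map each word to the set of document URLs containing it."""
    index = defaultdict(set)
    for url, text in documents.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Search module: return the URLs that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return []
    matches = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(matches)


# Stand-in for what a spider would have downloaded (hypothetical data).
documents = {
    "http://example.com/a": "YaCy is a peer-to-peer search engine",
    "http://example.com/b": "Solr is a search server",
}
index = build_index(documents)
results = search(index, "search engine")
```

A real engine adds ranking, stemming, and persistent storage on top of this, but the basic shape (words pointing at documents, queries intersecting those sets) is roughly the same idea.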

When I was first building out this part of my system architecture I fumbled around for a good, long while because search is a difficult thing to do reasonably well but trivial to do badly.  I eventually stumbled across the YaCy open source search engine and over the years I've become accustomed to its quirks.  It's based upon, among many other components, the Apache Solr search server (which is heavy-duty but not easy to work with alone) but adds an interactive search page on top as well as a peer-to-peer networking system.  One of YaCy's big claims to fame is that there is a global network of YaCy servers, run by people building a huge, distributed search engine that is designed with privacy in mind.  YaCy is also useful for building a private search engine, say, if you want to index your intranet without giving untrusted third parties access to the information there.  My primary use case (well, part of it) is maintaining an index of web pages that catch my attention, so I can find them again when I need to with a fairly quick, simple search.  I generally run those searches with a copy of web_search_bot/ running on the server but at home I can just load the YaCy search page in my web browser.  On the other side, most of the time I use one of my bots to submit URLs to YaCy to index, but I also have a couple of index runs scheduled for entries in some RSS feeds so I don't have to worry about keeping on top of them myself.

When I first installed YaCy I used the official Arch Linux instructions, which worked about as seamlessly as one could hope.  Figuring out how to configure YaCy took a bit longer, and I strongly recommend watching at least a couple of the tutorials on YouTube before taking a shot at it yourself.  Here is a breakdown of my configuration settings, based upon wanting a server that could index parts of the public Net without participating in the YaCy peer-to-peer network:

  • Use Case & Accounts -> Basic Configuration
    • Set language to English.
    • "Search portal for your own web pages"
    • Set peer name.
  • Use Case & Accounts -> User Accounts
    • "Access from localhost without account" (because I have YaCy proxied with Nginx, this works).
  • Use Case & Accounts -> Network Configuration
    • Check "Robinson Mode"
    • Check "Private Peer"
  • RAM/Disk Usage & Updates
    • Change the value of "Memory Settings" to 2048 megabytes (you'll probably need more).
    • Disable crawls below 10240 MiB free space
  • System Administration
    • Advanced Properties
    • At the bottom of the page you'll see two form fields: One for a configuration key name, the other for a configuration key value.  Type the word "robots" (no quotes) into the left-hand field to filter the list of keys.
    • Click on "obeyHtmlRobotsNofollow" and type the word "false" into the right-hand field if that value isn't already there.
    • Click on "obeyHtmlRobotsNoindex" and type the word "false" into the right-hand field if that value isn't already there.

YaCy has a somewhat tricky-to-use API that required a few attempts to get working.  I had to play around with running a few searches on my server so I could copy the URL, and re-read the crawler API documentation, before I got something that worked.  Here's the query URL I use for submitting links for indexing (broken up across multiple lines for readability):

http://localhost:8090/Crawler_p.html?range=wide&crawlingQ=on&crawlingMode=url
    &crawlingstart=Neuen%20Crawl%20starten&mustmatch=.*&xsstopw=on&indexMedia=on
    &cachePolicy=nocache&indexText=on&crawlingDomFilterDepth=1
    &crawlingDomFilterCheck=on&crawlingDepth=1&crawlingURL=

Here's what everything means:

  • range=wide - Specifies the crawling strategy to use on the target.  The default is "wide" but I put it here explicitly to document the mode used.
  • crawlingQ=on - A URL with a question mark in it usually means that the content is dynamically generated.  This tells the spider to index such URLs anyway.
  • crawlingMode=url - Start crawling from the URL given at the end of this string.
  • crawlingstart=Neuen%20Crawl%20starten - Needed to start a new crawl.  I don't think the value matters.
  • mustmatch=.* - A regular expression that must match the URLs to be spidered.  I use /.*/ (zero or more of any character, or "every URL you see") for this.
  • xsstopw=on - Enable stop words, meaning that very common words on YaCy's stop word list are left out of the index.
  • indexMedia=on - Index metadata of multimedia content.
  • cachePolicy=nocache - Never re-read cached content, always re-download it.
  • indexText=on - Index the content of the web pages seen.
  • crawlingDomFilterDepth=1 - I don't know what this means, but YaCy puts it in by default even though it seems to be deprecated.
  • crawlingDomFilterCheck=on - I don't know what this means, but YaCy puts it in by default even though it seems to be deprecated.
  • crawlingDepth=1 - How many levels of links the crawler will follow from the URL it's indexing.  With a value of 1, I have YaCy download and spider not only every URL I give it, but every URL hanging off of that page.
    • A value of 2 would mean "the URL I give it, every URL hanging off of that page, and every URL hanging off of those linked pages."
    • A value of 3 would mean "the URL I give it, every URL hanging off of that page, every URL hanging off of those linked pages, and every URL hanging off of those pages in turn."
    • I think you see how this works.  I found out the hard way that values greater than 2 were counterproductive in the extreme.
  • crawlingURL=<this is the URL to spider, and is appended by web_index_bot/>
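
If you'd rather build that query string in code than by hand, here is a minimal sketch in Python using only the standard library.  The parameter names and values are copied from the URL above; the function names build_crawl_url() and submit_crawl() are made up for this example and are not part of YaCy or of web_index_bot/.  It assumes YaCy is listening on localhost:8090 and allows access from localhost without an account, as in my configuration.

```python
from urllib.parse import quote, urlencode
from urllib.request import urlopen

# Assumption: YaCy is reachable here, as in the query URL above.
YACY_CRAWLER = "http://localhost:8090/Crawler_p.html"

# The same parameters as the hand-built URL, in the same order.
CRAWL_PARAMS = {
    "range": "wide",
    "crawlingQ": "on",
    "crawlingMode": "url",
    "crawlingstart": "Neuen Crawl starten",
    "mustmatch": ".*",
    "xsstopw": "on",
    "indexMedia": "on",
    "cachePolicy": "nocache",
    "indexText": "on",
    "crawlingDomFilterDepth": "1",
    "crawlingDomFilterCheck": "on",
    "crawlingDepth": "1",
}


def build_crawl_url(target):
    """Return the full crawl-start URL for the page to be indexed."""
    params = dict(CRAWL_PARAMS, crawlingURL=target)
    # quote_via=quote encodes spaces as %20 rather than "+".
    return YACY_CRAWLER + "?" + urlencode(params, quote_via=quote)


def submit_crawl(target, timeout=30):
    """Request the crawl-start URL and return the HTTP status code."""
    with urlopen(build_crawl_url(target), timeout=timeout) as response:
        return response.status


crawl_url = build_crawl_url("https://example.com/page?id=1")
```

Percent-encoding the target URL matters here: without it, any "?" or "&" inside the link being submitted would be read as part of the crawler's own query string.  Note that urlencode() also encodes the "*" in mustmatch as %2A, which decodes back to the same regular expression on the server side.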

Because Leandra is in a somewhat bandwidth-constrained environment at this time, I do not have YaCy configured to take part in the global search network.  Additionally, when I finally get YaCy indexing my documents (if it's feasible to do so) I don't want those indices spread across the YaCy search network for privacy-related reasons.  I don't mind helping the entire network at all but I also don't want indices of my personal stuff floating around out there.  I also need to maintain an index of the many documents I have scattered around my various storage arrays, but a limitation of YaCy is that I cannot index both parts of the Net and the files on the same system YaCy is running on.  At the moment this is a feature and not a bug because it protects my privacy.  I haven't figured out how I'm going to do this just yet.  Maybe I need to run multiple copies of YaCy on Leandra, maybe I need to run one copy on a separate server and one on Leandra to index my documents, maybe I need to find a second search engine for my documents.  I don't know yet.

Bonus: Here's how you can set up YaCy to periodically pull an RSS feed and index every new post it finds.  Let's use mine (https://drwho.virtadpt.net/rss/feed.xml) as an example:

  • YaCy Administration page
  • Index Export/Import
  • RSS Feed Importer
  • Paste the URL of the RSS feed into the form and click the "Show RSS Items" button.
  • Change the Indexing selector so that "scheduled" is selected.  Leave the schedule set to "every 7 days."
  • Click the "Add All Items to Index (full content of url)" button to tell YaCy to pull that RSS feed every week and index every item it finds.
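
To get a feel for what YaCy will be handed when it pulls a feed, here is a minimal sketch in Python that lists the item links in an RSS 2.0 document.  It is only an illustration of the idea, not YaCy's own importer; the sample feed and the function name feed_item_links() are made up, and a real feed would be downloaded rather than embedded in the script.

```python
import xml.etree.ElementTree as ET

# A tiny hypothetical RSS 2.0 feed standing in for a downloaded one.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First post</title><link>https://example.com/first</link></item>
    <item><title>Second post</title><link>https://example.com/second</link></item>
  </channel>
</rss>"""


def feed_item_links(feed_xml):
    """Return the <link> of every <item> in an RSS document, in order."""
    root = ET.fromstring(feed_xml)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link:
            links.append(link.strip())
    return links


links = feed_item_links(SAMPLE_FEED)
```

Each of those links is the kind of URL that would end up in the index when the scheduled import runs.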

You can also have YaCy automatically import and index MediaWiki dump files (YaCy Administration -> Index Export/Import -> Dump Reader for MediaWiki Dumps) but I've never done that before so I don't know how well it works.