Updating the Search Function of my Website.

Not too long ago I got fed up with how good a job DuckDuckGo's site search feature wasn't doing. No matter what I did I couldn't find dick around here. And, folksonomies being what they are, unless you plan them (and then they won't be folksonomies) you probably won't remember what tags you used. It's frustrating to get lost in what amounts to your own house. So, one night I got well and fed up and decided to put some of my spare computing power to use. I did a walk-around of my exocortex and figured out that Jackpoint* had some RAM and a core free. So... off to one of my favorite pieces of software.

Installing YaCy is pretty easy if you read the directions (and even if you've done it a few times it's still a good idea). So I installed a headless JDK (sudo apt-get install openjdk-8-jdk-headless) on Jackpoint and then cloned the YaCy source code. I then had to configure nginx to proxy YaCy and cache the static HTML stuff. For the sake of completeness, because this is a fairly common thing people ask about, here are the relevant parts from the file /etc/nginx/sites-enabled/heterochromia.virtadpt.net on Jackpoint:

        location / {
                proxy_http_version 1.1;
                proxy_buffering off;
                proxy_set_header Upgrade $http_upgrade;
                proxy_set_header Proxy "";
                proxy_pass http://127.0.0.1:8090/;
                proxy_redirect off;
                proxy_set_header Host $host;
                proxy_set_header X-Real-IP $remote_addr;
                proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
                client_max_body_size       10m;
                client_body_buffer_size    128k;
                proxy_connect_timeout      10s;
                proxy_send_timeout         10s;
                proxy_read_timeout         10s;
                proxy_buffer_size          4k;
                proxy_buffers              4 32k;
                proxy_busy_buffers_size    64k;
                proxy_temp_file_write_size 64k;
        }
        location /env {
            proxy_pass http://127.0.0.1:8090/env;
        }

Once that was done it was easy to set up a periodic indexing run of my website:

  • YaCy Administration -> Load Web Pages, Crawler
  • Site: https://drwho.virtadpt.net/
  • Start New Crawl
  • Process Scheduler -> crawl start for https://drwho.virtadpt.net/
  • Event Trigger -> no event
  • Scheduler -> 7 days
  • Execute Selected Actions
  • Index Export/Import -> RSS Feed Importer
  • URL of the RSS Feed: https://drwho.virtadpt.net/rss/feed.xml
  • Show RSS Items
  • Indexing -> scheduled -> repeat the feed loading every 7 days automatically
  • Add All Items to Index (full content of URL)

However, there were two problems to solve. The first was that YaCy, being a search engine, spiders not just my website but also every link leading away from it. This meant that it was more likely to return hits for stuff I linked to than hits on my website (which was the whole point). The second was modifying my website's theme so that it had a YaCy search box instead of a DDG search box. But let's tackle one problem at a time.

It took some tinkering before I figured out the first problem. The solution that worked was the following process:

  • YaCy Administration -> Ranking and Heuristics
  • Filter Query -> fq= host_s:drwho.virtadpt.net
  • Set Filter Query
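
For anyone who wants to see what that filter amounts to outside of the admin panel, here is a minimal Python sketch that builds a Solr-style query URL carrying the same fq restriction. It assumes the YaCy instance exposes its Solr-compatible /solr/select endpoint on the local address from the nginx config above; the helper name build_solr_url, the sample search terms, and the rows and wt parameters are mine for illustration and not something the admin panel generates.

```python
from urllib.parse import urlencode

# Local YaCy address, taken from the proxy_pass line in the nginx config.
YACY_BASE = "http://127.0.0.1:8090"

# The filter query entered under Ranking and Heuristics.
SITE_FILTER = "host_s:drwho.virtadpt.net"


def build_solr_url(terms: str, rows: int = 10) -> str:
    """Hypothetical helper: build a select URL limited to one host.

    Assumption: the instance answers Solr-style requests at /solr/select.
    """
    params = {
        "q": terms,         # what to search for
        "fq": SITE_FILTER,  # only keep documents whose host_s matches
        "rows": rows,       # number of results to ask for
        "wt": "json",       # standard Solr response-format parameter
    }
    return f"{YACY_BASE}/solr/select?{urlencode(params)}"


if __name__ == "__main__":
    print(build_solr_url("yacy nginx"))
```

The point of the sketch is only that fq narrows the result set to one host without touching the ranking; the boosts further down are what reorder the hits.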

I also tweaked some of the settings in the Solr Boosts part of the same page to get more specific and accurate search hits (explanations of what these settings mean are right next to them so refer to your own YaCy server for details):

  • sku: 1.25
  • title: 3.0
  • host_s: 6.0
  • dates_in_content_dts: 1.0
  • description_txt: 2.0
  • keywords: 3.0
  • text_t: 8.0
  • synonyms_ext: 1.0
  • url_file_name_s: 1.0
  • url_file_name_tokens_t: 4.0
  • url_paths_sxt: 3.0
  • click Set Field Boosts
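
As a rough illustration of what those numbers mean relative to each other, the following Python sketch renders them in the field^boost notation that Solr query parsers use for per-field weights and picks out the heaviest fields. The values are copied from the list above; the helper names are hypothetical, and how YaCy stores the boosts internally is not something this sketch claims to show.

```python
# Boost values copied from the Solr Boosts list above.
FIELD_BOOSTS = {
    "sku": 1.25,
    "title": 3.0,
    "host_s": 6.0,
    "dates_in_content_dts": 1.0,
    "description_txt": 2.0,
    "keywords": 3.0,
    "text_t": 8.0,
    "synonyms_ext": 1.0,
    "url_file_name_s": 1.0,
    "url_file_name_tokens_t": 4.0,
    "url_paths_sxt": 3.0,
}


def to_boost_string(boosts: dict) -> str:
    """Hypothetical helper: render boosts as space-separated field^weight."""
    return " ".join(f"{field}^{weight}" for field, weight in boosts.items())


def strongest_fields(boosts: dict, n: int = 3) -> list:
    """Hypothetical helper: return the n fields with the largest boosts."""
    return sorted(boosts, key=boosts.get, reverse=True)[:n]


if __name__ == "__main__":
    print(to_boost_string(FIELD_BOOSTS))
    print(strongest_fields(FIELD_BOOSTS))
```

Read this way, a match in the body text (text_t) counts for the most, followed by the host name and the tokens of the file name in the URL.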

After some experimentation I settled on the above settings. That brought me right along to the second problem: Integration. In the YaCy Administration panel scroll down to Search Portal Integration -> Portal Configuration -> Search Box Anywhere and you'll see some boilerplate HTML generated by YaCy:

<form method="get" accept-charset="UTF-8"
    action="http://heterochromia.virtadpt.net/yacysearch.html">
  <div style="text-align:center; padding:5px; background-color:#eeeeee;
    border:1px solid #cccccc; -webkit-border-radius:5px;
    -moz-border-radius:5px; border-radius:5px; display:block; float:left;
    margin-right:5px;">
    <div style="font-family:Arial,Helvetica,sans-serif; font-size:16px;
    display:block; float:left; padding-top:3px; padding-right:5px;">
    MySearch
    </div>
    <input type="text" name="query" value="" maxlength="80" 
           style="width:300px; font-size:16px; float:left;" />
    <input type="hidden" name="verify" value="cacheonly" />
    <input type="hidden" name="maximumRecords" value="10" />
    <input type="hidden" name="meanCount" value="5" />
    <input type="hidden" name="resource" value="local" />
    <input type="hidden" name="urlmaskfilter" value=".*" />
    <input type="hidden" name="prefermaskfilter" value="" />
    <input type="hidden" name="display" value="2" />
    <input type="hidden" name="nav" value="all" />
    <div style="font-size:18px; display:block; float:right; padding-top:1px;">
      <input type="submit" name="Enter" value="Search" />
    </div>
  </div>
  <p style="clear:both;"></p>
</form>

Nice, but not really what I was after. So, I spliced the code into the local copy of my website theme anyway and set about tinkering with the bits and pieces of the HTML form parts, using the local test server (make serve) to troubleshoot. The process required editing the base.html template file because every page on my site is ultimately built on top of that file. After a lot of tinkering, site rebuilding, and page refreshes I finally settled on the following chunk of HTML:

<!-- Search -->
<section class="box search">
<form method="get" accept-charset="UTF-8" target="_blank"
    action="https://heterochromia.virtadpt.net/yacysearch.html">
<div style="padding: 0; -webkit-border-radius: 0.2em;
    -moz-border-radius: 0.2em; border-radius: 0.2em; margin: 0;">
<input type="text" name="query" value="" maxlength="80"
    style="width:180px; " />
    <input type="hidden" name="verify" value="cacheonly" />
    <input type="hidden" name="maximumRecords" value="20" />
    <input type="hidden" name="meanCount" value="5" />
    <input type="hidden" name="resource" value="local" />
    <input type="hidden" name="urlmaskfilter" value=".*" />
    <input type="hidden" name="prefermaskfilter" value="" />
    <input type="hidden" name="display" value="2" />
    <input type="hidden" name="nav" value="all" />
</div>
</form>
</section>
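
Because the form uses method="get", submitting it boils down to requesting yacysearch.html with the text box and the hidden fields as query parameters. A minimal Python sketch of that, using only the field names and values from the form above; the helper name build_search_url and the sample query are hypothetical, and a browser may percent-encode a character or two differently than urlencode does (the decoded values are the same).

```python
from urllib.parse import urlencode

# Endpoint and hidden fields copied from the search form above.
SEARCH_ENDPOINT = "https://heterochromia.virtadpt.net/yacysearch.html"
HIDDEN_FIELDS = {
    "verify": "cacheonly",
    "maximumRecords": "20",
    "meanCount": "5",
    "resource": "local",
    "urlmaskfilter": ".*",
    "prefermaskfilter": "",
    "display": "2",
    "nav": "all",
}


def build_search_url(query: str) -> str:
    """Hypothetical helper: the URL requested when the form is submitted."""
    # The visible text input is named "query"; the rest ride along unchanged.
    params = {"query": query, **HIDDEN_FIELDS}
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_search_url("yacy nginx"))
```

That also makes it easy to bookmark a search or link to one from somewhere else without going through the form.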

And much to my surprise, it worked. If you look at the search bar on my website you can search my website and the search results will open in a new browser tab. Or you can go right to my search engine directly at https://heterochromia.virtadpt.net/

Happy hacking!

* Why did I name this server Jackpoint? When I originally set up this virtual machine as a Prosody server, it was the only hostname that came to mind because I was in kind of a hurry.