Interfacing Fess with Searx.
I promise I'll explain what Fess is in a later post. I want to get this information out there in preparation.
If you haven't used Searx before, it's a self-hosted meta-search engine which queries a wide array of search engines (some of which are also self-hosted), collates the search results, and returns them as a regular search result page, an RSS feed, or a JSON API.
One of the lesser known features is that you can add your own search engines. You can either write your own (using an existing one as a template) or you can leverage one of the generic search adapters for your use case. In my case, I replaced YaCy with Fess for the purpose of indexing my research data hoard on Leandra, for reasons I'll go into later on. Suffice it to say that I spent some time figuring out how to hook Fess' JSON API into Searx using the JSON engine adapter. But first, a little bit about how it works...
The JSON search engine adapter makes an HTTP(S) request to the service you configure it for, takes the JSON document that comes back, and figures out how to extract the search results as well as the individual parts of each search result that constitute useful information. When you set up a JSON engine entry for Searx you have to suss out the useful bits. For that purpose, I would recommend using something like cURL to get the search data, an online service like beautifier.io to make the JSON document easier to read, and a JSON path explorer like jsonpath.com to pick the interesting bits out.
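If you'd rather do that exploration locally, here's a rough sketch in Python that pretty-prints a JSON document and lists a path to every value in it, which is roughly what the beautifier and the path explorer give you. The sample document and the helper name list_paths are made up for illustration; in practice you'd feed it whatever cURL pulled back from your service.

```python
import json

def list_paths(node, prefix="$"):
    """Return JSONPath-style paths for every leaf value in a parsed JSON document."""
    paths = []
    if isinstance(node, dict):
        for key, value in node.items():
            paths.extend(list_paths(value, f"{prefix}.{key}"))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            paths.extend(list_paths(value, f"{prefix}[{index}]"))
    else:
        paths.append(prefix)
    return paths

# Hypothetical, heavily trimmed sample; substitute the output of your own request.
raw = '{"response": {"status": 0, "result": [{"title": "Example", "url_link": "http://example.invalid/"}]}}'
doc = json.loads(raw)

print(json.dumps(doc, indent=2))  # easier-to-read version of the document
for path in list_paths(doc):
    print(path)
```

The printed paths tell you which key holds the list of hits and which keys inside each hit are worth mapping.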
Unfortunately, the JSON search engine adapter isn't well documented so I had to fool around with a couple of the other search engines that also use it to figure out what to do.
So, let's say that we get search results from Fess that look a little like this (trimmed way down for brevity):
{
  "response": {
    "version": 13.9,
    "status": 0,
    "q": "virtual adepts",
    "query_id": "fdde0d6f9e244dd2ad40b433b655dc46",
    "exec_time": 0.07,
    "query_time": 20,
    "page_size": 10,
    "page_number": 1,
    "record_count": 737,
    "record_count_relation": "EQUAL_TO",
    "page_count": 74,
    "highlight_params": "&hq=virtual&hq=adepts",
    "next_page": true,
    "prev_page": false,
    "start_record_number": 1,
    "end_record_number": 10,
    "page_numbers": [
      "1",
      "2",
      "3",
      "4",
      "5",
      "6"
    ],
    "partial": false,
    "search_query": "virtual adepts",
    "requested_time": 1606104020777,
    "related_query": [],
    "related_contents": [],
    "result": [
      {
        "filetype": "html",
        "url_link": "http://leandra/archive/ftp/virtual-adepts/",
        "created": "2020-11-22T10:25:50.139Z",
        "site_path": "leandra/archive/ftp/virtual-adepts/",
        "title": "Index of /archive/ftp/virtual-adepts/",
        "doc_id": "6be0595843bc49a8b345f5588e015e7e",
        "url": "http://leandra/archive/ftp/virtual-adepts/",
        "content_description": "of /archive/ftp/virtual-adepts/ ../ Cyberpunk_Fiction/...",
        "content_title": "Index of /archive/ftp/virtual-adepts",
        "site": "leandra/archive/ftp/virtual-adepts/",
        "digest": "Index of /archive/ftp/virtual-adepts/ ",
        "host": "leandra",
        "boost": "1.0",
        "mimetype": "text/html",
        "_id": "10befd67b304e3951a09f2b82a3f276e441242b...",
        "content_length": "3366",
        "timestamp": "2020-11-22T10:25:50.139Z"
      },
      ...
As you can see, we care about what's in the "result": [] array - those are our search results from Fess. That's where the tasty bits are. Logically, one of the things you have to give Searx is a URL where Fess is listening, but you'll have to modify it a bit to take arbitrary parameters from Searx. Let's look at mine:
http://localhost:8080/json?q={query}&start={pageno}&num=20&sort
- /json is the JSON search API endpoint.
- ?q={query} is the standardized search query URI.
- {query} is a template placeholder that basically means "insert search terms here."
- &start={pageno} stands for "return page pageno of search results."
- &num=20 means "return 20 results per page." This is arbitrary.
- &sort means "return search hits sorted with best seeming hits first."
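To make the placeholders a bit more concrete, here's a minimal sketch of how a URL like that gets filled in. It assumes plain Python string substitution and URL-encoded search terms; the helper name build_url is made up, and the real adapter may well compute the page value differently before substituting it.

```python
from urllib.parse import quote_plus

# Same search URL template as above.
SEARCH_URL = "http://localhost:8080/json?q={query}&start={pageno}&num=20&sort"

def build_url(search_terms, pageno):
    """Fill in the placeholders the way a simple template substitution would."""
    # Assumption: the search terms are URL-encoded before being substituted.
    return SEARCH_URL.format(query=quote_plus(search_terms), pageno=pageno)

print(build_url("virtual adepts", 1))
```

Pasting the resulting URL into cURL is a quick way to confirm that Fess answers on that endpoint before involving Searx at all.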
The rest of the information we need comes out of the search results. I've documented the salient ones below along with what they mean.
- name : Fess # arbitrary name for the Searx service
  engine : json_engine
  paging : True # this means that the search engine can return pages of results
  shortcut : fess # type !fess to only search Fess
  search_url : http://localhost:8080/json?q={query}&start={pageno}&num=20&sort
  results_query : result # search results are found in the JSON key $.response.result, which is a list. Everything else is relative to entries of this list.
  url_query : url_link # URL of the search hit is found in the JSON key $.response.result.*.url_link
  title_query : title # title of the search hit is found in JSON key $.response.result.*.title
  content_query : digest # summary of the hit is found in JSON key $.response.result.*.digest
  page_size : 20 # pages of search hits contain at most 20 entries each
  categories : files # one or more of files, general, images, it, music, news, onions, science, social media, videos
  timeout : 60 # give up if Fess doesn't respond after 60 seconds
  disabled : False # not disabled
So, if you plug the above into your searx/searx/settings.yml file and restart Searx, you'll be able to send search requests to a local Fess instance, just like any other search engine Searx knows about.
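If you want to sanity check the results_query, url_query, title_query, and content_query values before restarting anything, here's a rough sketch of the kind of extraction they describe. This is not Searx's actual code; find_key and extract_results are hypothetical helpers, and the sample is a cut-down copy of the Fess response shown earlier.

```python
import json

# Cut-down copy of the Fess response from earlier in this post.
raw = """
{
  "response": {
    "status": 0,
    "record_count": 737,
    "result": [
      {
        "url_link": "http://leandra/archive/ftp/virtual-adepts/",
        "title": "Index of /archive/ftp/virtual-adepts/",
        "digest": "Index of /archive/ftp/virtual-adepts/ "
      }
    ]
  }
}
"""

def find_key(node, key):
    """Depth-first search for the first value stored under `key`."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None

def extract_results(doc, results_query, url_query, title_query, content_query):
    """Map each hit in the results list onto url/title/content fields."""
    hits = find_key(doc, results_query) or []
    return [
        {
            "url": hit.get(url_query),
            "title": hit.get(title_query),
            "content": hit.get(content_query),
        }
        for hit in hits
    ]

results = extract_results(json.loads(raw), "result", "url_link", "title", "digest")
for hit in results:
    print(hit["title"], "->", hit["url"])
```

If a field comes back as None here, the matching key name in the engine entry is probably wrong for your version of Fess.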