Combining Manticore and Searx.

13 May 2022

Difficulty: Advanced.

One of these days I'll get around to doing a writeup of an indispensable part of my exocortex, Wallabag. I used it to replace my old paywall breaker program, largely because pumping random articles from the web into a copy of etherpad-lite was janky as hell and did not make for a good user experience. To put it another way, when you're looking for a particular thing in your archive, it's a huge time sink to go through and edit each saved document because it's a single huge line of text. At least Wallabag saves copies of pages that look like you tapped the reader view icon in your browser, which makes quite a few pages much more readable, if nothing else.

But I'm tired, and I digress.

One of my complaints about Wallabag is that the search function is not only slow but highly inaccurate. If you dig around in the source code a little you'll find, somewhere in the Doctrine migrations1, where the table wallabag_entry is set up. Here's the bit that's interesting:

mysql> explain wallabag_entry;
+------------------+-------------+------+-----+---------+----------------+
| Field            | Type        | Null | Key | Default | Extra          |
+------------------+-------------+------+-----+---------+----------------+
...
| content          | longtext    | YES  |     | NULL    |                |
...

It's not actually Wallabag that's to blame, it's MySQL. The actual content of each page you store in Wallabag goes into the field content, which is of type longtext (basically a data type you can throw up to four gigabytes of data into). Unfortunately, searching longtext columns in a MySQL database is really, really slow. Maddeningly so. Even on the professionally tuned and maintained hosted database server I use. We're talking "start a search and go to the bathroom" slow.

Okay, maybe not that slow, but it's still annoying.
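To see why it's slow, consider what a substring search over a longtext column actually has to do. A query along these lines (hypothetical, not Wallabag's exact SQL) can't use an ordinary index, so MySQL has to read and scan the full text of every single row:

```
SELECT id, title
FROM wallabag_entry
WHERE content LIKE '%manticore%';
```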

I started thinking about this problem a couple of months ago2 and realized that it reminded me of when I used to admin a couple of Mediawikis a few jobs back. The users had the same problem with the search function, so I set up a couple of copies of a search engine called Sphinx, which logged into the database, built its own indices using a configuration file I'd designed, and replaced Mediawiki's search function with its own via a pretty nifty plugin. There's just one problem these days: Sphinx isn't really open source anymore, and I don't know if their "delayed FOSS" license has ever actually come to anything. So, back to the drawing board.

Just after Sphinx went weird, a couple of folks forked the source code and renamed it Manticore back in 2017.ev. They've since done an incredible amount of work on it, not only fixing a great many bugs in the original Sphinx software but also improving it dramatically. And, of course, it's still open source. So, I used the official installation instructions to set it up on the server in question.

I need to do a little expectation management: This wound up being a somewhat challenging project because I kept running into gotchas left and right. I'm going to document those gotchas because, weirdly, they're not clearly documented. To put it another way, even after reading the docs a few times start to finish I still had to do a fair amount of text excavation to figure out just why things weren't working the way they should. I also had a hell of a time putting together a configuration file because the individual settings appear well documented but the overall structure of the configuration file is not.

First we need to figure out what parts of the database to index. I won't reprint the entire schema for the wallabag_entry table because it's pretty lengthy and most of it is irrelevant. Here are the columns that are important:

  • id
    • Manticore requires that everything it indexes have a unique numerical identifier.
    • The data type here is BIGINT; even as a signed value that tops out at 2^63-1, which should be plenty.
  • title
  • url
    • The original URL archived.
  • content
  • archived_at
    • The date the article was grabbed by Wallabag.

Given all of this data we can now configure Manticore to index the Wallabag database. The config file is pretty complex so I'm not going to quote the whole thing; it makes more sense to link to it (Github) (Gitlab) (git.hackers.town) and go over only the most relevant parts. It took me quite a bit of trial and error to put together a skeleton of a config file, and not a few failures to get a real-time index up and running, so I'd like to help by publishing usable boilerplate.

The first thing to keep in mind is that there are three ways to interact with searchd:

  • port 9306/tcp - Connect with a MySQL client
    • Good for troubleshooting because you can run SQL queries against your Manticore index, just like it was your database.
  • port 9308/tcp - Connect with anything that speaks HTTP(S) and treat it like a REST API
    • Using this one is my personal recommendation.
  • port 9312/tcp - Connect with Manticore's command line tools
    • Without this, CLI tools like indextool, spelldump, and wordbreaker won't work.
    • Just leave it turned on.

I configured all three of these so I had all of my bases covered.
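For example, assuming searchd is running on the same box, the first two interfaces can be poked at like this. These are sketches, not gospel: the index name articles doesn't exist until later in this article, and the URL and query shape follow Manticore's JSON API as I understand it:

```shell
# Troubleshooting over the MySQL protocol (port 9306):
mysql -h 127.0.0.1 -P 9306 -e 'SHOW TABLES;'

# The same server over the HTTP/JSON API (port 9308):
curl -s http://127.0.0.1:9308/search \
    -d '{"index": "articles", "query": {"match": {"*": "manticore"}}}'
```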

In a Manticore index, every keyword has two buffers: one that lists which database entries the keyword was found in, and one that records where in each row every instance of the keyword was found. read_buffer_docs is the size of the former and read_buffer_hits is the size of the latter. I set both of these to 512k to compensate for latency between my server and the hosted database server. You might not have to do this.

client_timeout is how long searchd will wait in between searches before letting the connection to the search utility (whatever it is) drop. It defaults to five minutes but I set it to one second (1s) to reflect how infrequently searches would be run. In other words, I won't be running searches every few seconds but when I do run them I don't care if searchd goes back to what it was doing before I pestered it.

net_workers is the number of threads searchd will have handy to handle incoming network requests to the REST API. It defaults to 1 but I changed it to 12, one for each CPU in my server. Threads are lightweight so it doesn't hurt to have a few more than you need hanging around.
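Pulling those settings together, the searchd section of my configuration file looks roughly like this. The listen lines cover the three ports described above; logging and PID file settings are omitted, and your values will probably differ:

```
searchd
{
    listen           = 9306:mysql41
    listen           = 9308:http
    listen           = 9312

    read_buffer_docs = 512k
    read_buffer_hits = 512k
    client_timeout   = 1s
    net_workers      = 12
}
```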

Next is a block of configuration data called source, where you define how to contact the database server, the SQL queries that extract the information you care about, and which pieces of the queried information satisfy what searchd requires at a minimum. The one I use looks something like this:

source wallabag
{
    type            = mysql
    sql_host        = my-hosted-database-server.at.digitalocean.com
    sql_user        = joey
    sql_pass        = lovesexsecretgod
    sql_db          = wallabag
    sql_port        = 31337
    sql_query_pre   = SET CHARACTER_SET_RESULTS=utf8
    sql_query_pre   = SET NAMES utf8
    sql_query       = SELECT id AS article_id, title, url, content, UNIX_TIMESTAMP(archived_at) AS date_added FROM wallabag_entry
    sql_attr_uint   = article_id
    sql_attr_timestamp   = date_added
}

type is the type of database searchd is connecting to; I use MySQL. sql_host is the hostname of the database server, and sql_port is the network port to connect to. sql_user and sql_pass are pretty self explanatory. sql_db is the name of the database in the database server. The sql_query_pre lines (of which there can be more than one) are SQL commands that are executed before anything else. The two I have above tell MySQL to return text in the UTF-8 character set because that's what searchd expects.

sql_query is the actual query that searchd runs to get the text out of the database to index. As stated earlier, there are only a few columns of the table wallabag_entry that we care about. The date and time are converted into a time_t value because that's the only format searchd understands (plus it's portable across multiple platforms); searchd also expects a particular field in the results called date_added, which we make happen by renaming archived_at with AS. The unique ID code of the saved article is stored as sql_attr_uint (a 32-bit unsigned integer) because we'll need it to build the URL to click on to read the actual stored article in Wallabag. Similarly, sql_attr_timestamp explicitly says that the date_added field is a UNIX timestamp.

The final configuration block is index, which defines the parameters of the actual search index.

index articles
{
    type            = plain
    source          = wallabag
    path            = /var/lib/manticore/articles/articles
    stored_fields   = article_id, title, url, content
}

The index is named articles and is a plain index, which means that the index can't be changed, only rebuilt. This isn't a big deal for reasons I'll go into later.3 source is the name of the source block I just talked about. A Manticore configuration file can have multiple sources and indices. Also, a single source can have more than one index. path is a path in the file system where the index files will live. stored_fields is a list of fields returned by the SQL query in source that will actually be kept in the index file. We want this because when you search for something it's nice to get more than a message amounting to "I found it!" back.
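Because it's a plain index, you build (and rebuild) it with Manticore's indexer utility rather than through searchd itself. Assuming the stock location of the config file, the initial build looks something like this; if searchd is already running, add --rotate so the new index files get swapped in without a restart:

```
indexer --config /etc/manticoresearch/manticore.conf articles
```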

Next, we need some way of running a search and returning the data to Searx. I wrote a shell script to do this. (Github) (Gitlab) (git.hackers.town) It depends upon two utilities that must be installed on your server: cURL, to send HTTP requests to searchd, and jq, to pick apart the JSON that searchd returns with your search results. This was the hardest part, incidentally, because jq isn't an easy or intuitive tool to use. Broadly, searx-search_manticore.sh does five things:

  • build a search request
  • send the search request to searchd and capture the output
  • sort the returned search results from best to worst
  • extract up to the first 20 search results
  • format the search results in a way that's easier for Searx to work with
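Strung together, those five steps amount to something like the following sketch. This is not the actual script; the endpoint, index name, and JSON field paths are assumptions based on Manticore's HTTP API, so check the repository for the real thing:

```shell
#!/bin/sh
QUERY="$1"

# Build the search request and send it to searchd, capturing the output.
RESPONSE=$(curl -s http://127.0.0.1:9308/search \
    -d "{\"index\": \"articles\", \"query\": {\"match\": {\"*\": \"$QUERY\"}}}")

# Sort the hits from best to worst, keep at most the first 20, and emit
# one ';;;'-delimited line per hit in the format Searx expects.
echo "$RESPONSE" | jq -r '.hits.hits
    | sort_by(._score) | reverse | .[0:20] | .[]
    | "\(._id);;;\(._source.title);;;\(._score);;;\(._source.url)"'
```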

The reason I had to write a tool is because of a shortcoming in Searx that I haven't fixed yet due to a lack of time. searchd's HTTP API only accepts POST requests (wherein parameters are sent in the body of the HTTP request) but Searx can only make HTTP GET requests right now (where parameters are part of the URL contacted). Major impedance mismatch. While you can run MySQL queries directly on searchd and Searx does have a MySQL database connector, the problem here is that whenever the MySQL connection drops it doesn't come back up without restarting Searx. I don't know if this is an artefact of Manticore or Searx or what, only that it's a dealbreaker.

The third piece of the puzzle is to configure Searx so that it has a new search engine available (Manticore) and knows how to interpret what it gets back. This is a two part process: editing the configuration file and adding an HTML template to display the output. I wrote a shell script to do the actual work of querying searchd, so we're going to use Searx's command line engine to run it and reformat the output. It took some trial and error to figure out because the documentation needs improvement (which, again, I haven't had time for), but the gist of it is this: The command line engine can run a command located anywhere on the system. However, any file arguments you pass to that command are limited to wherever you have Searx installed, so if you play around with the examples in settings.yml they will only be able to reach files inside your checkout of the Searx source code (for example /home/pi/searx). It's a little annoying and I have a bug ticket open about it, but so far nobody seems to have noticed.

Here's what the configuration stanza for Manticore looks like:

  - name: search manticore
    engine: command
    command: [ '/home/pi/exocortex-halo/scripts/manticore-and-searx/searx-search_manticore.sh', '{{QUERY}}' ]
    shortcut: wallabag
    tokens: []
    disabled: False
    timeout: 120.0
    delimiter:
        chars: ';;;'
        keys: [ 'article_id', 'article_title', 'article_score', 'article_url' ]
    result_template: 'wallabag.html'

Here, the {{QUERY}} bit represents the search term you supplied. I gave it the shortcut "wallabag" because that's what the search pertains to. I recommend making your shortcuts logical and reflective of what they're searching but do whatever makes sense to you. I made the timeout a generous 120 seconds because I'd rather have a long potential wait that I might never need than an overly optimistic shorter delay that might actually run out.

This configuration snippet is also in my Git repositories. (Github) (Gitlab) (git.hackers.town)

The delimiter part defines how Searx will interpret the output from the command. Searx seems to require that each search result the command returns fit on a single (possibly very long) line; zero or more such lines can be returned. Each line consists of one or more fields, and each field has a unique name. Those fields get plugged into the HTML template referenced by the result_template option. Because each result has to be a single line you have to pick a short string that separates the fields. In theory this can be anything, but the default value of chars is a single space. That didn't really work for my use case so I made up my own, ;;;, which is a string highly unlikely to show up in any articles I have saved in Wallabag. Each line is chopped up on that string and the pieces are mapped, in order, onto the names in keys, like this:

31337;;;This is the title;;;999;;;https://example.com/original/url.html

which Searx interprets as:

article_id = 31337
article_title = This is the title
article_score = 999
article_url = https://example.com/original/url.html
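You can sanity-check that mapping without involving Searx at all by splitting a sample line the same way from the shell; the values here are made up:

```shell
# One hypothetical result line, fields in the same order as the keys list
# in settings.yml (article_id, article_title, article_score, article_url).
line='31337;;;This is the title;;;999;;;https://example.com/original/url.html'

# Searx splits on the delimiter string and assigns fields to the keys in
# order; awk's -F option does the same multi-character split.
printf '%s\n' "$line" | awk -F ';;;' '{
    print "article_id    = " $1
    print "article_title = " $2
    print "article_score = " $3
    print "article_url   = " $4
}'
```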

Almost done. Now we just need an HTML file for Searx to use to show us search results. I made a copy of the searx/searx/templates/oscar/result_templates/key-value.html file and modified it so that it would show search results from Wallabag in a way that's halfway pleasing to the eye. The file goes into the searx/searx/templates/oscar/result_templates/ directory.4 This template file is also in my Git repositories. (Github) (Gitlab) (git.hackers.town)

Once you have everything in place, restart Searx and run a search. If you did everything right and you actually have a few relevant articles in your database (when writing this article a few of my test searches really didn't exist, which gave me a bit of a scare) you should have a few links to stuff in your Wallabag install.

There is one more thing to take into account: keeping your Manticore index up to date. I wrote a small shell script (Github) (Gitlab) (git.hackers.town) that rebuilds the index at least once every 24 hours, depending on how you set up the cron job that executes it. While updating an entire index seems like a big deal, Manticore is amazingly efficient: on my server, fully indexing over 20,000 articles of varying sizes takes less than 30 seconds.
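If you'd rather skip the script, a cron job that reruns indexer with --rotate gets you the same nightly rebuild; the schedule and config path here are just examples:

```
# Rebuild all Manticore indices at 04:17 every morning and hand them
# to the running searchd without a restart.
17 4 * * * indexer --config /etc/manticoresearch/manticore.conf --all --rotate >/dev/null 2>&1
```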

That's about it. This isn't a 101-level procedure. I've tried to make it easier on folks by publishing the stuff I wrote but ultimately you have to be pretty comfortable with working on a Linux box and have a certain amount of knowledge of SQL, shell scripting, Searx, and generally tinkering with Linux machines. That was why I added all the relevant stuff to my exocortex Git repositories. (Github) (Gitlab) (git.hackers.town)

1 I'd love this to turn into a rant about how much I hate the idea of database migrations, in part because it confuses "moving your database to another server" with "updating the database schema" needlessly, and in part because it turns trying to figure out what the hell happened to your database into a game of Clue. In this particular case, it has to do with figuring out where in the source code the various setup operations are written. Because putting elementary operations like "set up my database" into a file called wallabag/src/Wallabag/CoreBundle/Command/InstallCommand.php is so much more intuitive than having a bunch of files with the extension .sql.

2 I've got a lot going on right now.

3 It's also the only index type that I got working after a few days of messing around.

4 If you don't use the Oscar theme you're on your own.