Using Recoll to index my hoard.

08 February 2022

Longtime readers are probably familiar with two things: Horror stories about my dental work, and my endless quest to find search software that'll let me make sense of my data hoard (because I never delete anything). Thankfully, the former's been fairly good lately so I don't have any real complaints there. Things have improved remarkably on the latter front, too.

I've experimented off and on with a personal search engine called Recoll, which was initially designed to work alongside Linux desktop environments and was later ported to Mac OS X and Windows. It is noteworthy in that it tries to be aware of as many different file formats as possible, from plain old text files like my .plan file to Microsoft Word documents. It can uncompress archived files to index their contents and it's even smart enough to detect when a file should be OCR'd using the open source Tesseract engine. You can even use Recoll to index your e-mail, if you're old-fashioned enough to keep it on your own computer.

Recoll is also packaged by many distros of Linux natively, so you should be able to just install it normally.

Rather than just diving into the deep end and using it until I ran into all of the problems, I decided to do something different and start small. I installed Recoll on Windbringer (sudo pacman -S recoll tesseract id3lib), added the desktop app to my toolbar, and configured it to index my library (~/lib) and music collection (~/mp3). I also set it up so that Recoll's indexer (recollindex) would start up automatically by copying the .desktop file included in the package into my ~/.config/autostart directory and modifying it slightly:

[Desktop Entry]
Name=Recoll real time indexer
Comment=Runs in background to extract and index text from modified documents
Icon=system-run

# -w 60: Wait 60 seconds after startup before indexing.
# -m: Monitor the file system for new or deleted files and react accordingly.
# -c /home/drwho/.recoll: Look for configuration here.
Exec=recollindex -w 60 -m -c /home/drwho/.recoll
Terminal=false
TerminalOptions=
Type=Application
Categories=Utility;Filesystem;Database;
NoDisplay=true
X-GNOME-Autostart-enabled=true
X-KDE-autostart-after=panel
X-KDE-UniqueApplet=true

Much to my surprise it works really, really well on Windbringer. It didn't take very long to spider and build an index for the two library directories, maybe half an hour at most. I hasten to caveat this by saying that building the initial index was so fast because Windbringer has a really good solid state drive (a 2 TB Toshiba KXG50PNV2T04 NVMe). The last time I tried using Recoll, Windbringer had a fairly slow (even for the time) conventional hard drive and indexing took ages. So, point the first: Use fast storage, ideally a solid state drive. Cautiously hopeful, I installed the native OS X build on my work laptop, turned it loose with much the same configuration (only indexing work stuff), and put it through its paces over a couple of weeks. And it works pretty well there, too. So I then took the plunge and installed Recoll on Leandra.

I'll skip over the lessons learned (and there were quite a few, let me tell you) and only talk about what I've got working reliably.

First, if you've got a lot of data do not try to stuff all of it into one index. Create one index for each volume of data:

  • /home/drwho/.recoll/index-archive/ - My data archive.
  • /home/drwho/.recoll/index-Downloads/ - The directory I download stuff into before sorting through it.
  • /home/drwho/.recoll/index-lib/ - My reading and research library.
  • /home/drwho/.recoll/index-mail/ - Where my various e-mail accounts get backed up to.

This necessarily means that there are four config files and four copies of recollindex running at all times. There are also four distinct copies of the Recoll REST API server running, but I'll get to that later. What I suggest doing is this: Get just one indexer running with its own configuration directory (which is also where the index itself is kept). Also create one systemd user service file. When you've got it doing what you want (and that first index is fully built), then move on to a second one (if you need it) by copying the files that work and modifying them appropriately, and so forth. Here are just the Recoll configuration directives I changed from default for my archive's index:

topdirs = /home/drwho/archive
monitordirs = /home/drwho/archive
textfilemaxmbs = 20
textfilepagekbs = 1000
membermaxkbs = 50000
maxtermlength = 50
indexstemminglanguages = english 
mboxmaxmsgmbs = 10240
idxflushmb = 512
filtermaxmbytes = 3072
idxlogfilename = /home/drwho/.recoll/index-archive/idxlog.txt
idxrundir = /home/drwho/.recoll/index-archive/tmp
pdfocr = 1
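
The "copy what works, then modify it" step for a second index can be sketched in a few lines of Python. This is only an illustration, not part of Recoll: derive_conf and the example paths for the ~/lib index are hypothetical, and it assumes the only things that differ between two indices are the data directory and the configuration directory.

```python
def derive_conf(conf_text, replacements):
    """Return a copy of a recoll.conf with path prefixes swapped.

    replacements is a list of (old, new) pairs, applied in order to the
    value side of every "key = value" line.  Comments, blank lines, and
    keys are left exactly as they were.
    """
    out = []
    for line in conf_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and not line.lstrip().startswith("#"):
            for old, new in replacements:
                value = value.replace(old, new)
            line = key + sep + value
        out.append(line)
    return "\n".join(out) + "\n"


# Assumed layout: the working index is index-archive, the new one is index-lib.
ARCHIVE_TO_LIB = [
    ("/home/drwho/.recoll/index-archive", "/home/drwho/.recoll/index-lib"),
    ("/home/drwho/archive", "/home/drwho/lib"),
]
```

Read the working recoll.conf, pass its text through derive_conf(text, ARCHIVE_TO_LIB), and write the result into the new configuration directory. Look the output over before starting the second indexer, because size limits that suit one data set may not suit another.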

And because I have several of them, here is a .service file that will start up a recollindex daemon for ~/archive/:

[Unit]
# Use Description to document what the service file is for.  Seriously.  It's
# easy to forget and get confused.
Description=Recoll real-time document indexer: /home/drwho/archive
After=network-online.target

[Service]
Type=simple

# -m: Monitor the file system for new or deleted files and react accordingly.
# -D: Stay in the foreground to make it easier to manage with systemd.
# -x: Disable X11 session monitoring so the daemon keeps running without an
#   X server.  You want this on a server.
# -w 30: Wait 30 seconds after startup before indexing.
# -c %h/.recoll/index-archive/: Look for configuration here.  Also put the
#   index constructed here.
ExecStart=/usr/bin/recollindex -m -D -x -w 30 -c %h/.recoll/index-archive/
Restart=on-failure

# DO NOT LEAVE THIS OUT
Environment=TMPDIR=/home/drwho/tmp

[Install]
WantedBy=default.target

So, what's up with that really important "Environment" line?

In .service files, the "Environment" directive means "when you spawn the process you're responsible for, set an environment variable with the name and value given." We do this here because when documents are being OCR'd the data needs to be written to a temporary directory, /tmp by default. The problem here is that for a lot of Linux distros (Arch in particular) the /tmp directory is actually a tmpfs that only exists in memory. If you have a respectable amount of data that needs to be OCR'd the last thing you want is to run out of RAM. This caused no end of trouble for Leandra until I figured out how to specify a directory to write tempfiles to. That is also how I found out that there's a bug where the temp directories that hold the OCR files (e.g., /home/drwho/tmp/rclmpdfgu5_9nix/) aren't erased when the OCR job is done, so you have to keep an eye on the temp directory and delete the oldest ones (automating this is left as an exercise to the budding system administrator). To give you an idea of how much disk space that can take up:

{17:09:52 @ Sun Jan 23}
[drwho @ leandra:() ~]$ du -sch tmp/rcl*
9.6G    tmp/rclmpdf0gr4dy71
18G     tmp/rclmpdf0hj5smmc
23G     tmp/rclmpdf0r43pn3f
...
5.7G    tmp/rclmpdfzr43xpff
4.0K    tmp/rcltmpfc3jNPo.gif
7.5T    total

(That's nowhere near the biggest my temp directory has been.)
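
For the budding system administrator, here is one way the cleanup could be automated. It is a minimal sketch and not anything that ships with Recoll: the function names are made up, the rclmpdf* pattern is taken from the directory names in the listing above, and the two-day threshold is an assumption. A directory that an OCR job is still writing to should have a recent modification time, but pick a threshold comfortably longer than your slowest job just in case.

```python
import shutil
import time
from pathlib import Path


def stale_ocr_dirs(tmp_root, max_age_days=2.0, pattern="rclmpdf*"):
    """List leftover OCR directories that haven't been modified recently."""
    root = Path(tmp_root)
    if not root.is_dir():
        return []
    cutoff = time.time() - max_age_days * 86400
    return sorted(
        path for path in root.glob(pattern)
        if path.is_dir() and path.stat().st_mtime < cutoff
    )


def purge(tmp_root, max_age_days=2.0, dry_run=True):
    """Delete stale OCR directories.  With dry_run=True, only report them."""
    stale = stale_ocr_dirs(tmp_root, max_age_days)
    if not dry_run:
        for path in stale:
            shutil.rmtree(path, ignore_errors=True)
    return stale
```

Calling purge("/home/drwho/tmp") only reports what it would delete. Once the list looks right, run it with dry_run=False from a cron job or a systemd timer.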

Something I learned the hard way is that if you're storing data on conventional hard drives the OCR and indexing process is going to take time. How much time depends on the volume of data, how fast your CPU is, and most importantly how fast your hard drives are. This is why I recommend building only one index at a time, and partitioning the indices by... volume of data, I guess. When I set up the indices on Leandra I started with ~/archive/ first, and when that was done I did ~/lib/, and so forth. As with managing any significant volume of data the initial operation takes a very long time because you're starting from scratch. Once the index is built keeping it updated takes much less time. If you're going to do it right you need to be patient.

I think the kicker here was that I have Leandra's drives set up as a RAID-1 array. As you may or may not know, the operational properties of the various RAID levels differ because they use drives differently. In the case of RAID-1, reads from disk are very fast because when there are two copies of a file on two different drives the OS' kernel picks the drive that's the least busy to read from. However, when it comes to writing files to a RAID-1 the kernel has to write two copies on two different drives. This takes more time than it would on a single device. Additionally, because computers these days multitask by default other drive reads and writes are always happening and the additional disk activity bogs everything down. So, when you consider a process that's reading potentially thousands of files from a drive array in close succession, writing index files to disk, possibly running several copies of an OCR engine that are writing their own tempfiles to disk and reading them back for analysis... you've got time on your hands.

All told, it took about two weeks to build the indices. To give you a sense of why it took so long, here's how big just the indices are:

{22:13:02 @ Sat Jan 22}
[drwho @ leandra:() .recoll]$ du -sch */xapiandb
4.8G    index-Downloads/xapiandb
48G     index-archive/xapiandb
17G     index-lib/xapiandb
26G     index-mail/xapiandb
95G     total

Now the question to answer is "Okay, smartass cyborg - how do I use this to search my stuff?"

Recoll has an officially supported web interface written in Python which not only lets you run searches from your web browser but also has its own REST API so other software can interact with it. Here's where my favorite meta-search engine, Searx, comes in, because it supports the official Recoll web interface's API as a first-class citizen. I can't say "no hackery" but "very little hackery" was involved to get it working. Here's how I did it:

First, install recoll-webui per the instructions. Then create yet another systemd* service file which both starts and configures the recoll-webui (because it doesn't use a configuration file, and maintaining multiple hacks of the same server is a pain). For example, here's the ~/.config/systemd/user/recoll-webui-archive.service file I use on Leandra:

[Unit]

# Again, document your .service files.
Description=Recoll Search web UI - /home/drwho/archive
After=network-online.target

[Service]
Type=simple

# This is where you tell recoll-webui where the index is.  You don't have to
#   put double quotes around the variable, but you can.
Environment="RECOLL_CONFDIR=/home/drwho/.recoll/index-archive/"

# -a localhost: Only listen on the loopback interface.
# -p 8082: Listen on port 8082/tcp for connections.  recoll-webui defaults to
#   port 8080/tcp.  Start there (if that port's not in use for something else)
#   and increment by one for each copy of the server you start.
ExecStart=/home/drwho/recollwebui/webui-standalone.py -a localhost -p 8082
ExecStop=/bin/kill -SIGINT $MAINPID
KillMode=process
Restart=on-failure

[Install]
WantedBy=default.target

As one might reasonably expect, I have four copies of recoll-webui running on four different ports.

  • recoll-webui-lib.service - port 8080/tcp
  • recoll-webui-mail.service - port 8081/tcp
  • recoll-webui-archive.service - port 8082/tcp
  • recoll-webui-Downloads.service - port 8083/tcp
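
Since the four unit files differ only in the index name and the port, they can be generated from a template instead of edited by hand. What follows is a rough sketch under the assumptions of this post (the same home directory, the same webui-standalone.py path, and ports counting up from 8080). The render_units helper is hypothetical, and the Description line uses the index name rather than the data directory.

```python
from string import Template

# "$$" is how string.Template spells a literal "$", so systemd still sees
# $MAINPID in the rendered file.
UNIT = Template("""\
[Unit]
Description=Recoll Search web UI - index-$name
After=network-online.target

[Service]
Type=simple
Environment="RECOLL_CONFDIR=$home/.recoll/index-$name/"
ExecStart=$home/recollwebui/webui-standalone.py -a localhost -p $port
ExecStop=/bin/kill -SIGINT $$MAINPID
KillMode=process
Restart=on-failure

[Install]
WantedBy=default.target
""")


def render_units(names, home="/home/drwho", first_port=8080):
    """Map unit file names to unit text, one port per index, in order."""
    return {
        f"recoll-webui-{name}.service": UNIT.substitute(
            name=name, home=home, port=first_port + offset
        )
        for offset, name in enumerate(names)
    }


units = render_units(["lib", "mail", "archive", "Downloads"])
```

Each value in units is the text of one unit file, ready to be written into ~/.config/systemd/user/ under its key. Keeping the list in a fixed order is what keeps the port assignments stable from one run to the next.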

Now to configure Searx. As you saw earlier there are a few configuration stanzas for Recoll in Searx's settings.yml file ready to go. What I recommend is making a copy of one of them (I went with the first one), uncommenting it, and configuring it to communicate with one of your copies of recoll-webui. When you've got it working, copy the stanza and modify it to work with a second one (if appropriate), and so on. I'll explain what the new lines mean in comments because a few of them are counter-intuitive.

# Give the stanza a unique name.  It has to be in all lowercase for some
#   reason.
  - name: recoll - library
    # This is the Searx engine to use.  It corresponds to
    #   searx/searx/engines/recoll.py
    engine: recoll
    # The URL that a copy of recoll-webui listens on.
    base_url: 'http://127.0.0.1:8080/'
    # Leave this empty.  You want Searx to search the entire domain.
    search_dir: ''
    # Location in the file system where the indexed files live.
    mount_prefix: /home/drwho/lib
    # If you want to access the files with your web browser, you need to
    # configure your web server to serve them up.  This is the URL to access
    # the datastore.  If you don't set this, you won't be able to actually
    # access any of the files, you'll have to log into the machine.
    dl_prefix: 'http://leandra/lib'
    # The ! command that tells Searx to only consult this search engine.
    shortcut: library
    # How long to wait in seconds for a search request to complete or fail.
    timeout: 120.0
    # You can set your own categories in Searx!
    categories: recoll
    # If you set this to True this makes it much easier to turn search sources
    # off without commenting them out.  You want it set to False, though.
    disabled: False
    # Searx always tries to use HTTPS to contact search engines.  Because
    # recoll-webui is running on the same machine and only listening on the
    # loopback, it's safe to use plain HTTP.
    enable_http: True

Of course, modify any of these lines as appropriate for your setup.

If you go to the Preferences page of your Searx install, on the Engines tab you should see a new search category called "recoll" that lists all of the Recoll web UIs that you configured.

Fun fact: You can create new categories in Searx just by naming them.

Now that you've done that, you should be able to use a DuckDuckGo-style bang shortcut (here, !library followed by your search terms) to run a search on just one of your Recoll instances. If results come back, congratulations, you just searched your library's Recoll index using Searx.
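
If a bang search comes back empty, it can help to take Searx out of the picture and ask a recoll-webui instance directly. The sketch below assumes the web UI answers on a json path with query and page parameters, which is my understanding of what Searx's recoll engine talks to. Treat the endpoint and parameter names as assumptions and check them against your copy of recoll-webui. The function names are made up for this example.

```python
import json
from urllib.parse import urlencode, urljoin
from urllib.request import urlopen


def json_search_url(base_url, query, page=1):
    """Build the URL for a JSON search against one recoll-webui instance."""
    # Assumed endpoint: <base_url>/json?query=...&page=...
    return urljoin(base_url, "json") + "?" + urlencode({"query": query, "page": page})


def search(base_url, query, page=1, timeout=120.0):
    """Run the search and return the decoded JSON response."""
    with urlopen(json_search_url(base_url, query, page), timeout=timeout) as resp:
        return json.load(resp)
```

For example, search("http://127.0.0.1:8080/", "tesseract") would query the library index from the list above. If that works and Searx still shows nothing, the problem is most likely in the Searx stanza and not in Recoll.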

Of course, and I couldn't think of anywhere else in this post to put it, use the systemd/User instructions I linked to earlier to start all of the services in series, from the top of this post to the bottom. I really hope you read all of the instructions before you sit down to spin things up.**

Now, go forth and make things happen.


* Why am I not calling it systemfail like I usually do? I'm tired and I want to write an article that's helpful and as un-confusing as possible. It took me quite a while to figure all of this out and if I can help other folks do this more easily, I want to. I'll resume my intense dislike of systemd later.

** Secret hacker trick #1: Read the manual first.