Website file integrity monitoring on the cheap.

30 May 2017

A persistent risk for websites is the possibility of somebody finding a vulnerability in the CMS and backdooring the code so that commands and code can be executed remotely.  At the very least it means that somebody can poke around in the directory structure of the site without being noticed.  At worst, it would seem that the sky's the limit.  In the past, I've seen cocktails of browser exploits injected remotely into a site's theme that try to pop everybody who visits the site, but that is by no means the nastiest thing that somebody could do.  This raises the question: how would you detect such a thing happening to your site?

I'll leave the question of logfile monitoring aside, because that is a hosting-situation-dependent thing and everybody has their own opinions.  What I wanted to discuss was the possibility of monitoring the state of every file of a website to detect unauthorized tampering.  There are solutions out there, to be sure - the venerable Tripwire, the open source AIDE, and auditd (which I do not recommend - you'd need to write your own analysis software for its logs to determine which files, if any, have been edited, plus it'll kill a busy server faster than drilling holes in a car battery).  If you're in a shared hosting situation like I am, your options are pretty limited: you're not going to have the access necessary to install new packages, and you might not be able to compile and install anything in your home directory.  However, you can still put together something that isn't perfect but is fairly functional and will get the job done, within certain limits.  Here's how I did it:

Most file monitoring systems store cryptographic hashes of the files they're set to watch over.  Periodically, the files in question are re-hashed and the outputs are compared.  If the resulting hashes of one or more files are different from the ones in the database, the files have changed somehow and should be manually inspected.  The process that runs the comparisons is scheduled to run automatically, while generation of the initial database is normally a manual process.  What I did was use command line utilities to walk through every file of my website, generate a SHA-1 hash (I know, SHA-1 is considered harmful these days; my threat model does not include someone spending large amounts of computing time to construct a boobytrapped index.php file with the same SHA-1 hash as the existing one; in addition, I want to be a good customer and not crush the server my site is hosted on several times a day when the checks run), and store the hashes in a file in my home directory.

Here are the commands I use to build (and periodically update) the database:

find -type f -not -path "*" \
    -exec sha1sum {} \; > ~/
chmod 0400 ~/
  • The find command runs through the entire directory structure I give it, which contains my website.
  • The -not -path option makes find skip everything in the cache directory tree, because that is where the CMS I use caches the HTML pages it generates.  This skips several thousand files that change every hour or so, which would otherwise result in a mountain of false positives.
  • For every file that matches, the sha1sum utility is run on it, and the output is stored in a file in my home directory.
  • The chmod command makes the hash database read-only as a precaution; the checking script below treats any change to those permissions as a warning sign.

(Ye gods, I need to fix the CSS for bulleted lists in my theme, that's a mess.)

To check the integrity of my website, a cron job runs a couple of times a day that re-hashes everything in the site.  The database is formatted in such a way that the sha1sum utility can read it, look up each file path, and verify that the stored hashes still match.  I have this code in a file called


# See if the database exists.  If it doesn't, ABEND.
if [ ! -f "$DATABASE" ]; then
    echo "ERROR: File hash database missing.  Terminating."
    exit 1
fi

# Peek at the permissions on the database.  If they're not 0400, something's
# wrong.
if [ "$(stat -c '%a' "$DATABASE")" != "400" ]; then
    echo "ERROR: Permissions on file hash database have changed.  Terminating."
    exit 1
fi

echo "Checking integrity of files."
sha1sum -c "$DATABASE" | grep -v 'OK$'

exit 0
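The cron schedule itself can be a single crontab entry along these lines (the times, address, and script path below are illustrative, not my actual setup).  Because cron e-mails any output a job produces, and the script only prints hashes that fail to verify, mail arrives only when something has changed:

```
# Run the integrity check at 06:00 and 18:00.  cron mails the job's output,
# which is empty unless one or more files failed to verify.
MAILTO=webmaster@example.com
0 6,18 * * * /home/example/bin/check-website-integrity.sh
```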

Files whose hashes match produce the output "OK" and are filtered out by the grep.  Files that don't match are collected and e-mailed to me automatically.  This is what some of the entries in my file hash database look like as of my writing this article.  They're pretty intuitive:
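Each line is standard sha1sum output: a 40-character hexadecimal hash, two spaces, then the file's path relative to where find was run.  (The entries below are representative fakes - the paths are invented and the hashes are well-known test values, not hashes from my actual database.)

```
da39a3ee5e6b4b0d3255bfef95601890afd80709  ./index.php
a9993e364706816aba3e25717850c26c9cd0d89d  ./theme/base/theme.yml
```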


I would be remiss if I did not state that this approach has a couple of drawbacks, chief among them that it doesn't detect new files added to my site.  This means that a brand-new file could be added someplace (like a PHP webshell) and I wouldn't know it.  Because there are only so many webshells out there, and everybody rips off everybody else's, it would be fairly easy to store a database of hashes of known webshells and check for their presence every time the site check runs.  This approach also means that there could be a free-for-all happening in the app/cache/ directory and I wouldn't necessarily know about it.  Because files stored in cache directories all have consistent filenames and locations for speed and efficiency (that's what they're for, after all), it would be fairly easy to automatically compare all of the files in the on-disk cache with the known and documented naming scheme for Bolt CMS and pick out anomalies.
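New-file detection could be bolted on with the same command line tools already in use.  The sketch below is one possible approach rather than a script I actually run - the function name and the paths in the usage comment are invented - and it works by diffing the paths recorded in the hash database against a fresh find listing, printing anything that has appeared since the database was built:

```shell
#!/bin/sh
# Hypothetical sketch: report files that exist on disk but are absent from
# the sha1sum hash database, i.e. files added since the last rebuild.
# Usage: new_files /path/to/hashes.db /path/to/website
new_files () {
    database="$1"
    sitedir="$2"
    # Database lines look like "<hash>  <path>"; keep only the path column.
    cut -d ' ' -f 3- "$database" | sort > /tmp/known.$$
    # List every file currently on disk, using the same relative paths.
    ( cd "$sitedir" && find . -type f | sort ) > /tmp/current.$$
    # comm -13 prints lines unique to the second listing: new, unhashed files.
    comm -13 /tmp/known.$$ /tmp/current.$$
    rm -f /tmp/known.$$ /tmp/current.$$
}
```

Run from cron alongside the hash check, any output (and therefore any e-mail) means a file showed up that the database has never seen.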