I'd beg the forgiveness of my readers for not posting since early this month, but chances are you've been just as busy as I've been in the past few weeks. Life, work, et cetera, cetera. So, let's get to it.
As I've mentioned once or twice, I've been slowly getting an abscessed molar cleaned out and repaired for the past couple of months. It's been slow going, in part because infections require time for the body to fight them off (assisted by antibiotics or not) and, depending on how deep the infection runs, it can take a while. Now I can concentrate on getting the molar in front of it, which has long been a thorn in my side, er, mouth, worked on. Between being in close proximity to a rather nasty infection and the general stresses applied to molars during everyday life, the seal on the crown broke at some point, leaving it somewhat loose and making squishing sounds when I chew. I don't know the extent of the involvement, but from coming home from work wiped out just about every night I'm starting to suspect that something nasty is going on in there also; it's a pattern that I've come to recognize over the years as suggestive of an immune response. There's a good chance that this particular pain-in-the-ass is going to need major repairs and, given how little of the original tooth is left (I lost count of the number of surgeries and root canals performed on it a couple of years ago), I'm pretty much resigned to losing the tooth entirely. I'll probably wind up getting an implant in its place if it does get pulled, for the sole reason that it'll prevent the rest of the teeth in my mandible from slowly drifting to fill in the space. Of course, if I do get an implant I'll try to stick a magnet to it, and if it works I'll post the pictures.
Work has been keeping me busy lately. I've been working late hours here and there to get everything done, which hasn't left a lot of time to sit and write anything substantial, or even interesting. Not that I haven't been racking up links or anything. Since del.icio.us screwed everything up again I yanked my data out and installed a personal bookmark archive and search engine called Unmark, which has a couple of advantages and a couple of disadvantages. The major advantage is that I run it, so I don't have to worry about a company that I have no influence over or control of screwing up or vanishing, leaving me up a small moving body of water without a wooden propulsion device. In point of fact that's pretty much the entire reason for using Unmark - I don't ask much of it, and I never asked much of del.icio.us to begin with, so there really isn't a downside here. I did have to grab two files out of somebody's personal fork of Unmark so I could import my del.icio.us bookmarks (I spent a weekend trying to write a converter that would turn del.icio.us' Netscape-style bookmarks into the JSON import/export format Unmark uses and got frustrated with the inconsistency of the tag patterns), but once that was done I was up and running. I've taken down the link to my del.icio.us bookmark archive because I don't think anybody ever accessed it. At least, nobody ever told me that they went poking around in there, so I didn't see the value in keeping it around.
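For anyone curious what the first half of such a converter might look like, here is a minimal sketch, not the code I actually wrote. It assumes the export is the usual Netscape-style HTML with one `<A HREF=... TAGS="a,b">` element per bookmark, and it emits plain Python dictionaries whose keys (`url`, `title`, `tags`) are made up for illustration. It does not attempt to produce Unmark's real import format.

```python
from html.parser import HTMLParser


class BookmarkParser(HTMLParser):
    """Collect bookmarks from a Netscape-style bookmark file."""

    def __init__(self):
        super().__init__()
        self.bookmarks = []
        self._current = None  # bookmark being read, if we're inside an <A> tag

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            attrs = dict(attrs)
            # Assumption: tags are comma-separated; drop empties and stray spaces.
            raw_tags = (attrs.get("tags") or "").split(",")
            tags = [t.strip() for t in raw_tags if t.strip()]
            self._current = {"url": attrs.get("href") or "", "title": "", "tags": tags}

    def handle_data(self, data):
        if self._current is not None:
            self._current["title"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["title"] = self._current["title"].strip()
            self.bookmarks.append(self._current)
            self._current = None


def parse_bookmarks(html_text):
    """Return a list of {'url', 'title', 'tags'} dicts (hypothetical layout)."""
    parser = BookmarkParser()
    parser.feed(html_text)
    parser.close()
    return parser.bookmarks
```

Cleaning up inconsistent tag patterns would happen in the list comprehension that builds `tags`, which is exactly the part that tends to grow special cases.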
In my spare-time-slash-feel-too-lousy-to-go-out time I've also been rebuilding Leandra, whose drive arrays didn't survive the move a couple of years back. There are some technologies that I've wanted to teach myself for a while, but standing them up in a virtual machine running at a cloud hosting provider would have been cost-prohibitive (EC2 would have cost me a couple of hundred dollars a month, as would a DigitalOcean droplet large enough to hold my dataset). So I've been slowly accumulating hard drives since January of this year and I recently maxed out the RAM on Leandra's mainboard. Here's where she stands nowadays:
- 24 GB RAM
- 20TB disk space in a RAID-5 array
- SATA-3 on board
- 4TB hard drives x6
  - four data drives
  - one parity drive
  - one hotspare
(Yes, my website's theme's CSS for lists is messed up - I really need to get around to fixing that.)
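The arithmetic behind the drive list above can be sketched in a few lines. This assumes textbook RAID-5, where one drive's worth of space in the array goes to parity and hot spares sit outside the array; the function name and parameters are made up for illustration. Under those assumptions the five active drives hold 20TB raw, of which roughly 16TB is usable for data before filesystem overhead.

```python
def raid5_capacity(drive_tb, total_drives, hot_spares=0):
    """Return (raw_tb_in_array, usable_tb) for a simple RAID-5 layout.

    Assumes identical drives, one drive's worth of parity, and hot spares
    that are not part of the active array.
    """
    active = total_drives - hot_spares
    if active < 3:
        raise ValueError("RAID-5 needs at least three active drives")
    raw = active * drive_tb
    usable = (active - 1) * drive_tb  # one drive's worth is spent on parity
    return raw, usable


raw_tb, usable_tb = raid5_capacity(drive_tb=4, total_drives=6, hot_spares=1)
print(f"raw in array: {raw_tb} TB, usable: {usable_tb} TB")
```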
As some of you have probably guessed from Leandra's hardware specs, I'm going to be teaching myself about big data technologies - how to search very large volumes of data, how to explore it, how to index it, and how to make sense of it. The data set I have in mind to experiment with is my e-mail archive... nearly twenty-five years of e-mail dating back to my early high school days, including the many floppy disks of QWKmail packets from dozens of BBSes that haven't existed since 1996.ev or thereabouts (and let me tell you, the QWKmail file format is hair-raising in a couple of places...) (and by the way, if you've never watched Jason Scott's BBS: The Documentary, make the time to do so!). I've been working on a utility which parses raw RFC-822 e-mail messages, extracts only the parts that I care about, and inserts them into an Elasticsearch server where they'll be indexed and where I can run arbitrary queries against them. The search interface of Elasticsearch is basically a REST API, so as long as my code speaks REST, running queries should be a fairly straightforward process. It's going to take me a couple of weeks to uncompress and restore all of my archived e-mail, and turning all of that QWKmail into separate Maildir-compatible text files won't be easy, but once that's done I'll have a nice, big data set to experiment with.
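To give a rough idea of the parse-extract-insert pipeline described above, here is a minimal sketch using only Python's standard library. It is not the actual utility. The list of headers to keep, the field names, the index name `email`, and the `http://localhost:9200` address are all assumptions, and the `/<index>/_doc` endpoint is the one current Elasticsearch releases accept, so older servers may want a different path.

```python
import json
import urllib.request
from email import message_from_string
from email.utils import parsedate_to_datetime

# Hypothetical choice of headers worth keeping.
KEPT_HEADERS = ("From", "To", "Subject", "Date", "Message-ID")


def message_to_document(raw_message):
    """Parse one raw RFC-822 message into a dict that is ready to index."""
    msg = message_from_string(raw_message)
    doc = {h.lower().replace("-", "_"): msg.get(h, "") for h in KEPT_HEADERS}

    # Normalize the date to ISO 8601 so it can be mapped as a date field.
    try:
        doc["date"] = parsedate_to_datetime(doc["date"]).isoformat()
    except (TypeError, ValueError):
        doc["date"] = None

    # Keep only the first text/plain part and ignore attachments.
    doc["body"] = ""
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True) or b""
            charset = part.get_content_charset() or "utf-8"
            try:
                doc["body"] = payload.decode(charset, errors="replace")
            except LookupError:  # unknown charset name in an old message
                doc["body"] = payload.decode("utf-8", errors="replace")
            break
    return doc


def index_document(doc, base_url="http://localhost:9200", index="email"):
    """POST one document to Elasticsearch over its REST API (assumed URL)."""
    request = urllib.request.Request(
        f"{base_url}/{index}/_doc",
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Keeping the parsing separate from the HTTP call means the extraction logic can be checked against sample messages without a running server.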
I recently rewrote the web_search_bot/ part of my Exocortex because I got tired of having to reconfigure it every couple of days. Two of the search engines it was using as its data source, Ixquick and Startpage, merged into one functional entity, and Startpage's HTML layout seems to change in subtle ways every few days (leading to lots of false positives), so I junked it in favor of referencing a local instance of Searx, which describes itself as "A privacy-respecting, hackable metasearch engine." To put it another way, it's a search engine which can query many other search engines, collate the results from them, and present the user with a unified search results page. It also has a very user-friendly search API, meaning that I was able to remove my own parser and use Python's built-in JSON module, saving myself a lot of work. If you want to play around with it there are many public instances that you can try out, such as this one (anonymized for your paranoia). You can stand up your own instance and configure it to use only the search engines you want (though the default configuration will give you very useful search results), you can add your own search engines, and most importantly you can customize it however you like. I have an instance running on one of my Exocortex servers that has a more-or-less default configuration, acting only as a data source for web_search_bot/, and I have an instance running on Leandra which I'm using as a search provider for a couple of dozen MediaWiki sites as well as a YaCy search server running alongside Elasticsearch and Searx.
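Here is a minimal sketch of what querying a Searx instance from Python can look like. It is not the actual web_search_bot/ code. It assumes the instance has JSON output enabled and answers `GET /search?q=...&format=json` with an object containing a `results` list whose entries carry `title` and `url` keys. The function names and the base URL in the usage note are placeholders.

```python
import json
import urllib.parse
import urllib.request


def build_search_url(base_url, query):
    """Build the search URL for a Searx instance, asking for JSON output."""
    params = urllib.parse.urlencode({"q": query, "format": "json"})
    return f"{base_url.rstrip('/')}/search?{params}"


def extract_hits(response_text):
    """Pull (title, url) pairs out of a Searx JSON response."""
    data = json.loads(response_text)
    return [(r.get("title", ""), r.get("url", "")) for r in data.get("results", [])]


def search(base_url, query):
    """Run one query against the instance and return its hits."""
    with urllib.request.urlopen(build_search_url(base_url, query)) as response:
        return extract_hits(response.read().decode("utf-8"))
```

Something like `search("http://localhost:8888", "qwkmail format")` would then return a list of title and URL pairs, with no HTML scraping involved.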
I have short-term-but-not-immediate plans to add a bot for Exocortex which will accept URLs via XMPP messages and crawl the sites, effectively building a personal search engine of sites that I'm interested in or have personal use for, but it's going to take me a bit to get to that stage. Of course, when I've got it up and running I'll not only post the code but also write up a tutorial on how to set up your own, if you're interested.
What else is going on in life, you're probably asking yourself, when I'm not hacking? I was at MIT's Media Lab for a conference a couple of weeks ago, where I presented some of my work on Exocortex in its aspect of a cognitive prosthesis and not so much a data mining and information analysis engine. It was very well received, and I've got a bunch of notes typed up for things to work on, rework, and add to the system as a whole. How does a schmuck like me get to attend a conference at MIT, let alone twice? Beats me. I'm not looking a gift horse in the mouth, and I'm certainly not ungrateful for the opportunity to work alongside some very interesting people and come away with new perspectives and some new things to work with.
That's about all I've got this morning. In the very near future I'll start working my way through my collection of #blogfodder links and write a couple of articles that should be of interest to everyone. I'll also get back to writing those articles about Exocortex - promise.