As the title of this post implies, I've been working on some stuff lately that's been taking up enough compute cycles that I haven't been around to post much. Some of this is due to work, because we're getting into the really busy time of year and when I haven't been at work I've been relaxing. Some of this is due to yet another run of dental work that, while it hasn't really been worth writing about has resulted in my going to bed and sleeping straight through until the next day. And some of it's due to my hacking on a new project that wound up being... not as hard as I'd imagined it would be, but there certainly has been a steep learning curve.
Let's say that you have a library of several thousand albums, CDs, recordings, and other forms of media amounting to a couple of thousand files. Like any self-respecting mp3 collection it's questionably curated - the ID3 tags are inconsistent or missing, sometimes the album and song title are in the same field when they shouldn't be, sometimes all of the metadata is in the filename... you get the picture. However, let's say that you have a fragmentary database compiled by another piece of software that tries to be helpful and most of the time succeeds. By way of illustration, you have access to a number of data structures that categorize and index your media as best it can so that internally it looks like this:
genres = [ "worldbeat", "gothic", "techno", "heavy metal", "classical" ] artists = [ "Delhi 2 Dublin", "The Sisters of Mercy", "Psykosonik", "KMFDM", "Beethoven" ] albums = [ "Delhi 2 Dublin", "Floodland", "Psykosonik", "Symbols", "Beethoven's Ninth Symphony" ] songs = [ "Dil Nachde", "This Corrosion", "Silcon Jesus", "Megalomaniac", "Ninth Symphony" ] videos = [ "Liquid Sky", "Johnny Mnemonic", "Scent of a Woman", "Bladerunner", "Akira" ]
The question is, how do you search through the indices to see if you do or don't have something? People who've tried their hand at developing for Kodi have already recognized the situation and figured out where this is going - you can get Kodi to dump parts of its media database through its API but there isn't a way to search it directly. You have to do that part yourself.
So, in the interest of this example, let's do it the obvious way in Python:
>>> if "worldbeat" in genres: ... print "You have worldbeat music in your library." ... You have worldbeat music in your library.
Whee. But you and I both know that it's never that simple. For the purposes of this discussion, let's throw a few more constraints in: You can't rely upon being able to type accurately, and you don't necessarily know ahead of time if your search term is even present. The former case looks like this, and as often as not is due to autocomplete trying to be helpful:
>>> if "w0rldbear" in genres: ... print "You have worldbeat music in your library." ... else: ... print "Nope." ... Nope.
The latter case is part and parcel of having so much stuff, you don't really know what you have anymore:
>>> if "hard rock" in genres: ... print "You have hard rock music in your library." ... else: ... print "Nope." ... Nope.
Seasoned coders have no doubt twigged to a third problem: Most of the time, search algorithms are case sensitive, so searching for "Hard Rock" when you have "hard rock" in your media database is also going to result in "Nope."
Moreover, basic search algorithms are pretty literal minded, so this query is also a non-starter:
>>> if "Hard Rock" in genres: ... print "You have hard rock music in your library." ... else: ... print "Nope." ... Nope.
...but this is hardly the biggest problem. Just keep in mind that case sensitivity will trip you up when you least expect it.
Now let's talk a bit about how people search for things. Would you say that "sisters of mercy" and "The Sisters of Mercy" probably both refer to the same thing? If you scroll back and look at the contents of the list of artists, inductive reasoning tells us that yes, they probably do. There are many reasons we can say this is "true enough" (which is to say, close enough to make no practical difference), from the fact that one of the strings you matched against ("The Sisters of Mercy") are in a list of musical artists, that whether or not the words are properly capitalized doesn't really matter for stuff like this, and that stopwords like "the" don't actually add any information so we can ignore them for the purpose of searching. There is, however, a more formal way of stating this, which is to say that the strings "The Sisters of Mercy" (from the array called `artists`) and "sisters of mercy" (my search term) have a great deal in common with each other, and so have a high probability of being the same, or close enough that we don't care.
First, we need a metric of some kind to measure how similar two strings are. Let's say that every time a letter from the first string and a letter from the second string are exactly the same (same letter, same capitalization) we give that letter a score of 1. If a letter from the first string and a letter from the second string are slightly different (one is capital and the other is lowercase), we give it a score of 0.5. And if the two letters don't match at all we give it a 0. So, let's establish identity: Anything is identical to itself.
T h e S i s t e r s o f M e r c y T h e S i s t e r s o f M e r c y 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 == 17/17
s i s t e r s o f m e r c y s i s t e r s o f m e r c y 1 1 1 1 1 1 1 1 1 1 1 1 1 1 == 14/14
As you can see, when you compare a string to itself you'll always get the maximum score, which happens to also be the length of the string. Now let's compare the two strings that are sort of but not quite (to our eyes) the same:
T h e S i s t e r s o f M e r c y :17 s i s t e r s o f m e r c y :14 0 0 0 0.5 1 1 1 1 1 1 1 1 0.5 1 1 1 1 == 13/17
A few things become apparent: First, the maximum score will always be the length of the longest string being compared. This is due to the fact that two strings that aren't even the same length (counting characters) can never be identical - they'll never have the maximum possible score. Second, we'll frequently come close but not exact. We can work with close matches. Third, because fractions can always be turned into decimals by dividing the numerator by the denominator we can turn that into a percentage by multiplying by 100. So, let's calculate our confidence in this match: (13/17) *100 == 76.47%
Right off the bat we can be at least 76.47% sure that the two strings refer to the same thing. In context, we can be at least 76.47% sure that the search query got at least one hit in the database. I think that's close enough to see if I have a band's work on my media box. Now let's try it with two entirely different strings to show what a miss looks like:
P s y k o s o n i k :10 B e e t h o v e n :9 0 0 0 0 0 0 0 0 0 == 0/10
Sound childish? Way too simple? Maybe. This post is as much to sort out and process the last month as it is to leave a reference for somebody out there who's doing something similar. Never plant a date palm and expect to eat the dates from it, you know?
Let's look at one more subtlety of our earlier example, the probable hit in the database: See how I lined the characters up? All of the 1's jump out at you, which is fine because it's a good hit, but look at the zeroes: They were places where I had to manually insert a space to make the letters line up. Those 0.5's? Those represent very minor changes, flips of capitalization in this example. In other words, each of those things that weren't 1's represent changes that I made to the first string to get to the second one.
T h e S i s t e r s o f M e r c y :17 s i s t e r s o f m e r c y :14 ✓ ✓ ✓ ✓ ✓ :5
The final metric is 5 - the number of changes I made to turn the upper string into the lower string. This implies that strings that are substantially similar will require fewer mutations - addition of, deletion of, or substitution of characters. If similar strings require fewer edits, then it would stand to reason that the lower the number of edits, the greater the chance that the two strings may as well be the same. Let's look at another example:
D e l h i 2 D u b l i n :12 D e l i 2 d u b l i n :11 ✓ ✓ ✓ :3
I misspelled the name of the band - I typed 'Deli' instead of 'Delhi'. The distance between the two strings was also quite small, three edits, so again there's a very good chance that the two strings mean the same thing.
Now let's see what the negative case looks like - the strings aren't alike at all, so there's no match in the database (semantically, anyway):
M e g a l o m a n i a c :12 c l a s s i c a l :9 ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ :11
To turn "Megalomaniac" into "classical" would require at least eleven edits, which implies that higher numbers of changes required imply that the two things aren't alike at all.
Now we can optimize things a little bit. Remember what I said earlier about case sensitivity? By normalizing strings before comparing them we can get a few more matches that we wouldn't otherwise have gotten, at the potential cost of lowering search accuracy in terms of what you were actually searching for (though maybe not its meaning). In this next example, "S/s" and "W/w" mean "This was capital but I forced it to lowercase" and the re-alignment of the letters is simply to make everything line up.
S/s c e n t o f a W/w o m a n :13 s c e n t o f a w o m a n :13 :0
As you can see, by assuming both strings are composed of all lowercase characters, the number of edits to turn one into the other is 0. Treating this as a database search, this means that we've got a perfect match. What I have just described is called the calculation of the Levenshtein distance between two strings. The greater the difference between two strings, the more different they are and vice-versa.