Parsing simple commands in Python.
A couple of weeks ago I ran into some of the functional limits of my web search bot, a bot that I wrote for my exocortex which accepts English-like commands ("Send me top 15 hits for HAL 9000 quotes.") and runs web searches in response using the Searx meta-search engine on the back end. This is to say that I gave my bot a broken command ("Send hits for HAL 9000 quotes.") and the parser got into a state where it couldn't cope, threw an exception, and crashed. To be fair, my command parser was very brittle and it was only a matter of time before I did something dumb and wrecked it. At the time I patched it with a bunch of if..then checks for truncated and incorrect commands, but if you look at all of the conditionals and ad-hoc error handling I probably made the situation worse, as well as much more difficult to maintain in the long run. Time for a rewrite.
Back to my long-term memory field. What to do?
I knew from comp.sci classes long ago that compilers use things called parsers and grammars to interpret code so that it can be converted into an executable. I also knew that the parser Infocom used in its interactive fiction was widely considered to be the best anyone had come up with in a long time, and it was efficient enough to run on humble microcomputers like the C-64 and the Apple II. For quite a few years I also ran and hacked on a MOO, which for the purposes of this post you can think of as a massive interactive fiction environment that the players can modify as well as play in; a MOO's command parser does pretty much the same thing as Infocom's IF parser but is responsive to the changes the user's make to their environments. I also recalled something called a parse tree, which I sort-of-kind-of remembered from comp.sci but because I'd never actually done anything with them, I only recalled a low-res sketch. At least I had someplace to start from so I turned my rebooted web search bot loose with a couple of search terms and went through the results after work. I also spent some time chatting with a colleague whose knowledge of the linguistics of programming languages is significantly greater than mine and bouncing some ideas off of him (thanks, TQ!)
But how do I do something with all this random stuff?
Let's start with the basics, because this is a complex topic and the layers of subject matter stack up rapidly. I'm going to stick with informal definitions of terminology so I don't make your eyes bleed, but give enough references to more authoritative sources that you can get more detail if you wish.
Lexical analysis is the process of taking a sequence of symbols (usually words, but they can be commands in a programming language or a sequence of abstract symbols) and breaking the sequence into a list of tokens (smaller sequences of symbols that have a meaning) (the technical term is lexeme). Sometimes each token's meaning is attached to the token in question (like a tag), sometimes it isn't. After lexing is parsing, the act of taking the list of symbols and figuring out whether or not it's a valid sentence in a formal grammar, and if so determining what to do with it. A formal grammar is basically a document that defines the list of possible symbols in a language (letters, numbers, punctuation marks), possible strings ("e-mail", "top") and sometimes components ("http://") that can be composed of those symbols, and the rules which govern valid strings (for convenience I'll call them sentences) which can comprise the language specified by the formal grammar. If a sentence can be accurately represented by the formal grammar it's a valid sentence, otherwise it's not (it doesn't parse).
One of the problems I kept running into was how to represent the commands I wanted to send my searchbot as code. As you can see from my first and second attempts, using a bunch of if..then conditionals and deleting elements from arrays from left to right is far from the best way of going about this particular task. So I sat down with the Wikipedia page for something called the Extended Backus-Naur Form, which is a way of writing formal grammars in an easy to implement way. Using EBNF I started writing a grammar for my command language; it looks like yet more line noise but it's really quite straightforward: Start with defining the set of all letters, then all digits. Numbers are comprised of one or more digits. Whitespace (spaces, tabs, what have you) are already represented as a built-in, so I just gave them an easier to use name. Punctuation marks are pretty self explanatory. The EBNF I wrote isn't complete because once I figured out how think about the commands I was able to mock up my examples in a separate script, debug them, and transplant them into my searchbot's source code. I knew that I was crap at writing my own parsers but then again people who write compilers don't write their own parsers or lexers directly, they use tools like flex, yacc, and bison, which strongly implied that the best way of proceeding meant finding and using a parsing tool that somebody much better at this than I am wrote.
Back to threshing through search results. After some false starts I decided to play around with PyParsing, which is a Python module that lets you not only define a grammar but implement a parser directly as Python code. It may not be useful for writing a more general purpose command parser (say, if I wanted to implement a Zcode interpreter in Python) but for giving orders to bots it seemed like an ideal tool. I went looking around for examples and found this rather comprehensive documentation, which I downloaded as a PDF (local mirror) (CC-BY-NC unported v3 license) to Windbringer and my Necronomipad to read through a couple of times during the week.
I started with something simple, asking the bot for online help: "help"
The help command is a single word, punctuation and whitespace are stripped off of the front and back of the command automatically, and it should be case-insensitive. The parser that handles help requests looks like this:
help_command = pp.CaselessLiteral("help")
try:
parsed_command = help_command.parseString(command)
handle_help_command()
return
except pyparsing.ParseException as x:
print "That wasn't a request for help. {0}".format(str(x))
If the parser called help_command matches, a function called handle_help_command() (just pretend that it displays the online help) runs and then the return statement goes back to the code that called it, which listens for commands from the user and sends input to the command parsers. If it doesn't match it'll throw an exception (meaning "This didn't match.") The parse attempt is inside of a try..except block so an exception isn't fatal because it will be caught and handled by except, which prints a message. Here's what happens when I give the parser the command "help":
['help']
Here's what happens when I give the parser a command that isn't "help", like "Fooble":
That wasn't a request for help. Expected 'help' (at char 0), (line:1, col:1)
In other words, the parser expected the string "help" in its input. Remember that particular construct because we're going to see it again later. Now let's define the parsing primitives and grammar for the command "get top [ number ] hits for [ some search term ]." We can break that sentence down like this:
The boxes mean that each of the discrete tokens in the above image can be represented by a separate parsing primitive. Using separate primitives means that they can be re-used later to make new sentences that match the grammar. Here's what they look like:
get_command = pyparsing.CaselessLiteral("get")
top_command = pyparsing.CaselessLiteral("top")
results_count = (pyparsing.Word(pyparsing.nums) |
pyparsing.Word(pyparsing.alphas + "-")).setResultsName("count")
hitsfor_command = pyparsing.CaselessLiteral("hits for")
search_term = pyparsing.Word(pyparsing.alphanums + "_,'-")
search_terms = pyparsing.OneOrMore(search_term)
The parsing primitive results_count can match either numbers (23) or words that represent numbers (twenty-three) using some code I wrote a while ago that converts the latter into the former (here and here, in case you're interested). The vertical bar ( | ) in results_count is a logical-or, meaning "either the part before the vertical bar, or the part after the vertical bar." The .setResultsName("count") part of results_count is a keyword I can use to extract the number of requested search results from the parsed sentence, thusly: number_of_search_results = parsed_command["count"]. search_terms matches the actual search terms I send the bot, and is comprised of one or more matches for the parsing primitive search_term, which matches a token comprised of letters, numbers, underscores, commas, single quotes, or dashes. So, search_terms will match this:
hal
or this:
hal-9000 quotes
or this:
Open the pod bay doors, please, HAL.
Now to stack the parsers in such a way that if one doesn't match the next will be attempted, and any parsers following that one. Here's how I actually do it in the code for web_search_bot.py for reasons too lengthy to go into here, but conceptually speaking here's how it works:
get_command_parser = get_command + top_command + results_count + hitsfor_command \
+ pp.Group(search_terms).setResultsName("searchterms")
send_command_parser = send_command + destination + top_command + results_count \
+ hitsfor_command + pp.Group(search_terms).setResultsName("searchterms")
try:
parsed_command = help_command_parser.parseString(command)
handle_help_command()
return
except pp.ParseException as x:
print "That wasn't a request for help. {0}".format(str(x))
try:
parsed_command = get_command_parser.parseString(command)
handle_get_request(parsed_command["count"], parsed_command["searchterms"])
return
except pp.ParseException as x:
print "That wasn't a request to get search results. {0}".format(str(x))
try:
parsed_command = send_command_parser.parseString(command)
handle_send_request(parsed_command["count"], parsed_command["searchterms"])
return
except pp.ParseException as x:
print "That wasn't a request to email search results. {0}".format(str(x))
# This code runs if none of the above parsers match.
print "I'm sorry, I didn't understand any of those commands."
print "Please try again, or enter the command 'help' for assistance."
return
So, all I have to worry about is whether or not a given parser matched the input, and if it did match send parsed_command to a function that knows how to handle that particular kind of search request. When the function that handles a parsed_command completes, the code uses return to bounce back to the waiting-for-commands part of the bot's code.
That's pretty much it in a nutshell. The above block of code can be extended arbitrarily to handle more kinds of search requests if it needs to be. If you're interested in looking at the codebase for web_search_bot.py it's in my exocortex-halo Github repository.
Happy hacking.