Project Byzantium: Sprint #1.
EDITED: 20110318 @ 0955 EST5EDT. See end of article.
A few weekends ago at HacDC a small team of highly skilled hackers gathered to work on practical solutions to a problem which has reared its ugly head time and again in the past few months: a lack of connectivity. Most of the time, when your DSL line goes dead for a couple of hours it's no big deal. If your phone service is tied into DSL (e.g., you're a voice-over-IP customer or the line is physically damaged) it's a bit more of a problem if you don't have an alternate means of communication (like a cellphone or a neighbor who isn't thus impacted). If an emergency situation arises in addition to the lack of connectivity (someone in your house runs out of insulin, somebody gets hurt, the power fails like it did during Snowpocalypse 2010 in DC), not being able to get word out that people need assistance can be dangerous. Or, if you live in Egypt, Tunisia, Iran, or Libya, it could make the difference between freedom and a messy, painful experience in a secret room that doesn't officially exist.
We decided that there are two major problems that had to be tackled: the Katrina problem and the Egypt problem. The Katrina problem (named for Hurricane Katrina, which all but levelled New Orleans in 2005) is when a natural disaster knocks out critical portions of the infrastructure, telecommunications among them. Landline telephones and cellular phones were useless because central offices were knocked out, lines were torn down, long distance links were damaged, and widespread power failures took down COs when uninterruptible power supplies eventually ran out of battery life or fuel. This is one of the reasons that relief efforts in NOLA had dangerously high clusterfuck quotients. A subset of the Katrina problem is the 9/11 problem, in which the telephone network becomes unreliable due to service load. 9/11 showed just how fragile the PSTN really is; when a network designed to provide service to 30% of its users at any one time gets more than that trying to place calls all at once, very few calls go through. The other big problem to be tackled was the Egypt problem, in which the integrity of the local Net is compromised in such a way as to prevent people from communicating with one another, which is precisely the situation we're seeing in the Middle East right now. So, the problem we're really trying to solve is, "How could a small cadre of hackers restore some degree of long distance communication capability to their community in the event things go pear-shaped?"
While amateur radio is an excellent solution, the barrier to entry is high. To legally operate a ham radio you have to be licensed by the FCC, which means studying rather a lot and taking a test to demonstrate your proficiency with the material. While amateur radio equipment can be cheap, you'll pay a fair amount for decent gear, and it's not exactly common. Plus, you need a decent antenna of some kind, and that's not always feasible depending on where you live and how draconian your local homeowners' association is. All of these are prerequisites to actually getting on the air and getting some experience under your belt. Thus, one of our design goals was to build a solution which any reasonably knowledgeable computer geek could hack together using stuff that is likely to be found in one's closet, junk box, dumpster, or children's toybox. We're sticking with open source software because a) it's freely available to everybody, and b) we plan on contributing any modifications we make back upstream so that everyone can use them. This dovetails with one of our operational goals, which is to make the system easy to build and deploy so that many people will be able to use it in case of an emergency.
Part of the solution to these problems lies in wireless mesh networking, a technology in which wireless devices route traffic for one another co-operatively over a large area. Mesh networks are inherently decentralized, so they don't screech to a halt whenever an access point goes down; if one node drops out, nearby nodes will pick up the job of shuffling packets. Mesh networking makes it possible for nodes that are scattered across large distances (farther apart than the maximum range of wi-fi, which is a few hundred feet at best) to maintain connectivity. Also, wi-fi enabled devices are very common these days - most laptops, netbooks, tablet computers, and smartphones have 802.11a/b/g/n interfaces already, and if they don't you can pick up a USB stick which will let you access wireless networks for a couple of dollars at your local office supply store, computer store, Wal-Mega-Target-K-Mart, or even your local drugstore (just do your research first before buying that incredibly cheap netbook, those Sylvanias are junk). It won't take a whole lot of hardware to build a mesh because most people already have it. Almost as a fringe benefit, it is possible for some subset of nodes in a mesh to make services available to all nodes on the mesh (like wikis, file dumps, microblogs, and even gateways to other meshes or the Net). It should also be noted that mesh routing protocols are just that: routing protocols. They handle the fiddly part of making sure that nodes can route traffic automagically so you don't have to work out the routing tables yourself.
Enter Project Byzantium.
The first step in Project Byzantium was to define our use cases: we're assuming that there will be some number of devices in an area which are wi-fi enabled. A subset of those wireless devices will be capable of mesh routing, probably in software that will have to be installed at some point (ideally ahead of time). Another subset of devices will be running services accessible to users of the mesh network (microblogs, forums, wikis, even BBSes). Another subset of mesh routers (zero or more) will be capable of gatewaying traffic to the Net, either via dialup to an ISP, a still-functional source of broadband, tethering to a cellular phone, or possibly a more complex solution involving a file server and USB keys carried by someone on a bicycle, skateboard, or rollerskates. There would also be a subset of mesh routers implementing long-distance links to other meshes, because achieving total coverage of an area of a couple of blocks would be difficult in an emergency. Plus, it would mean less duplication of effort, less confusion (there don't need to be six wikis, only one or two), and a better chance that important information would travel far enough to be useful to the largest number of people.
We're also aiming for the services available on the mesh network to be simple to use from mobile devices, with as short a learning curve as possible. People already use Facebook for organizing events, Twitter for near-realtime exchange of status updates, the Pirate Pad for collaborative composition and editing of documents, and wikis for user-maintained archives of information. We want to leverage as much of that prior user knowledge as we can; if you've edited one wiki you'll be able to edit the one on the mesh without trouble, or at least that's the theory, and if you can use Twitter you can post updates to an instance of status.net. Setting up user accounts would be strictly optional, to get people active as fast as possible. For applications where anonymous access isn't immediately possible, pre-made accounts with well known credentials (mesh/mesh, cypherpunks/cypherpunks, telecomix/telecomix, dc/dc) could be set up and advertised. If those services could be run natively on smartphones rather than on a laptop, so much the better.
We used a variant of the Scrum development methodology, in which we broke Project Byzantium into a number of sprints, or sub-projects. Each sub-project is defined by a certain period of time (one weekend) and has a defined overall goal, sub-goals that together comprise the overall goal, a schedule, and a plan to organize and prioritize effort. The idea is that if a big project is split into smaller problems it is not only easier to manage but much less overwhelming for a small group of developers. It also organizes lessons learned in such a way that solutions to problems or fixes for bugs found in early stages can forecast or inform solutions to problems in later iterations. For sprint number one our overall goal was to build two meshes and bridge them without making use of any existing communication infrastructure. Our design goals were that each mesh had to have at least one public service running and the nodes on one mesh had to be able to reach services running on the other. Each mesh had to have three to five nodes, and nodes on one mesh had to be able to reach most, if not all, of the nodes on the other (verified with pings and portscans).
We gathered on Friday night to take stock of where we were, what we knew, and what resources we had available to us. All of us had at least one laptop, and there were a couple of hackable wireless access points floating around, but our skillsets didn't have much overlap. None of us had ever built a mesh network before, so we were more or less flying blind. By the end of the evening it was decided that we should all go home and read up on the protocols we had decided to experiment with. And thus, we went our separate ways and spent the evening in "I know kung fu" mode, inhaling as much documentation as we possibly could so that we would have some idea of what would be happening the next day.
We decided to test drive two protocols in the first sprint and reserve the option to try something else in the event that either or both sucked too badly to be practical. The first is called Babel (notes) and was designed to run on top of ad-hoc wireless networks, though it'll also work on hardwired nets if need be. Babel runs as a daemon in userspace, so it's pretty much install-and-go, with a minimum of configuration. The Babel daemon needs to be told specifically which network interfaces are available to it, and it determines the best path through the mesh by keeping track of the number of hops required to reach a particular destination as well as the signal quality of neighboring nodes. It's designed primarily to handle mobile nodes, so things won't go wonky as fast if one of the routers is strapped to a moving bicycle, for example. It's also designed to be agnostic insofar as the higher layers of the OSI model are concerned: if you had to, you could run IPX/SPX or IBM's SNA protocols over top of Babel, though it's anybody's guess why you'd have to. It's also particularly lightweight for a mesh routing protocol because it transmits updates only when commanded to do so or when one of a handful of trigger conditions is met.
The other protocol we used is called BATMAN (Better Approach To Mobile Ad-hoc Networking) (notes), and it does substantially the same thing as Babel but in a different way. It too relies on a decentralized model of routing, with no one node having knowledge of all of the possible routes. A BATMAN mesh takes a little time to get up to speed, but does so by observing nearby traffic to detect other nodes in addition to listening for and recording "Hi, I'm here!" messages (broadcast five times a second). BATMAN is also designed to be small, easy to use, and fast to deploy. Unlike Babel, the daemon will automatically make use of whatever network interfaces it detects; Babel has to be explicitly told which interfaces to use, though doing so is trivial. BATMAN will also automatically announce to the mesh which nodes are gateways to the Net, along with the maximum bandwidth available through each gateway, for the purposes of determining optimal routes.
On Saturday morning we split into two teams, one running Babel on our laptops, the other BATMAN. We defined two mesh networks, one called hacdc-babel on channel 9 and the other called hacdc-batman on channel 8. To simplify things we didn't set up encryption; in a real deployment we wouldn't anyway, to make it faster and easier for people to join the mesh. We set up a source code repository on Github to check in whatever files and code we wrote. Our basic wireless configuration involved bringing down the wireless interface, putting it into ad-hoc networking mode, configuring the ESSID and channel, bringing the interface back up, and then running the mesh routing daemon to join the device to the mesh. It's a bit involved but readily scriptable and well documented (Babel). After that, whatever servers would be running on a node would be started up, and, if necessary, the mesh networking software would be re-run in client mode. We had no trouble getting Babel up and running, save for a little trouble on the Mac, but a minor tweak to the Makefile took care of that. That aside, babeld compiled without incident and Just Worked(TM). Those of us on the hacdc-babel network were happily communicating with one another inside of five minutes. To dispatch configuration information to other users on that mesh I set up an AHCP server on Windbringer. It rapidly became apparent that, though AHCP is far simpler to configure than a DHCP server, it also lacks some features of DHCP which would have been nice to have (such as pushing static routes to Babel clients that weren't participating in routing, rather than to Babel nodes which were). Also, all of the mesh nodes have to be running an AHCP client, which isn't common. DHCP is common, and there is no reason that it can't be used on a Babel mesh (other than the fact that the config files for ISC's DHCP server are about as readable as rongorongo).
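For the record, the whole dance boils down to something like the following. Consider this a minimal sketch rather than gospel: the interface name and IP address are assumptions, and newer distributions may want the ip/iw tools instead of ifconfig/iwconfig.

    # Join the hacdc-babel mesh. Assumes the wireless interface is wlan0
    # and that 192.168.3.1 is an address nobody else on the mesh has claimed.
    ifconfig wlan0 down
    iwconfig wlan0 mode ad-hoc essid hacdc-babel channel 9
    ifconfig wlan0 192.168.3.1 netmask 255.255.255.0 up
    babeld wlan0    # babeld has to be told which interfaces to use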
On the other hand, BATMAN proved problematic in the extreme. We had a difficult time getting it to compile, and even when it was running we couldn't figure out for the life of us whether it was working the way it was supposed to. The decision was made to use BATMAN-Advanced instead, in the hope that it would suck less. BATMAN-Adv is implemented as a Linux kernel module, which does limit how many systems it can run on. However, because it is situated inside the kernel as a device driver it operates not on IP packets but on raw network frames, which means that it can route traffic from clients of the mesh network that are not themselves running BATMAN-Adv without any routing table jiggery-pokery. BATMAN-Advanced was a little more involved to configure than Babel, as its documentation was somewhat lacking, but once we worked it out it wasn't that big a deal, and the process for running it is also readily scriptable. The catch was that we weren't sure until later how to determine whether it was even working.
When you use BATMAN-Advanced it creates a virtual interface called bat0 on top of your existing wireless interface (usually wlan0). Once you have that, stop messing with wlan0; everything you do after that, like setting an IP address, you do to bat0. The batctl utility (which must be installed separately) contains layer two analogues of ping, traceroute, and tcpdump specific to the BATMAN-Advanced protocol. Once you figure out how to use it, it's very handy, but it helps to be familiar with layer two. For this reason I can't recommend BATMAN-Advanced in a crunch; it's a bit too fiddly to set up and use. However, by the end of the day each mesh had at least one server running on it (hacdc-babel had an IRC server and hacdc-batman had a web server) and all the nodes of each mesh could see one another, so we achieved all of our goals for the day. We also discovered that some USB wireless devices really suck, so experiment with them ahead of time to find the ones that don't. The 802.11a/b wireless interface manufactured (or at least rebadged) by Acme (no, really, that was the company's name) was more trouble than it was worth.
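For comparison, here is roughly what bringing up a BATMAN-Advanced node looked like. Again, a sketch only: it assumes the batman_adv kernel module and batctl are already installed, and the interface name and address are placeholders.

    # Join the hacdc-batman mesh with BATMAN-Advanced.
    modprobe batman_adv
    ifconfig wlan0 down
    iwconfig wlan0 mode ad-hoc essid hacdc-batman channel 8
    ifconfig wlan0 mtu 1528    # leave headroom for the batman-adv frame header
    ifconfig wlan0 up
    batctl if add wlan0        # enslave wlan0 to the virtual bat0 interface
    ifconfig bat0 192.168.4.1 netmask 255.255.255.0 up
    batctl o                   # list originators (other nodes) to see if it's actually working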
Sunday's task involved bridging the two wireless networks so that members of hacdc-batman could contact members of hacdc-babel. We had a hacked Linksys router running OpenWRT with both Babel and BATMAN-Advanced running at the same time. The theory was that the hacked router could flip packets from one mesh to the other, because all those two mesh routing protocols were really doing was adding, removing, and changing entries in the routing table, and that's one of the things routers are designed to do, after all. This proved problematic also: not knowing ahead of time what OpenWRT had named the packages, a bit of shooting in the dark was required before we figured it out. Also, I'm not familiar with OpenWRT, so I was trying to learn it as I went along (hint: RTFM). Also, Linksys routers are very underpowered when compared to your average netbook or even smartphone; be patient with them, because operations that take a second or two on a laptop might take a few minutes on an embedded system. We did eventually get the two meshes hooked together, but it was something of a struggle.
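If you're trying this at home, the package hunt boils down to something like this; the package names below are my best recollection and may differ between OpenWRT releases, hence the grep.

    # On the OpenWRT router: find out what the packages are actually
    # called, then install them.
    opkg update
    opkg list | grep -i -e babel -e batman
    opkg install babeld kmod-batman-adv batctl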
Before we can consider Project Byzantium a success, there are a couple of problems that will have to be solved. First of all, getting nodes that aren't running mesh routing software onto the mesh as clients is still something of a challenge. Ultimately, so long as they are configured for ad-hoc networking and are associated with the same wireless network it should be possible; it's a matter of the mesh nodes having static routes for the mesh clients at the 'edge' of the network. The IP addresses chosen by the clients are fairly predictable; a lot of networked devices these days, if they don't get a response from a DHCP server, will default to something in 169.254.0.0/16, so mesh nodes will have to have routes in place for those addresses. Seeing as we plan on scripting the operations to start and stop mesh routing functionality, this is trivial. Similarly, distributing IP addresses to nodes joining the mesh might be a bit problematic. Because much of the magick is in the routing, if people pick their own IP addresses it won't be much of a problem, and the mesh networking software won't even blink so long as the routes are in place (we tested this). Associating hostnames of wireless devices with the IP addresses they've either chosen for themselves or been given might prove problematic due to the number of people who never give their smartphones unique names. This dovetails with a larger problem, which is that in the deployment scenarios we're working with there won't be any DNS infrastructure, because there won't be much net.access for anyone.
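By way of illustration, carrying one of those non-meshing clients might look like this on an edge node. The address is made up, and whether the route actually propagates across the mesh depends on the routing daemon's redistribution settings.

    # A client self-assigned the link-local address 169.254.23.42; pin a
    # host route to it on the edge node so the mesh routing daemon can
    # advertise it to everybody else.
    ip route add 169.254.23.42/32 dev wlan0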
A mesh network set up this way will have to have not only its own DHCP servers but also dynamic DNS servers. Whenever a node is given a DHCP lease and its associated packet of information (and you can disseminate quite a bit of information using DHCP), it will then have to contact the DNSes it's configured to use to register its hostname and IP address. The DNS zone files for that mesh network would then be automagically updated. By making a loosely defined yet authoritative DNS zone (like hacdc.mesh), in theory DNS servers in neighboring meshes could then query them for inter-mesh communications. The problem there is that if the DNS server for a mesh goes down, everybody's more or less dead in the water unless the users start memorizing IP addresses (and that doesn't happen). A few of us are kicking around ideas for a lightweight, distributed, peer-to-peer DNS which could run on every mesh node. The idea is that it would be based around a distributed hash table. Every mesh node would, in addition to running a routing daemon, also need to run a distributed DNS client. If a mesh client had chosen its own IP address, the node would record that hostname and IP address locally in addition to distributing it to some subset of mesh nodes by inserting it into the DHT. If a mesh node gave a client its configuration, it would record that information in the same way.
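The registration step, at least, already exists in standard tooling. Here is a sketch using nsupdate against a hypothetical BIND server that has been configured to allow dynamic updates for the hacdc.mesh zone; the server address, hostname, and TTL are all made up.

    # Register this node's hostname and address in the mesh's DNS zone.
    {
        echo "server 192.168.3.1"
        echo "zone hacdc.mesh"
        echo "update add windbringer.hacdc.mesh 300 A 192.168.3.42"
        echo "send"
    } | nsupdate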
Ultimately, it would be cool if we could make installable packages for smart devices (iPhones, iPods, Android devices) that people could install without having to jailbreak or root their units, but I don't know how feasible that would be, because setting up a mesh network requires read-write access to the routing table. It may wind up being that we'll have to get as many geeks involved in this project as we can so that a lot of mesh networks spring up. Of course, people are going to find their own uses for them and will no doubt set them up without there needing to be a crisis of some kind, which would be downright awesome to see.
Another problem we're working on is that of telling people what services are running on the mesh network once it's up. People find websites because they hear about them somehow. People sometimes find file dumps by poking around in the Network Neighborhood of Windows. People hear about social networking websites by talking to other people. There are programmatic ways of going about it, like Apple's Bonjour and Avahi, but they have limited utility at present; it's possible to write service descriptors for things like e-mail servers and websites on intranets, but nobody really does it. Mostly the technology is used for finding printers and the odd NAS, and iTunes seems to use it on the back end for certain things (disclaimer: not an Apple user). So, while it could be done, I have my doubts that it'd be worth the effort.
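For the curious, writing such a service descriptor really is easy; drop something like the following into /etc/avahi/services/ (the filename, service name, and port here are illustrative) and Avahi will advertise the node's web server to anything speaking mDNS/DNS-SD.

    <?xml version="1.0" standalone='no'?>
    <!DOCTYPE service-group SYSTEM "avahi-service.dtd">
    <service-group>
      <name replace-wildcards="yes">Mesh wiki on %h</name>
      <service>
        <type>_http._tcp</type>
        <port>80</port>
      </service>
    </service-group>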
QR codes are finally catching on and can be seen everywhere from posters on the street to business cards to magazine articles. It would be entirely possible to generate a couple of QR codes with URLs beforehand, print them out, and keep them with your mesh networking kit to stick up should the need arise. It would also be easy to just use chalk to write messages in public places: "INTERNET DOWN. SET TO AD-HOC NETWORK, ESSID 'MESH', CHANNEL 2, HTTP://WWW.MESH/". The method often used to bootstrap users of the Tor darknet is the Hidden Wiki, on which a rather large number of services are documented. There is no reason that a couple of mesh hackers couldn't set up and advertise a wiki which serves as a user-curated repository of links to resources on their mesh network. Another problem which will have to be tackled is that of insufficient coverage: if not everyone who can use the mesh is running mesh routing software, how will the network grow? The answer is that multiple meshes will have to be able to communicate with one another. That's a problem that we're working on for our next sprint.
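Incidentally, generating those QR codes ahead of time is a one-liner, assuming you have qrencode installed:

    # Turn the chalk message above into a printable QR code.
    qrencode -o mesh-sign.png "INTERNET DOWN. SET TO AD-HOC NETWORK, ESSID 'MESH', CHANNEL 2, HTTP://WWW.MESH/"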
Also, I was trying to learn IPv6 on the fly. I figured that since a lot of OSes will automagically pick IPv6 addresses, we could get away without having to use DHCP, and it would be trivial to configure BIND to accept DNS updates over IPv6. I was wrong. It doesn't actually work that way. Go ahead and laugh.
Another thing that I'm concerned about is adoption. How would we grow a mesh if we had to? How fast could we grow a mesh if we had to? We'd have to make it possible for some subset of users to configure their systems appropriately (possible with scripting, though it may require jailbreaking or rooting smartphones), install the necessary software, and get up and running. Where would the necessary software come from? Good question: we'd have to have it handy already. That's not a big deal, assuming ten minutes of preparation ahead of time to put some CDs and USB keys in a jump bag, plus talking to the people around you to get them on board. Also, this project is aimed at geeks who can assemble a mesh out of stuff they have in their junkbox more or less at the drop of a hat. One idea I had involved a mesh-enabled variant of the Pirate Box containing relevant software for Windows, distributions of Linux, Mac OS X, and smartphones, along with instructions for installing and activating same.
Ultimately, we'll have to talk to the people around us to get them to join. You can't just set something like this up and not tell anybody about it. For the less socially inclined among us (that includes me) this means talking to random people and getting them to join, and more importantly telling them what's there and how it can be used. Networks don't work unless people find out about them and can do useful things with them.
I would be remiss if I didn't mention the security concerns with either of these protocols. Babel comes right out and says it, BATMAN-Advanced doesn't, but I'll say it: these protocols were not designed for security. They were designed to get nodes routing packets for one another, and that implies listening to, grabbing, and relaying packets whenever and however they hit a network interface. They were meant to get nodes on the mesh doing stuff as fast as possible. They're about as secure as ARP, so keep this in mind. If you want your traffic to be unreadable you'll have to use SSL or TLS for any services you access, or you'll have to trust in the fact that nobody knows who you are online (unless you use your real name someplace).
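By way of example, wrapping Saturday's plaintext IRC server in TLS takes about four lines of stunnel configuration. This is a sketch, assuming stunnel 4; the ports and paths are illustrative, and you'd generate the self-signed certificate with something like "openssl req -new -x509 -days 365 -nodes -out mesh.pem -keyout mesh.pem".

    ; /etc/stunnel/irc.conf -- TLS wrapper for a plaintext ircd on 6667
    cert = /etc/stunnel/mesh.pem

    [ircs]
    accept  = 6697
    connect = 127.0.0.1:6667

Start stunnel with the path to that config file as its only argument and point your IRC client at port 6697.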
Eventually, we want to come up with a free-as-in-speech and free-as-in-beer implementation which we can publish far and wide, that everybody can set up, hack on, and improve. The way we see it, radio hams have field days to practice their skills, so why can't hackers have something similar for mesh networks? Ideally, we'd like to be able to hook multiple hackerspaces together over long distances, but that goal is still a way off. We're working on it, though.
EDIT: Last night at HacDC I was informed by Redbeard that the BATMAN-Advanced mesh networking protocol was integrated into the mainline Linux kernel on 16 December 2010. Version 2.6.38 of the Linux kernel was officially released on 15 March 2011, so keep an eye out for a kernel update for your distribution (or compile it yourself if you've a mind to). Here is the official BATMAN-Advanced project announcement.
Here are some pictures taken at the sprint.
This work by The Doctor [412/724/301/703] has been placed under the Creative Commons By Attribution / Noncommercial / Share Alike 3.0 License.