"MD5 considered harmful today"... but why?

Jan 01, 2009

If you've been following net.news in the past twenty-four to forty-eight hours you heard about what went down at the Chaos Computer Congress yesterday - a group of security researchers figured out how to exploit the flaws in the MD5 hash algorithm to forge CA certificates, thus placing SSL encryption as we know it in jeopardy.

...right? Breaking SSL is bad, yeah?

Like many things in life (and nearly everything in cryptography) it's not that simple or that straightforward. Yes, this is bad, but it's not "go back to punchcards" bad.

Let's take it step by step. First of all, here's the paper written by Alexander Sotirov, Marc Stevens, Jacob Appelbaum, et al. Regardless of your technical acumen you really should make an attempt to read it because the high points don't require a Ph.D in maths to understand. Also, the introduction has links over parts to skip if you're not Bruce Schneier. If nothing else, keep a copy of it open in another tab to refer back to.

Now we need to establish what exactly this research impacts... Some axioms are needed for the sake of this discussion: take it as given that a message digest algorithm takes an arbitrary input of some kind (it doesn't much matter what, it could be text, a Word document, or whatever else you can think of), grinds it up, and spits out a fixed-length string of letters and numbers. Message digest algorithms actually generate as their output long sequences of bits that don't necessarily correspond to text but I'm simplifying a bit. Also take it as given that message digest algorithms (let's call them 'hashes' henceforth; most everybody else does) are supposed to generate a unique output for every possible input you give them. This means that you can take a hash of a file that you know, generate a hash of a file you have, and compare the two to make sure that a) the file is really the one you're supposed to have, and b) the file you have hasn't been corrupted because hashes are sensitive to changes of even a single bit. They're good for making sure that files haven't become corrupted in transit or due to bad storage media or

At least, that's the theory; MD5's output is 128 bits (32 characters after encoding), which enumerates to about 3.4028237*10^38 possible unique hashes. This is a gargantuan number of possibilities but a finite number nonetheless.

Back in 2008, a couple of Chinese cryptographers discovered some ways to force collisions in the output of MD5 and a couple of other hashing algorithms. This means that they found more than one input to the algorithm which results in the generation of the same output. To give a slightly silly example, this is like taking an MP3 of White Roses v1.0 by Information Society and an MP3 of Never Gonna Give You Up by Rick Astley and discovering that they have the same hash value, so if you only went by the MD5 hashes you could substitute one file for the other and not realize it until you listened to the music.

From a cryptographic perspective they're important because they are an integral part of digital signatures. To add to our list of axioms for the moment, take for granted that there is something called public key cryptography, that it involves pairs of cryptographic keys, and that one of those keys is distributed to everyone while the other remains a secret.

One of the neat things about public key crypto is that whatever you encrypt using someone's public key, they can decrypt with their private key, and vice-versa. This handles a major problem in crypto (exchanging keys) but there is a nifty side effect: it lets you verify the identity (within certain limits) of whomever wrote something. A good example is posting the source code to an open-source project hosted at Sourceforge: if you're not particuarly paranoid you can just assume that the listed author of a project made an archive of the code available but you also have to assume that nobody has altered the code in any way (say, to add a backdoor, which has happened in the past).

However, if the founder of the project has a PGP key posted someplace to help establish their identity (usually on the project page) they can take a copy of the archive, encrypt it with their private key, and publish it alongside the archive of source code. People who then download both files can decrypt the encrypted archive with the maintainer's public key and use it to verify the integrity and authenticity of the source code. There's just one problem withthat method - you have to download the archive twice, and in many cases (say, with the source to the Linux kernel) the archives are tens of megabytes in size, and downloading that much data twice just to verify the integrity is a waste of time and resources.

Enter message digests. The maintainer uses PGP or GnuPG to make what is called a detached signature for the file, which means that the software makes a hash of the tarball and encrypts it with the user's private key. That detached signature is posted along with the tarball, so that all you have to do to verify the authenticity and integrity of the files is use PGP or GnuPG (for the record gpg -ab foobar.file will generate a detached digital signature, and gpg foobar.file.asc will verify it). Thus, if the signature ever doesn't match, or the hash fails, you know something wiggy happened to the files you downloaded at some point and should investigate.

(Those of you who are more knowledgable in cryptography are probably wondering why I didn't mention the PGP web of trust, stating trust in and signing public keys, and posting them to keyservers. We're already pretty far in the weeds insofar as the original topic of discussion and I'm trying to keep things on track as I can.)

Now we have to talk about something called PKI, the public key infrastructure. If you shop or pay your bills online you've probably made use of it and not realized it due to the fact that many sites encrypt communications by default using SSL these days. The PKI is basically an arrangement between a number of entities (usually banks, though domain registrars and the odd corporation have thrown their hats into the ring) which says that they will verify, to the best of their ability, the identity of whomever asks them to issue a cryptographic certificate. After verifying that their customers are who they say they are (which can involve anything from a background investigation to the customer paying them a couple of hundred US dollars) the entity in question will then generate a cryptographic certificate containing public and private keys for encryption and digital signatures and send them back to the customer to install in their web servers.

The entities that participate in the PKI (called CAs, or certificate authorities) have issued their own certs which they use to digitally sign the certificates of their customers. The public halves of those certs are included with every web browser and are updated periodically (to see them in Firefox, open the Preferences window, go to Advanced, click the Encryption tab, and click View Certificates; in Internet Explorer 7 click Tools, click Internet Options, go to the Content tab, click Certificates, and click the Trusted Root Certification Authorities). Whenever you use your browser to go to an encrypted website (say, Google Mail) the web server on the other side sends the public half of its certificate across the link as part of SSL negotiation; your web browser then verifies the signature(s) on the public cert from it's cache of certs from the root CAs. If all goes according to plan you'll get that little padlock icon in your status bar that we all know and love.

So basically, to make sure that the webmail site you just went to belongs to Google, your browser has to ensure that the certificate the website sent to your web browser was really cryptographically signed by Thawte Consulting, Ltd. Over yonder, there be dragons.

Now, you're probably wondering when I'll get to what was announced at CCC yesterday. This is it.

Sotirov, et al followed the process that most but not all websites follow: they submitted a certificate to a root CA through legitimate channels to be cryptographically signed (specifically, a root CA which still uses MD5 in the generation of their cryptographic signatures: Equifax's RapidSSL, FreeSSL, TC TrustCenter AG in Germany, RSA Data Security (horrors!), Thawte Consulting, and Verisign's branch in Japan). The cert they submitted was specially designed to collide, or have the same MD5 hash as another, legitimate cert; that legitimate certificate is not actually an SSL cert but a certificate used to sign the certs of other websites. The researchers then filed the digital signature off of the certificate they got back, cut the digital signature out of the intermediate CA cert they got from somewhere else (possibly the cache of one of their web browsers), and transplanted it into their own. Presto: instant rogue CA certificate, perfect for digitally signing any other SSL cert to make it look official.

It should be noted by the reader that it's trivial to fake most of the identifying information in an SSL certificate: the cert generation software asks for your name, address, company, division, and whatnot. Submitting a certificate with false information will be caught by most any CA but if you are your own CA it doesn't matter.

Now an attacker could not only use generate an SSL cert for a fake website but they could also use the rogue CA cert to sign it so that web browsers will accept the forged certificate without asking twice.

I should stress, however, that this is not an easy attack to mount. It would require a considerable amount of research and hardware for it to be practical. You need to be able to predict the serial number of the issued certificate an arbitrary period of time in the future as well as the timestamp of the issued certificate for it to work, which is best gotten by buying a number of certificates from the target and monitoring the relevant fields of the SSL certificates. A great deal of CPU time is necessary to generate an SSL certificate that has the same MD5 hash as an intermediate CA cert (on the order of several days), so you'll need to be able to predict the serial number and timestamp of an arbitrary certificate X days in the future. Hopefully, the CA you're targetting uses sequential instead of pseudo-random serial numbers... in addition to all of this, a considerable amount of jiggery-pokery has to be done to the rogue CA cert which involves an in-depth knowledge of the structure of X.509 certificates. You'll also have to have enough cash on hand to buy enough certificates at the right times to manipulate the serial numbers of certs issued by a given CA. Check out section 5.4 of the original paper ("Real life execution") - this isn't an easy attack by any means.

Oh, there's also one more thing: most CAs, including the big ones, stopped using MD5 in their digital signatures years ago. Probably shortly after the initial papers about forcing collisions in its output were published. They're using SHA1 (usually in addition to MD5) to generate the signed hashes.

I find it interesting that the computational heavy lifting (the brute-forcing of bit patterns that have the same MD5 hash as a known bit pattern) was done with commodity hardware... a cluster of Playstation 3 game consoles built by LACAL (Arjen Lenstra's Laboratory for Cryptologic Algorithms) in Switzerland. Total time to brute-force a collision: eighteen hours when 200 PS3s were used.

I guess that lays to rest at least part of an urban legend about Playstation 2's and 3's being considered munitions due to their processing power. Doubly so (for whatever the worth of the former statment is) when you take into account the fact that you can install and run Linux on the PS3.

One thing to keep in mind is that this is a proof of concept project: to the best of anyone's knowledge no one else has tried this sort of thing, which isn't to say that anyone won't in the future. This combination of vulnerabilities isn't being actively exploited right now, and hopefully the CAs that still use MD5 hashes in their digital signatures will migrate to something else in the very near future, like the SHA-2 or SHA-256 message digest algorithms.

I wonder if anyone will give any thought to setting up an underground rogue CA, the better to make a business off of phishermen. There are underground boards for selling stolen identities; there are underground businesses that sell malware and virus construction kits; you can even rent time on botnets. Why couldn't there be a CA that'll sell you a forged SSL cert for, say, Citibank or Paypal? Rather than use PS3s for the heavy lifting, why not use a botnet? That kind of distributed computing is perfect for this sort of thing.

This work by The Doctor [412/724/301/703] is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 License.