Wednesday, November 20, 2013

Why email is hard, part 3: MIME

This post is part 3 of an intermittent series exploring the difficulties of writing an email client. Part 1 describes a brief history of the infrastructure. Part 2 discuses internationalization. This post discusses MIME, the mechanism by which email evolves beyond plain text.

MIME, which stands for Multipurpose Internet Mail Extensions, is primarily dictated by a set of 5 RFCs: RFC 2045, RFC 2046, RFC 2047, RFC 2048, and RFC 2049, although RFC 2048 (which governs registration procedures for new MIME types) was updated with newer versions. RFC 2045 covers the format of related headers, as well as the format of the encodings used to convert 8-bit data into 7-bit for transmission. RFC 2046 describes the basic set of MIME types, most importantly the format of multipart/ types. RFC 2047 was discussed in my part 2 of this series, as it discusses encoding internationalized data in headers. RFC 2049 describes a set of guidelines for how to be conformant when processing MIME; as you might imagine, these are woefully inadequate for modern processing anyways. In practice, it is only the first three documents that matter for building an email client.

There are two main contributions of MIME, which actually makes it a bit hard to know what is meant when people refer to MIME in the abstract. The first contribution, which is of interest mostly to email, is the development of a tree-based representation of email which allows for the inclusion of non-textual parts to messages. This tree is ultimately how attachments and other features are incorporated. The other contribution is the development of a registry of MIME types for different types of file contents. MIME types have promulgated far beyond just the email infrastructure: if you want to describe what kind of file binary blob is, you can refer to it by either a magic header sequence, a file extension, or a MIME type. Searching for terms like MIME libraries will sometimes refer to libraries that actually handle the so-called MIME sniffing process (guessing a MIME type from a file extension or the contents of a file).

MIME types are decomposable into two parts, a media type and a subtype. The type text/plain has a media type of text and a subtype of plain, for example. IANA maintains an official repository of MIME types. There are very few media types, and I would argue that there ought to be fewer. In practice, degradation of unknown MIME types means that there are essentially three "fundamental" types: text/plain (which represents plain, unformatted text and to which unknown text/* types degrade), multipart/mixed (the "default" version of multipart messages; more on this later), and application/octet-stream (which represents unknown, arbitrary binary data). I can understand the separation of the message media type for things which generally follow the basic format of headers+body akin to message/rfc822, although the presence of types like message/partial that don't follow the headers+body format and the requirement to downgrade to application/octet-stream mars usability here. The distinction between image, audio, video and application is petty when you consider that in practice, the distinction isn't going to be able to make clients give better recommendations for how to handle these kinds of content (which really means deciding if it can be displayed inline or if it needs to be handed off to an external client).

Is there a better way to label content types than MIME types? Probably not. X.400 (remember that from my first post?) uses OIDs, in line with the rest of the OSI model, and my limited workings with other systems that use these OIDs is that they are obtuse, effectively opaque identifiers with no inherent semantic meaning. People use file extensions in practice to distinguish between different file types, but not all content types are stored in files (such as multipart/mixed), and the MIME types is a finer granularity to distinguish when needing to guess the type from the start of a file. My only complaints about MIME types are petty and marginal, not about the idea itself.

No, the part of MIME that I have serious complaints with is the MIME tree structure. This allows you to represent emails in arbitrarily complex structures… and onto which the standard view of email as a body with associated attachments is poorly mapped. The heart of this structure is the multipart media type, for which the most important subtypes are mixed, alternative, related, signed, and encrypted. The last two types are meant for cryptographic security definitions [1], and I won't cover them further here. All multipart types have a format where the body consists of parts (each with their own headers) separated by a boundary string. There is space before and after the last parts which consists of semantically-meaningless text sometimes containing a message like "This is a MIME message." meant to be displayed to the now practically-non-existent crowd of people who use clients that don't support MIME.

The simplest type is multipart/mixed, which means that there is no inherent structure to the parts. Attachments to a message use this type: the type of the message is set to multipart/mixed, a body is added as (typically) the first part, and attachments are added as parts with types like image/png (for PNG images). It is also not uncommon to see multipart/mixed types that have a multipart/mixed part within them: some mailing list software attaches footers to messages by wrapping the original message inside a single part of a multipart/mixed message and then appending a text/plain footer.

multipart/related is intended to refer to an HTML page [2] where all of its external resources are included as additional parts. Linking all of these parts together is done by use of a cid: URL scheme. Generating and displaying these messages requires tracking down all URL references in an HTML page, which of course means that email clients that want full support for this feature also need robust HTML (and CSS!) knowledge, and future-proofing is hard. Since the primary body of this type appears first in the tree, it also makes handling this datatype in a streaming manner difficult, since the values to which URLs will be rewritten are not known until after the entire body is parsed.

In contrast, multipart/alternative is used to satisfy the plain-text-or-HTML debate by allowing one to provide a message that is either plain text or HTML [3]. It is also the third-biggest failure of the entire email infrastructure, in my opinion. The natural expectation would be that the parts should be listed in decreasing order of preference, so that streaming clients can reject all the data after it finds the part it will display. Instead, the parts are listed in increasing order of preference, which was done in order to make the plain text part be first in the list, which helps increase readability of MIME messages for those reading email without MIME-aware clients. As a result, streaming clients are unable to progressively display the contents of multipart/alternative until the entire message has been read.

Although multipart/alternative states that all parts must contain the same contents (to varying degrees of degradation), you shouldn't be surprised to learn that this is not exactly the case. There was a period in time when spam filterers looked at only the text/plain side of things, so spammers took to putting "innocuous" messages in the text/plain half and displaying the real spam in the text/html half [4] (this technique appears to have died off a long time ago, though). In another interesting case, I received a bug report with a message containing an image/jpeg and a text/html part within a multipart/alternative [5].

To be fair, the current concept of emails as a body with a set of attachments did not exist when MIME was originally specified. The definition of multipart/parallel plays into this a lot (it means what you think it does: show all of the parts in parallel… somehow). Reading between the lines of the specification also indicates a desire to create interactive emails (via application/postscript, of course). Given that email clients have trouble even displaying HTML properly [6], and the fact that interactivity has the potential to be a walking security hole, it is not hard to see why this functionality fell by the wayside.

The final major challenge that MIME solved was how to fit arbitrary data into a 7-bit format safe for transit. The two encoding schemes they came up with were quoted-printable (which retains most printable characters, but emits non-printable characters in a =XX format, where the Xs are hex characters), and base64 which reencodes every 3 bytes into 4 ASCII characters. Non-encoded data is separated into three categories: 7-bit (which uses only ASCII characters except NUL and bare CR or LF characters), 8-bit (which uses any character but NUL, bare CR, and bare LF), and binary (where everything is possible). A further limitation is placed on all encodings but binary: every line is at most 998 bytes long, not including the terminating CRLF.

A side-effect of these requirements is that all attachments must be considered binary data, even if they are textual formats (like source code), as end-of-line autoconversion is now considered a major misfeature. To make matters even worse, body text for formats with text written in scripts that don't use spaces (such as Japanese or Chinese) can sometimes be prohibited from using 8-bit transfer format due to overly long lines: you can reach the end of a line in as few as 249 characters (UTF-8, non-BMP characters, although Chinese and Japanese typically take three bytes per character). So a single long paragraph can force a message to be entirely encoded in a format with 33% overhead. There have been suggestions for a binary-to-8-bit encoding in the past, but no standardization effort has been made for one [7].

The binary encoding has none of these problems, but no one claims to support it. However, I suspect that violating maximum line length, or adding 8-bit characters to a quoted-printable part, are likely to make it through the mail system, in part because not doing so either increases your security vulnerabilities or requires more implementation effort. Sending lone CR or LF characters is probably fine so long as one is careful to assume that they may be treated as line breaks. Sending a NUL character I suspect could cause some issues due to lack of testing (but it also leaves room for security vulnerabilities to ignore it). In other words, binary-encoded messages probably already work to a large degree in the mail system. Which makes it extremely tempting (even for me) to ignore the specification requirements when composing messages; small wonder then that blatant violations of specifications are common.

This concludes my discussion of MIME. There are certainly many more complaints I have, but this should be sufficient to lay out why building a generic MIME-aware library by itself is hard, and why you do not want to write such a parser yourself. Too bad Thunderbird has at least two different ad-hoc parsers (not libmime or JSMime) that I can think of off the top of my head, both of which are wrong.

[1] I will be covering this in a later post, but the way that signed and encrypted data is represented in MIME actually makes it really easy to introduce flaws in cryptographic code (which, the last time I surveyed major email clients with support for cryptographic code, was done by all of them).
[2] Other types are of course possible in theory, but HTML is all anyone cares about in practice.
[3] There is also text/enriched, which was developed as a stopgap while HTML 3.2 was being developed. Its use in practice is exceedingly slim.
[4] This is one of the reasons I'm minded to make "prefer plain text" do degradation of natural HTML display instead of showing the plain text parts. Not that cleanly degrading HTML is easy.
[5] In the interests of full disclosure, the image/jpeg was actually a PNG image and the HTML claimed to be 7-bit UTF-8 but was actually 8-bit, and it contained a Unicode homograph attack.
[6] Of the major clients, Outlook uses Word's HTML rendering engine, which I recall once reading as being roughly equivalent to IE 5.5 in capability. Webmail is forced to do their own sanitization and sandboxing, and the output leaves something to desire; Gmail is the worst offender here, stripping out all but inline style. Thunderbird and SeaMonkey are nearly alone in using a high-quality layout engine: you can even send a <video> in an email to Thunderbird and have it work properly. :-)
[7] There is yEnc. Its mere existence does contradict several claims (for example, that adding new transfer encodings is infeasible due to install base of software), but it was developed for a slightly different purpose. Some implementation details are hostile to MIME, and although it has been discussed to death on the relevant mailing list several times, no draft was ever made that would integrate it into MIME properly.