
Text is Never "Plain" (Practice Safe Text, Part I)

I started a version of this post about a year and a half ago, when users of a CMS tool I'd written noticed strange characters in their documents. Quotation marks were showing up as foreign letters. The legal section symbol (§) was showing up as the degree sign (°).

You've probably seen this problem yourself in email or when copying and pasting web content. The text looked fine to the person who gave it to you, but to you, it looks like garbage. Why? As one of the clients using my application asked, "It's just plain text! What's the problem?"

The problem is that there is NO SUCH THING AS PLAIN TEXT. Whether you are a web user, author, or developer, you need to banish the phrase "plain text" from your vocabulary. Telling someone a document is "plain text" is like walking into a shop in Europe and expecting everyone to speak English. It's parochial and ignorant. I'm sorry to be blunt, but it is.

Mind you, the problem was my fault. My program was corrupting their documents because the web server I retrieved them from was telling me they were a certain kind of "plain text." But they were an entirely different kind of "plain text." Modern web browsers know better than to believe what they're told, so the documents looked OK when the users viewed them in their browsers. But after they went through my naïve application, their documents looked wrong.

My application violated Jon Postel's principle of robustness:

The implementation of a protocol must be robust. Each implementation must expect to interoperate with others created by different individuals. While the goal of this specification is to be explicit about the protocol there is the possibility of differing interpretations. In general, an implementation should be conservative in its sending behavior, and liberal in its receiving behavior. That is, it should be careful to send well-formed datagrams, but should accept any datagram that it can interpret (e.g., not object to technical errors where the meaning is still clear).

(Internet Engineering Note 111: Internet Protocol Specification, Jon Postel, August 1979, Section 3.2)

It was also not really fair of me to blame the web server for sending incorrect content. It had also been lied to.

The Most Common Lie On the Web

No, it's not about transferring money for foreign royalty. It happens when a user opens a Microsoft Word document, copies it, and then pastes it into a web-based editing system. That lie causes lots of complications, and most modern web CMSs have built-in logic to detect and correct that lie.

Why is that a lie? Because the text copied from Word is in a format that's likely different from the format used in your web browser and web server, and therefore, in your web CMS. The lie happens when the web CMS accepts the pasted content without examining it, and then passes it on to the server as if it were in the format the server expects. The server is not likely to detect the error, and will from then on deliver the content incorrectly. And everyone involved probably, at some point, referred to the text as "plain text." Except they all meant something different by the phrase.

So why is text never plain? What's so complicated about storing a bunch of letters? Let me first recommend one of my favorite blog posts of all time: Joel Spolsky's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)." You have no business whatsoever calling yourself a web or CMS developer if you don't understand everything in that article, and then some. Plus, it's brilliantly written and quite funny. So if you're a developer, go there and don't come back until you've read and understood it.

Meanwhile, anyone else involved in a web project -- editors, project managers, designers -- still needs to understand a bit (hahaha, get it?) about what text actually is. On a computer, text is not letters. A text file is bits, just like everything else on your computer. There is nothing inherent in your file that distinguishes it as a "text file" rather than a "program" or an "image." It's a text file because you said it was. Don't try this at home, but if you rename word.exe to word.txt and double-click it, Notepad will politely open up and show you an enormous file of garbage characters.

So how does the computer know to translate a particular set of bits into letters and numbers? The more knowledgeable among you may cry "ASCII!" Oh really? Are you sure? And if it is, do you know which ASCII?

ASCII and Ye Shall Receive. Hopefully.

ASCII is a forty-year-old standard that assigns a sequence of seven bits to letters, numbers, punctuation and a few control codes. "K", for instance, was assigned 0100 1011. The more detail-oriented among you may notice that there are eight bits there, not seven. That's because on most computers, bits are grouped into "bytes" of eight bits.
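
If you want to see this for yourself, here's a quick sketch in Python (my choice for illustration; the post isn't about any particular language). It just asks for the number behind "K" and prints it as bits:

```python
# ASCII "K" is the number 75, or 100 1011 in binary.
print(ord("K"))                  # 75
print(format(ord("K"), "07b"))   # 1001011
print(chr(0b1001011))            # K
```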

"Most computers? What do you mean, 'most'?"

Well, that's the thing. Nitpicky technical types will scold me for calling that a "byte" instead of an "octet," since bytes are usually eight bits but have been defined as four, five, six, seven or more bits on different architectures over the years. So if you are processing a text file created on one of those systems, you can forget about your ASCII. Admittedly, those systems (DEC PDP minicomputers, for instance) are pretty rare nowadays, but they exist.

And even in the usual eight-bits-to-the-byte world, ASCII is by no means the only way to interpret bits as letters. IBM mainframes use a system called EBCDIC which everyone likes to ignore until they realize there's still an awful lot of COBOL in the world. (I had originally written "used," but a friend reminded me that her team of developers, who are working to convert mainframe programs to Java, are "learning more than they ever cared to about EBCDIC.") Before the Macintosh, Apple used its own proprietary scheme which had no lowercase letters. (No, really.)
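
Python happens to ship with an EBCDIC codec (code page 037), so you can watch the two worlds disagree about a single letter. A toy illustration, not anything those mainframe teams actually run:

```python
# The letter "A" is 0x41 in ASCII but 0xC1 in EBCDIC (code page 037).
print("A".encode("ascii").hex())  # 41
print("A".encode("cp037").hex())  # c1
```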

Meanwhile, our ASCII standard used only seven of those bits. The eighth ("high") bit was used for other purposes, and this is why to this day in FTP programs you have the option to transfer files as "text" or "binary." In part, "binary" means, "Don't mess with the high bit."

Seven bits is enough space to hold 128 values, or really 127 usable ones, since the value of all zeroes is known as a "NULL" and in many computer systems signifies the end of a text string. So you can't use it as a character. But 127 is plenty of space for the upper and lower case letters, the numbers, a bunch of punctuation symbols, and the various "invisibles" like tabs, carriage returns, and spaces (a space is 0010 0000).
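
A small Python sketch of both points, the space character's bit pattern and what a C-style string routine does when it runs into a NULL (the byte string here is just something I made up):

```python
import ctypes

print(format(ord(" "), "07b"))     # 0100000 - the space character, 0x20
raw = b"plain\x00text"             # a NULL byte hiding in the middle
print(ctypes.c_char_p(raw).value)  # b'plain' - a C-style string ends at the NULL
```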

But what about ñ said the Spanish-speaking world? And what about ß said the Germans? And what about our whole damn alphabet said a lot of the rest of the world. The Hebrew and Arabic writers said ?su tuoba tahW. They all saw the eighth bit, said Aha!, and immediately started creating their own definitions for ASCII 128 and above. Except for the Chinese, Japanese and Koreans, who were still laughing uncontrollably at the thought of a language with only 127 characters, or 255 for that matter. They used (and often still use) two-byte encodings like Big5 and Shift-JIS.
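
To make the one-byte-versus-two-byte difference concrete, here's a quick Python comparison; I'm using Latin-1 (which we'll meet properly in a moment) for the ñ, and Shift-JIS for a Japanese character:

```python
print("ñ".encode("latin-1").hex())     # f1   - one byte, high bit set
print("あ".encode("shift_jis").hex())  # 82a0 - two bytes for one character
```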

So by the 1980s there were dozens of ways to interpret ASCII characters above 127, differing not only by language but by platform. Windows used one, while the Macintosh used another. Atari had ATASCII and Commodore had PETSCII. For Western languages, things were eventually standardized into the "Latin-1" encoding that most modern "ASCII" documents in the West use. But as its name implies, it's only useful for languages that use the Latin alphabet. There are dozens of related standards, called "code pages," for other languages, none of which are interchangeable.
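
Here's the same single byte run through three of those interpretations in Python. None of the readings is "wrong"; they just disagree:

```python
b = bytes([0xA7])
print(b.decode("latin-1"))    # §  - Latin-1 (ISO 8859-1)
print(b.decode("mac_roman"))  # ß  - classic Mac OS Roman
print(b.decode("cp437"))      # º  - the old IBM PC code page
```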

One Encoding To Bind Them All

And so was born Unicode, which solves all of the above problems. Great! So just save documents as Unicode and all will be well, right?

Well ... no. Unicode isn't like ASCII. It doesn't assign a sequence of bits to a character. It assigns each character a number called a "code point." These can be four, five or even six hexadecimal digits long, and are usually written U+201C, where U+ indicates a Unicode code point, and 201C is the hexadecimal number assigned to the character. (One hexadecimal digit represents four binary digits, and it's easier to write 0x0A than 0000 1010.)

Unicode gives us the code point, but we need an encoding to translate that code point into a sequence of bits. (The separation of the logical code point from the physical representation of the character is part of the reason Unicode works so well, but it's frequently misunderstood.)
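
Here's that separation in a few lines of Python, using the opening double quotation mark (U+201C), which will come back to haunt us shortly: one code point, three different byte sequences depending on which encoding you pick:

```python
ch = "\u201C"  # LEFT DOUBLE QUOTATION MARK, code point U+201C
print(ch.encode("utf-8").hex())      # e2809c   - three bytes
print(ch.encode("utf-16-be").hex())  # 201c     - two bytes
print(ch.encode("utf-32-be").hex())  # 0000201c - four bytes
```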

Every Unicode code point fits in four bytes, so perhaps the easiest method is to just use four bytes for every character to store the code point. Well, that doesn't quite work. It ignores the fact that computers don't all read bytes in the same order. What looks like 0x1001 to one system will look like 0x0110 to another. As Joel says, "and lo, it was evening and it was morning and there were already two ways to store Unicode." And those four-byte numbers will include many zero bytes, which translate as nulls and will break many systems and programs.
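
Both problems show up in a couple of lines of Python, the byte-order flip and the pile of zero bytes:

```python
ch = "\u201C"
print(ch.encode("utf-32-be").hex())   # 0000201c - big-endian
print(ch.encode("utf-32-le").hex())   # 1c200000 - little-endian: same bytes, reversed
print("A".encode("utf-32-le").hex())  # 41000000 - three of the four bytes are NULLs
```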

There is an encoding called UTF-32 (or UCS-4) that works this way, with byte-order marks to keep the byte order straight. But there's another problem. Decades and decades of ASCII documents become unreadable under this scheme, and converting them will quadruple their size. Next.
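
And the quadrupling isn't a figure of speech, as a quick Python check shows:

```python
text = "Hello, world"
print(len(text.encode("ascii")))      # 12 bytes
print(len(text.encode("utf-32-be")))  # 48 bytes - four times the size
```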

A half-size version of the same problem happens with the UTF-16 encoding method, although Windows does use a version of it to this day. (And by the way, even though UTF-16 does provide a way to handle characters outside the Basic Multilingual Plane, via pairs of code units called surrogates, many UTF-16 implementations choke on them.)
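
If you'd like to see a surrogate pair in the wild, here's one in Python, using an emoji as my stand-in for "character outside the Basic Multilingual Plane":

```python
ch = "\U0001F600"  # GRINNING FACE, code point U+1F600
print(ch.encode("utf-16-be").hex())  # d83dde00 - two 16-bit code units (a surrogate pair)
print(ch.encode("utf-8").hex())      # f09f9880 - four bytes in UTF-8, for comparison
```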

And so we come to UTF-8, the encoding most frequently used for Unicode text on the web. It starts with the clever idea of storing the traditional 127 ASCII characters in a single byte, so all of those old documents are valid UTF-8 without having to be converted, and new documents that use only those characters will be the same size they would have been with the ASCII encoding.
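
That claim is easy to check in Python: the bytes for an all-ASCII string are identical under both encodings:

```python
old_ascii = "plain old text".encode("ascii")
print(old_ascii == "plain old text".encode("utf-8"))  # True - an ASCII file is already valid UTF-8
```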

Other characters (including those that used to be encoded in ASCII 128 and above) are stored in two to four bytes (the original design allowed up to six), depending on the character. The encoding is designed so that it's impossible to mistake part of a multi-byte character for a single-byte character. Problem solved! UTF-8 can encode over a million characters, and is compatible with ASCII.
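
Here's what that looks like at the bit level, again with our three-byte quotation mark. The lead byte announces how long the sequence is, and every continuation byte starts with "10", so the middle of a character can never be mistaken for the start of one:

```python
for b in "\u201C".encode("utf-8"):
    print(format(b, "08b"))
# 11100010  <- lead byte: "1110" means "a three-byte sequence starts here"
# 10000000  <- continuation byte
# 10011100  <- continuation byte
```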

Except for systems that have been built to assume that one byte = one character. When those systems start reading a UTF-8 document, they produce random gibberish. The 127 ASCII characters will come through unscathed, but a character that uses more than one byte will be translated into a sequence of junk characters. The opening double quotation mark, for instance, is a three-byte character in UTF-8. So instead of the quote, you'll see something like ‚Äú, or a question mark.
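
You can reproduce the whole disaster in three lines of Python: encode a quotation mark correctly, then read the bytes back with the wrong assumption:

```python
quote = "\u201C".encode("utf-8")  # e2 80 9c - an opening double quote, done right
print(quote.decode("mac_roman"))  # ‚Äú - one character becomes three
print(quote.decode("cp1252"))     # â€œ - the Windows flavor of the same mistake
```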

And that is where those garbage characters come from.

Tomorrow: Part II: I Don't Need To Know This Technical Stuff!