Practice Safe Text, Part II: Yes, You Need To Know This Technical Stuff

(This is a continuation of yesterday's Text is Never "Plain")

Why should you, the product manager, or editor, or designer, care about things like text encoding? For the simple reason that it can ruin your life. I can almost guarantee that your web site and CMS are full of frightening assumptions about text files. You know when you'll find them? When you decide it's time to present your site in another language, or you start getting content in other languages, or even English content from people using computers in other countries.

Or you'll try to migrate from one CMS to another and run into endless problems bringing over the old content. Some people don't even bother to try (which to my mind is inexcusable). I've done several large content migrations over the last few years, each of which involved more than a decade's worth of content. That means a wide variety of text encodings -- old and new Macintosh, and multiple Windows formats as well as the bastardized results of incorrect conversions among those standards.

If you are working with content from a mainframe you may have bigger problems, and if your content has been poorly converted in the past, you'll have still bigger problems.

But even modern systems can produce gotchas. The routines I use for converting Quark documents to web content have to allow for the fact that Quark may generate UTF-8 or UTF-16 files, depending on the version of the program and whether it's running on a Windows machine or a Mac.
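
Here's the rough shape of that check, sketched in Python rather than my actual conversion code (the file name and the UTF-8 fallback are just for illustration). A couple of bytes of BOM-sniffing is enough to tell the formats apart:

    import codecs

    def sniff_encoding(path):
        """Guess a text file's encoding from its byte-order mark (BOM)."""
        with open(path, "rb") as f:
            head = f.read(4)
        if head.startswith(codecs.BOM_UTF8):
            return "utf-8-sig"      # UTF-8 with a BOM; -sig strips it on read
        if head.startswith(codecs.BOM_UTF16_LE) or head.startswith(codecs.BOM_UTF16_BE):
            return "utf-16"         # Python's utf-16 codec honors the BOM
        return "utf-8"              # no BOM: assume UTF-8 and hope

    encoding = sniff_encoding("export.txt")
    text = open("export.txt", encoding=encoding).read()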

I do a lot of work with Drupal, probably the most widely used open-source CMS on the web. It's based on the PHP programming language, which to this day is not entirely Unicode-compatible. This is a constant difficulty for Drupal users and developers.

The next version of Drupal, version 8, includes an entire UTF-8 utility class, mis-named as "Unicode," to compensate for PHP's problems and inconsistencies with text encodings. Drupal's token system breaks on Arabic strings because of historical PHP problems, and there are many other examples of Drupal developers struggling with PHP's text-encoding issues.

Drupal itself supports only UTF-8, which according to one Drupal honcho means "You no longer need to worry about language-specific encodings." Maybe he doesn't, but Microsoft's SQL Server 2012 does not support UTF-8 encoding, so you'll have to worry about it if you need to do a migration from that system.
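
Concretely: SQL Server stores its NVARCHAR text as UTF-16, so somewhere in that migration you will be transcoding. A minimal sketch in Python, assuming (purely for illustration) that the legacy content was dumped to UTF-16 text files with a BOM:

    # Python's utf-16 codec reads the BOM to determine byte order.
    with open("legacy_dump.txt", encoding="utf-16") as src, \
         open("migrated.txt", "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(line)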

I'm not picking on Drupal. Standardizing on UTF-8 is better than ignoring or minimizing the problem, and good for them for creating that utility class. Many proprietary systems are a lot worse, but I want you to realize that just because you're using a modern CMS doesn't mean you're safe from these issues.

I18n, Pronounced "It Ain't Ending"

Aside from content migration, the other place you're likely to stumble over text-encoding problems is on internationalization projects.

If you are planning on supporting languages that don't use the standard Roman character set, or don't run left to right, then you are probably making many unwarranted assumptions in your design as well as your content. I work with content so I'll leave those issues to someone else, but even ruling out those languages (i.e., confining your product or site to Western Europe and the Americas) will not protect you from these problems.

The most serious issue is that you are probably using systems and languages that claim to support Unicode, but in fact only support certain encodings, or only one. A CMS like Drupal can usually get away with that, but a programming language cannot; there is no way to avoid running into alternate encodings. Some languages and environments confuse support for writing code in Unicode encodings with handling content or data in those encodings.
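
Here's why "we only speak UTF-8" doesn't make the problem go away. The same bytes mean three different things under three legacy encodings you might meet in exactly the kind of migration I described above (Python; the sample bytes are invented):

    raw = b"r\x8esum\x8e"               # "résumé" saved as old Mac Roman text
    print(raw.decode("mac-roman"))      # résumé -- correct
    print(raw.decode("cp1252"))         # rŽsumŽ -- wrong, and no error raised
    print(repr(raw.decode("latin-1")))  # 'r\x8esum\x8e' -- control characters, silently "works"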

Sometimes even systems that try to support Unicode break in unexpected ways once you start relying on that "support." Maybe they assume that one letter = one character. Maybe they use inflexible definitions of the separators between words (See this rant about Ruby's string.split function), or don't know how to distinguish upper from lower case. I've seen far too many examples of people converting between upper and lower case by adding or subtracting 32 from the character code, or with tricks like tr/[A-Z]/[a-z]/.
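
Both failure modes are easy to demonstrate in Python (the sample strings are mine, not from any particular broken system):

    import unicodedata

    s1 = "café"                            # é as one code point (U+00E9)
    s2 = unicodedata.normalize("NFD", s1)  # é as e plus a combining accent
    print(len(s1), len(s2))                # 4 5 -- one letter is not one "character"
    print(s1 == s2)                        # False, though they render identically

    # The ASCII-arithmetic case trick breaks on anything outside A-Z:
    naive = "".join(chr(ord(c) - 32) for c in "straße")
    print(naive)                           # STRA¿E -- garbage for ß
    print("straße".upper())                # STRASSE -- ß uppercases to SS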

Even if your application handles text encodings robustly, what do you know about the systems with which it interacts? Look at this poor Python programmer trying to properly encode characters for insertion into a database. If your content is in one encoding, and the database uses another, you'll need to convert it. And you'll also need to make sure that everyone talking to that database uses the proper encodings and character sets.

It's even possible to unknowingly damage content in transit between systems that use the same encoding. For instance, if you try to import a UTF-8 file into a UTF-8 database, but forget to run SET NAMES utf8 when you run the mysql client to load the data, your data will be silently corrupted on the way in. So will your export if you forget to use --default-character-set or --set-charset with mysqldump. If you're unlucky enough not to notice the error until you've been using the system for a while (and therefore can't just re-import), you'll end up with a horrid mess of attempted SQL corrections that will often do more harm than good.
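
The same discipline applies when your own code talks to the database. A minimal sketch using the PyMySQL library (connection details and table are invented); the charset argument plays the same role for this connection that SET NAMES plays in the mysql client:

    import pymysql

    conn = pymysql.connect(
        host="localhost", user="cms", password="secret", database="site",
        charset="utf8mb4",  # MySQL's "utf8" maxes out at 3 bytes; utf8mb4 is real UTF-8
    )
    with conn.cursor() as cur:
        cur.execute("INSERT INTO node_body (body) VALUES (%s)", ("naïve text",))
    conn.commit()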

What Should You Ask?

Ask your vendors and programmers what assumptions and gotchas may be hiding in your "plain text" files and the systems that handle them. (Don't forget that almost all modern web data formats -- HTML, XML, JSON, etc -- are at heart text documents, and can easily be destroyed by mishandling their encodings.)
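
If you want to see just how easily, here is the classic failure in Python: a perfectly good UTF-8 JSON document read back under Windows cp1252 (the data is invented):

    import json

    payload = json.dumps({"title": "Señor café"}, ensure_ascii=False)
    blob = payload.encode("utf-8")

    print(blob.decode("utf-8"))   # {"title": "Señor café"}   -- intact
    print(blob.decode("cp1252"))  # {"title": "SeÃ±or cafÃ©"} -- classic mojibake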

If you get answers like, "Don't worry, it's a Unicode system," then you have a problem. (In fact, if anyone tells you that a system uses "Unicode encoding," you have a problem. Unicode is not an encoding.)

Make it your business to ask, and understand the basics of the answers to, questions like these:

  • What is our default encoding?
  • Does all our content have a declared encoding?
  • What about content from vendors and feeds?
  • What about legacy content?
  • What about our templates and code?
  • What about our documentation? Our variables? Our database structure? (I worked on a well-known proprietary CMS years ago whose database columns, and many variables, were all in Norwegian. Try setting that system up with ASCII encodings.)
  • What's the collation of our database? What character sets do we use for the database content? (If they think a "collation" and a "character set" are the same thing, smack them.) Do they match for all our different databases? (One way to audit this is sketched below.)
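
For a MySQL database, that audit can be as simple as this Python sketch (again using PyMySQL; the credentials and schema name are placeholders):

    import pymysql

    conn = pymysql.connect(host="localhost", user="cms", password="secret",
                           database="site", charset="utf8mb4")
    with conn.cursor() as cur:
        cur.execute("""
            SELECT table_name, column_name, character_set_name, collation_name
            FROM information_schema.columns
            WHERE table_schema = 'site' AND character_set_name IS NOT NULL
        """)
        for row in cur.fetchall():
            print(row)  # every text column's character set and collation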

Solutions

Remember Postel's law: be conservative in what you send, liberal in what you accept. Make sure your systems are loosely coupled, in the sense that they don't make unwarranted assumptions and can gracefully handle unexpected inputs.
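
For text, that mostly means refusing to either blow up or silently mangle when input isn't what you expected. A minimal Python sketch of the "liberal in what you accept" half:

    def read_lenient(raw: bytes) -> str:
        """Decode incoming text without crashing, but don't hide the damage."""
        try:
            return raw.decode("utf-8")
        except UnicodeDecodeError:
            # U+FFFD replacement characters mark exactly where the bad bytes were.
            print("warning: input was not valid UTF-8; replacing bad bytes")
            return raw.decode("utf-8", errors="replace")

    print(read_lenient(b"caf\xc3\xa9"))  # café
    print(read_lenient(b"caf\xe9"))      # caf\ufffd, plus a warning (latin-1 bytes)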

Use the right tools. For text-processing and migration tasks, I frequently use Perl, because it has the most comprehensive and reliable Unicode implementation of any language in its peer group. (In fact, if you read and understand the Perl manual page on Unicode, or even the Perl Unicode Tutorial and Perl Unicode Introduction, you'll be well on the way to understanding text encoding issues on the modern web.)

If you use other tools, pay attention to the documentation and to the assumptions you're making. Understand how your string functions work, and be aware of special functions like PHP's Multibyte String library or the UnicodeUtils Ruby gem.

JavaScript developers are in a particularly difficult situation. You're at the mercy of multiple unknowns:

  • the declared encoding of the web page (if any)
  • how a particular browser handles that encoding
  • how the browser handles form uploads
  • what is being pasted into your forms

And to cope with those unknowns, you have fragile functions like utf8_encode() that work only on properly encoded ISO-8859-1 text. Which is really wonderful when creating JSON from user input, let me tell you.

If you prefer heavier languages like C and its cousins (C++, Java, C#, etc.) then be very careful in all your string-handling code. You cannot p++ through a Unicode string. strlen() will lie to you. Laugh if you like, but mistakes like this are common causes of security exploits through buffer overruns. (Yes, Java is safer, but its strings are UTF-16 under the hood, so String.length() counts code units, not characters.)
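
You don't even need to write C to see the problem. This Python sketch (sample string invented) shows the three different "lengths" of one short string: what Python counts, what C's strlen() would count, and what Java's String.length() would report:

    s = "naïve🎉"
    print(len(s))                           # 6 code points
    print(len(s.encode("utf-8")))           # 10 bytes -- what strlen() counts
    print(len(s.encode("utf-16-le")) // 2)  # 7 UTF-16 code units -- what Java's
                                            # String.length() counts (🎉 takes two)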

Most importantly, test thoroughly! Make sure that all of your code is tested against all the encodings you claim to support. If you're converting from an English-only site to a site supporting multiple languages, you had better have at least one-third of your project timeline dedicated to testing. If you don't like the idea of increasing your timeline by a third, you also won't like the idea of doubling it with bizarre errors and corrupted content.
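
As a starting point, here is a minimal round-trip test matrix using pytest; the sample strings and the encoding list are placeholders for whatever your site actually claims to support:

    import pytest

    SAMPLES = ["ASCII only", "naïve café", "Ω≈√", "🎉", "العربية"]
    ENCODINGS = ["utf-8", "utf-16", "cp1252", "mac-roman"]

    @pytest.mark.parametrize("text", SAMPLES)
    @pytest.mark.parametrize("encoding", ENCODINGS)
    def test_round_trip(text, encoding):
        try:
            encoded = text.encode(encoding)
        except UnicodeEncodeError:
            pytest.skip(f"{encoding} cannot represent this sample")
        assert encoded.decode(encoding) == text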

Practice Safe Text

No one involved in web content can afford to overlook the complexities of "plain text," especially not as the web becomes less and less US/western-oriented. If you're using buzzwords like "glocal" in front of your customers, you had better be thinking about issues like these back in the office. You can't have a good product without good content, and you can't have good content without understanding how it's encoded and stored.

Furthermore, remember this: The only difference between "harmless text" and "malicious code" is how you handle it. Careless handling of text, especially user-supplied text, is the root cause of some of the most common security holes on the web. Injection attacks in particular often start by fooling a naïve program into doing dangerous things with what it thinks is "safe" text. If you don't pay attention to issues like this, don't be surprised when your users start complaining about getting malware from your site.

The plain truth is, there is no plain text. Have fun and good luck.