Case in point -- let's take Twitter. Earlier today, I submitted the following update (because I was reading Astrid Lindgren's Karlsson-on-the-Roof in Chinese translation that I found on the web):
    我风华正茂:英俊、绝顶聪明、不胖不瘦! (красивый, умный, в меру упитанный мужчина в полном расцвете сил) :D
I was using the web interface to post it, and the counter at the top-right of the form dutifully told me that I had 53 more characters left, because Mozilla gets Unicode right. However, once I submitted the post, Twitter told me that oops, I went way over 140 characters and thus my post will be truncated when shown on the site.
Now, this disconnect happens because my post contained many multibyte characters -- 3 bytes for each Chinese character, and 2 bytes for each Cyrillic character:
    print strlen('我') . "\n"; // output: 3
    print strlen('я') . "\n"; // output: 2
A lot of software was written to deal with "unibyte" characters -- where one byte is used to represent a character, as is the case with the venerable US-ASCII or ISO-8859-1. PHP is one example -- when you use strlen() to calculate the length of a string, it will give you its length in bytes, and not its length in characters.
While this is arguably a sane default behaviour (if I wanted the string size, I would have asked for a string size, not length?), the trouble here is that this chokes on multibyte characters when calculating string lengths. Furthermore, this practice often results in data mutilation, for example when trying to auto-calculate the "short version" of a string and then offer a "read more" link, or when truncating something to fit into visual space constraints.
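For comparison, here is a minimal sketch of the same size-versus-length distinction in Python 3, where strings are sequences of characters and you have to encode explicitly to get bytes. Python 3 is my assumption here (the samples in this post are PHP and Python 2), and the helper name utf8_size is made up for illustration.

```python
def utf8_size(text):
    """Return the number of bytes text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


for ch in ("a", "я", "我"):
    # len() counts characters (code points); utf8_size() counts encoded bytes
    print(ch, len(ch), utf8_size(ch))
```

Keeping the two numbers as separate, explicitly named operations makes it harder to hand a byte count to code that expects a character count, or the other way around.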
Consider this:
    print substr('你叫什么名字？', 0, 10) . " (read more)\n";
That just sliced the string mid-character, and the broken tail will usually show up as some version of an OS-specific "[?]" glyph. And, of course, that actually truncated the string to 3 characters (plus junk) instead of the 10 characters we wanted.
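If the limit really is a byte budget (a fixed-width database column, say), one way to avoid the mid-character cut is to slice the bytes and then drop whatever incomplete character is left at the end. This is a rough Python 3 sketch, not something from any particular library; truncate_utf8 is a hypothetical helper name.

```python
def truncate_utf8(text, max_bytes):
    """Cut text to at most max_bytes of UTF-8 without splitting a character."""
    raw = text.encode("utf-8")[:max_bytes]
    # The slice may end in the middle of a multibyte sequence;
    # errors="ignore" drops that incomplete tail instead of raising.
    return raw.decode("utf-8", errors="ignore")


title = "你叫什么名字？"
print(truncate_utf8(title, 10) + " (read more)")  # keeps 3 whole characters
```

Because the bytes come from encoding a valid string, the only undecodable part can be the tail, so nothing in the middle gets silently dropped.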
Different programming environments cope with this problem in different ways, but most of them require extra work. PHP deals with Unicode by providing an "mbstring" interface to most string functions. For example, we can use mb_strlen() and mb_substr() to perform Unicode-aware string manipulation just as we would with regular strlen() and substr():
    mb_internal_encoding('utf-8');
    print strlen('你叫什么名字？') . "\n"; // output: 21
    print mb_strlen('你叫什么名字？') . "\n"; // output: 7
This will also do what we actually want and won't chop things off mid-torso:
    print mb_substr('你叫什么名字？', 0, 10) . " (read more)\n";
PHP even has an option to replace all string functions with their multibyte equivalents, but this is fraught with danger, because there's a good chance that the software you use will want to actually calculate the byte length of a string and not its character length, e.g. when trying to figure out the size of a binary blob. Read more about mbstring and PHP.
Python, which is also internally all-ASCII all the time (until python 3000 comes around, that is), deals with multibyte strings in similarly clunky ways:
    # -*- coding: utf-8 -*-
    mystr = '你叫什么名字？'
    print mystr[:4] + ' (read more)' # outputs junk
In order to correctly handle Unicode strings, you first have to go from a string object to a unicode object by way of .decode('utf-8'):
    myuni = mystr.decode('utf-8')
    print myuni[:4] + ' (read more)' # yay!
Alternatively, you can prefix your string literals with 'u' to go straight to a unicode object, bypassing the ASCII-centric string object:
    myuni = u'你叫什么名字？'
    print myuni[:4] + ' (read more)' # yay!
However, you'll still be doing a lot of .decode('utf-8') when you are doing things like reading data from a file. The linked talk is probably the most succinct and useful presentation on Python and Unicode I've found: you should read it.
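For what it's worth, here is a sketch of how the file-reading case looks in Python 3 (the "python 3000" mentioned above), where str is already a Unicode type and decoding happens at the point where bytes enter the program. The temporary file and the variable names are only there to make the example self-contained.

```python
import os
import tempfile

text = "你叫什么名字？"

# Write the sample to a temporary file as raw UTF-8 bytes.
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(text.encode("utf-8"))
    path = tmp.name

try:
    # Naming the encoding makes open() decode bytes into str while reading.
    with open(path, encoding="utf-8") as fh:
        loaded = fh.read()
finally:
    os.remove(path)

print(loaded[:4] + " (read more)")  # slices by character, not by byte
```

The idea is to decode once at the boundary and work with character strings everywhere else, rather than sprinkling .decode('utf-8') calls throughout the code.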
Conclusion:
Yes, Unicode is a pain in the ass, and requires jumping through extra hoops whenever you have to deal with it. However, believe me when I tell you that if you get used to the idea of Unicode from the very first line of your application, you won't have to later go back and rewrite things, potentially subtly breaking them in the process (e.g. see Twitter). Retrofitting an existing application to make it support Unicode quite often involves lots and lots of eye-stabbing and pain.
Oh, and the first person who says "why doesn't everyone just use English?" will be promptly fed to the most rabid apparatchiks of the Office Québécois de la Langue Française. ;)



