Encodings, or: Validating UTF8

Contrary to what some people expect, not every file is valid UTF-8. This means that if you mess up and mangle your encodings, the files may be rejected or cause problems in various software, such as the CurseForge packager or WoW itself.

Note: The repository hooks validate the encoding of Lua files before accepting them.

Notation

Most numbers are in decimal; byte values displayed as AB CD are hexadecimal.

Encodings in a nutshell

At the most fundamental level, computers encode information in chunks called bytes (or more precisely, octets; we shall assume that bytes are 8 bits). Most encodings just use one byte per character. This means there are 256 code points that can be mapped to characters (or digits, punctuation, Kanji, runes, whatever -- we'll call them characters).

While there are dozens of encodings still in use, the following are especially important to this discussion:

  • ASCII is the basis of most western encodings today. Due to its 7-bit heritage, it only maps 128 code points: 0-127.
  • The ISO 8859 family of encodings is the most frequently used group of western encodings. Its parts 1–15, also known as "Latin-1" and so on, are very widely used. Code points 0-127 agree with ASCII; the rest are used for various characters, such that it is possible, e.g., to type most western European languages in Latin-1.
And here's the first catch: the Latin-N encodings do not cover all 256 code points. Specifically, 128-159 are unassigned. The encodings formally known as ISO-8859-1 etc. (note the extra dash) assign control characters to this range, but see below.
  • Windows-1252 in turn is a superset of ISO Latin-1 (ISO 8859-1) that additionally maps printable characters into the range 128-159. This means it is not ISO-8859-1 (note the dash) compatible! As the name suggests, it is in wide use on western European Windows installations; the example below shows the difference.
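
For example (in Python 3 syntax; the byte value 93 is just one illustration), the same byte decodes to a printable character under Windows-1252 but only to an invisible control character under ISO-8859-1:

>>> b"\x93".decode("cp1252")        # Windows-1252: a curly quotation mark
'"'
>>> b"\x93".decode("iso-8859-1")    # ISO-8859-1: the C1 control character 0x93
'\x93'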

But 256 possible characters are quite restrictive. For example, it is not possible to write Chinese or Japanese in their native writing systems (by a wide margin). Along came Unicode and the UCS (Universal Character Set, formally ISO/IEC 10646), which intends to provide characters for all writing systems in use on Earth.

This also means that a different mapping to bytes has to be used. Common variants include:

  • UTF-16, which uses two bytes (16 bits) per character and thus limited the original standard to 65536 code points (more are possible with surrogate pairs).
  • UTF-32, which uses four bytes per character and allowed the second version of the Unicode standard to move beyond the 65k limit. (Unicode now has "room" for 1.1 million characters.)
  • The UTF-1/7/8 transformations, of which UTF-8 is by far the most commonly used.

Unicode notably includes all of Latin-1 on the 256 lowest-numbered code points, which also means that it contains all ASCII characters on the same code points as in ASCII itself.

UTF-16/32 unfortunately require special support from software to be handled sanely. For example, the 'A' character would be encoded in (little-endian) UTF-32 as the bytes 41 00 00 00, and a lot of unprepared software will choke on the null bytes.
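
For instance (a quick check, written in Python 3 syntax):

>>> "A".encode("utf-32-le")
b'A\x00\x00\x00'
>>> "A".encode("utf-16-le")
b'A\x00'
>>> "A".encode("utf-8")
b'A'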

UTF-8 circumvents this problem by guaranteeing the following properties:

  • All ASCII code points (i.e. 0–127) map to the corresponding ASCII bytes (i.e. 0–127).
  • All other code points map to a series of bytes, all of which have values in the range 128–255.
  • No encoding of a character is contained in a (longer) encoding of another character.
  • No encoding contains FE or FF.

This means that it is ASCII-compatible in the same sense that Latin-1 is: software that expects input to be "ASCII and maybe some higher bytes" will be able to cope with it.

However, not all byte sequences are valid UTF-8. The details are linked to the last three properties: roughly speaking, 00–7F correspond to ASCII, C2–F4 are legal at the start of a multibyte sequence, and 80–BF constitute the rest of the multibyte sequence. (There are some exceptions.) F5–FD are currently invalid, but reserved for 5- and 6-byte sequence leaders if UCS ever introduces more characters. Other combinations of these bytes are invalid.
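
As a rough sketch of how those ranges classify individual bytes (Python 3 syntax; not a complete validator, since it ignores the exceptions mentioned above):

>>> def classify(b):
...     if b <= 0x7F: return "ASCII"
...     if 0x80 <= b <= 0xBF: return "continuation byte"
...     if 0xC2 <= b <= 0xF4: return "multibyte sequence leader"
...     return "not valid in UTF-8 today"
...
>>> # u-umlaut (U+00FC) is encoded as the two bytes C3 BC:
>>> [classify(b) for b in "ü".encode("utf-8")]
['multibyte sequence leader', 'continuation byte']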

This concludes the general remarks. You should remember the following points:

Summary: ASCII only maps half the possible byte values. Latin-1, Win-1252 and UTF-8 are all different supersets of ASCII. UTF-8 can encode every language in use. Not every sequence of bytes is valid UTF-8.

Encoding detection and BOM

Unicode provides a special means to detect the specific encoding (and endianness) that a file was written in: the Byte-Order Mark (BOM). This is just the special code point (U+FEFF) which is invisible as a character (it is a zero-width non-breaking space), but serves to distinguish the encodings via its byte representation:

  • EF BB BF for UTF-8,
  • FF FE for little-endian UTF-16,
  • FF FE 00 00 for little-endian UTF-32, etc.

Thus, a file can be marked as UTF-8 simply by putting a BOM at the very beginning. When the file is erroneously interpreted as a different encoding, the BOM will usually show up as three garbage characters (typically ï»¿).
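
For instance (Python 3 syntax), encoding text that starts with the code point U+FEFF yields exactly the UTF-8 BOM bytes listed above:

>>> "\ufeffHello".encode("utf-8")[:3]    # the first three bytes are the BOM
b'\xef\xbb\xbf'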

Many programs also use a heuristic to detect UTF-8, exploiting the property that not all byte sequences are valid: simply attempt to decode the file as UTF-8; if that fails, it's probably Latin-1.
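
A minimal sketch of that heuristic (Python 3 syntax; the file path is only a placeholder):

>>> raw = open(r"c:\path\to\file", "rb").read()
>>> try:
...     text = raw.decode("utf-8")      # decodes cleanly: treat the file as UTF-8
... except UnicodeDecodeError:
...     text = raw.decode("latin-1")    # otherwise fall back to Latin-1, which never fails
...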

How does this affect my addon?

Because Unicode can express all languages, WoW and WAR both want their addon files to be encoded in UTF-8.

Now this would be the end of the story, if there weren't so many brain-damaged (or simply outdated) editors out there that are not encoding-aware and will happily trash anything that goes beyond ASCII. Before the advent of UTF-8, this wasn't so bad; many files would simply look wrong if read with the wrong encoding, but in any half-sane editor, the damage was limited.

But with UTF-8, it's a different story: suppose you open a file encoded in UTF-8 in an editor which treats the contents as Latin-1. Now you cut&paste some text, perhaps containing umlauts, into it. What happens?

Most likely the editor will happily save the Latin-1 bytes into the file. Remember that not all byte sequences are valid UTF-8? The file is now most likely corrupted: the author has yet to see a UTF-8 (non-ASCII) byte sequence that makes sense in Latin-1 (in any language); conversely, hardly any Latin-1 text that makes sense is valid UTF-8!

To fix such corruption, you need to identify the offending bytes (see the next section). Then try to guess the encoding they are in (Windows-1252 is usually a good starting point), look up which characters they represented, and re-insert those characters with a UTF-8-aware editor.
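
For illustration, this is what such a repair looks like in Python 3 terms, assuming the stray bytes really are Windows-1252 (the example word is arbitrary):

>>> bad = b"Zur\xfcck"                      # FC is not valid UTF-8 in this position
>>> bad.decode("cp1252")                    # guess Windows-1252 to see what was meant
'Zurück'
>>> bad.decode("cp1252").encode("utf-8")    # the same text, re-encoded as valid UTF-8
b'Zur\xc3\xbcck'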

Other sources of badly encoded files are more obvious; for example, the author helped debug one case where the Lua sources were generated by a PHP script that simply used Latin-1 output.

Summary: Editing a UTF-8 encoded file in a non-UTF-aware editor will most likely leave it invalid.

Checking for valid UTF-8

This is not easily possible with tools that are provided with Windows. You can, however, install the Python programming language, open an interactive Python window and use the following commands:

>>> s = open(r"c:\path\to\file", "rb").read()
>>> s.decode("utf8")

This loads the entire file into RAM and attempts to decode it as UTF-8 (into your Python interpreter's internal Unicode representation). If the result is a long string containing the file contents, all is well. However, if the file is not valid, you will get a message along these lines:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.5/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 13-14: invalid data

Unless you have an editor that can jump to a certain byte position, you can slice a bit of the string to get some context:

>>> s[10:20]
'kc\x1c\xd6\x00\x82^\xff\xff\xff'

(In this case, the author just used a bzip2 compressed file to provoke the error.)
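
Note that these commands are written for Python 2, as the traceback above suggests. If you installed Python 3 instead, strings no longer have a decode method; a rough equivalent is to read the file in binary mode and decode the resulting bytes:

>>> s = open(r"c:\path\to\file", "rb").read()
>>> s.decode("utf-8")    # raises UnicodeDecodeError if the file is not valid UTF-8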

Closing remarks

Note that it is generally a bad idea to debug encoding problems over the internet. Pastebins are especially useless: the posting browser, the pastebin software and the viewing browser all have a chance to switch encodings, and they usually do. Mails are slightly better, but some MUAs are broken too. Similarly, IRC provides few guarantees, though many clients (with the notable exception of the widely used mIRC) now default to UTF-8.

If you must discuss such byte-level issues, the most reliable tool is a hex dump. OS X and Linux users can use the powerful xxd utility. On Windows, you can resort to Python (if you installed it for the last section) and use its own string representation, which encodes the problematic characters as in '\xAB'.
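
For example, a short Python snippet (again Python 3 syntax) can print a classic hex dump of the suspicious region; applied to the slice from the previous section, it would produce something like:

>>> data = open(r"c:\path\to\file", "rb").read()
>>> " ".join("%02X" % b for b in data[10:20])
'6B 63 1C D6 00 82 5E FF FF FF'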