UTF-8
UTF-8 (8-bit Unicode Transformation Format) is a lossless, variable-length character encoding for Unicode created by Ken Thompson and Rob Pike. It represents each Unicode character as a group of bytes, and so can encode the writing systems of many of the world's languages. UTF-8 is especially useful for transmission over 8-bit electronic mail systems.
It uses one to four bytes per character, depending on the code point. For example, only one UTF-8 byte is needed to encode each of the 128 US-ASCII characters in the Unicode range U+0000 to U+007F.
Four bytes may seem like a lot for one character (code point); however, four bytes are required only for code points outside the Basic Multilingual Plane (BMP), which are generally rare. Furthermore, UTF-16 (the main alternative to UTF-8) also needs four bytes for these code points. Whether UTF-8 or UTF-16 is more efficient depends on the range of code points being used. General-purpose compression such as DEFLATE significantly reduces the size differences between encoding schemes. For short items of text, where such algorithms do not perform well and size is important, the Standard Compression Scheme for Unicode could be considered instead.
The IETF (Internet Engineering Task Force) requires all Internet protocols to identify the encoding used for character data with UTF-8 as at least one supported encoding.
Description
UTF-8 is currently standardized as RFC 3629 (UTF-8, a transformation format of ISO 10646).
In summary, the bits of a Unicode code point are split into several groups, which are then placed in the lower bit positions of the UTF-8 bytes.
A character whose code point is below U+0080 is encoded with a single byte that contains its code point: these correspond exactly to the 128 characters of 7-bit ASCII.
In all other cases, two to four bytes are required. The uppermost bit of each of these bytes is 1, to prevent confusion with 7-bit ASCII characters, particularly those with code points lower than U+0020, traditionally called control characters (for example, carriage return).
Code range (hexadecimal) | UTF-16 (binary) | UTF-8 (binary) | Notes |
---|---|---|---|
000000 - 00007F | 00000000 0xxxxxxx (seven x) | 0xxxxxxx (seven x) | ASCII equivalence range; byte begins with zero |
000080 - 0007FF | 00000xxx xxxxxxxx (three x, eight x) | 110xxxxx 10xxxxxx (five x, six x) | first byte begins with 110, the following byte begins with 10 |
000800 - 00FFFF | xxxxxxxx xxxxxxxx (eight x, eight x) | 1110xxxx 10xxxxxx 10xxxxxx (four x, six x, six x) | first byte begins with 1110, the following bytes begin with 10 |
010000 - 10FFFF | 110110xx xxxxxxxx 110111xx xxxxxxxx (two x, eight x, two x, eight x) | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (three x, six x, six x, six x) | UTF-16 requires surrogates; an offset of 0x10000 is subtracted, so the bit pattern is not identical with UTF-8 |
For example, the character alef (א), which is Unicode U+05D0, is encoded into UTF-8 in this way:
- It falls into the range of U+0080 to U+07FF. The table shows it will be encoded using two bytes, 110xxxxx 10xxxxxx.
- Hexadecimal 0x05D0 is equivalent to binary 101-1101-0000.
- The eleven bits are placed, in order, into the positions marked by "x": 11010111 10010000.
- The final result is two bytes, more conveniently expressed in hexadecimal as 0xD7 0x90. That is the letter alef in UTF-8.
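The steps above generalize to all four ranges in the table. The following is a minimal sketch rather than a reference implementation; the function name `utf8_encode` is an assumption made for illustration, and the rejection of surrogate code points follows RFC 3629 rather than anything stated in the table.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point using the bit layout from the table.

    Illustrative sketch only; the name and error handling are assumptions.
    """
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point outside U+0000..U+10FFFF")
    if 0xD800 <= code_point <= 0xDFFF:
        # RFC 3629 forbids encoding UTF-16 surrogate code points.
        raise ValueError("surrogate code points are not encoded in UTF-8")
    if code_point < 0x80:
        # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# The alef example: U+05D0 becomes the two bytes 0xD7 0x90.
print(utf8_encode(0x05D0).hex())  # d790
```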
So the first 128 characters need one byte. The next 1920 characters need two bytes to encode. This includes Latin alphabet characters with diacritics, Greek, Cyrillic, Coptic, Armenian, Hebrew, and Arabic characters. The rest of the BMP characters use three bytes, and additional characters are encoded in four bytes.
By continuing the pattern given above it is possible to deal with much larger numbers. The original specification allowed for sequences of up to six bytes, covering the whole range U+0000 to U+7FFFFFFF (31 bits). However, in November 2003 RFC 3629 restricted UTF-8 to the range covered by the formal Unicode definition, U+0000 to U+10FFFF. Before this, only the bytes 0xFE and 0xFF did not occur in UTF-8 encoded text. After this limit was introduced, the number of byte values that never occur in a UTF-8 stream increased to 13: 0xC0, 0xC1, and 0xF5 to 0xFF.
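The count of 13 unused byte values can be checked by brute force. The sketch below relies on Python's built-in UTF-8 codec, which follows the RFC 3629 limits; the helper name `bytes_used_by_utf8` is an assumption made for illustration.

```python
def bytes_used_by_utf8() -> set:
    """Collect every byte value that appears in the UTF-8 encoding
    of some code point in U+0000..U+10FFFF (surrogates excluded)."""
    used = set()
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates have no UTF-8 encoding
        used.update(chr(cp).encode("utf-8"))
    return used


unused = sorted(set(range(256)) - bytes_used_by_utf8())
print(len(unused))                      # 13
print([f"0x{b:02X}" for b in unused])   # 0xC0, 0xC1, 0xF5 .. 0xFF
```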
Modified UTF-8
The Java programming language, which uses UTF-16 for its internal text representation, supports a non-standard modification of UTF-8 for string serialization. This encoding is called modified UTF-8 (http://java.sun.com/j2se/1.5.0/docs/api/java/io/DataInput.html#modified-utf-8).
There are two differences between modified and standard UTF-8. The first difference is that the null character (U+0000) is encoded with two bytes instead of one, specifically as 11000000 10000000. This ensures that there are no embedded nulls in the encoded string, perhaps to address the concern that if the encoded string is processed in a language such as C where a null byte signifies the end of a string, an embedded null would cause the string to be truncated.
The second difference is in the way characters outside the BMP are encoded. In standard UTF-8 these characters are encoded using the four-byte format above. In modified UTF-8 these characters are first represented as surrogate pairs (as in UTF-16), and then the surrogate pairs are encoded individually in sequence. The reason for this modification is more subtle. In Java a character is 16 bits long; therefore some Unicode characters require two Java characters in order to be represented. This aspect of the language predates the supplementary planes of Unicode; however, it is important for performance as well as backwards compatibility, and is unlikely to change. The modified encoding ensures that an encoded string can be decoded one Java character at a time, rather than one Unicode character at a time. Unfortunately, this also means that characters requiring four bytes in UTF-8 require six bytes in modified UTF-8.
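The two differences can be sketched as follows. This is an illustration of the rules described above, not Java's own implementation; the function name `modified_utf8_encode` is an assumption, and the sketch uses Python's `surrogatepass` error handler to emit each surrogate as a three-byte sequence.

```python
def modified_utf8_encode(text: str) -> bytes:
    """Encode text using the two modified UTF-8 rules (illustrative sketch)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            # Rule 1: U+0000 becomes 11000000 10000000, never a zero byte.
            out += b"\xc0\x80"
        elif cp >= 0x10000:
            # Rule 2: split into a UTF-16 surrogate pair, then encode
            # each surrogate separately as a three-byte sequence.
            offset = cp - 0x10000
            high = 0xD800 | (offset >> 10)
            low = 0xDC00 | (offset & 0x3FF)
            for surrogate in (high, low):
                out += chr(surrogate).encode("utf-8", "surrogatepass")
        else:
            # Everything else is encoded as in standard UTF-8.
            out += ch.encode("utf-8")
    return bytes(out)


print(modified_utf8_encode("A\x00B").hex())       # 41c08042
print(len("\U0001F600".encode("utf-8")))          # 4 bytes in standard UTF-8
print(len(modified_utf8_encode("\U0001F600")))    # 6 bytes in modified UTF-8
```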
Rationale behind UTF-8's mechanics
As a consequence of the exact mechanics of UTF-8, the following properties of multi-byte sequences hold:
- The most significant bit of a single-byte character is always 0.
- The most significant bits of the first byte of a multi-byte sequence determine the length of the sequence. These most significant bits are 110 for two-byte sequences, 1110 for three-byte sequences, and so on.
- The remaining bytes in a multi-byte sequence have 10 as their two most significant bits.
UTF-8 was designed to satisfy these properties in order to guarantee that no byte sequence of one character is contained within a longer byte sequence of another character. This ensures that byte-wise sub-string matching can be applied to search for words or phrases within a text; some older variable-length 8-bit encodings (such as Shift-JIS) did not have this property and thus made string-matching algorithms rather complicated. Although this property adds redundancy to UTF-8-encoded text, the advantages outweigh this concern; besides, data compression is not one of Unicode's aims and must be considered independently. This also means that if one or more complete bytes are lost due to error or corruption, one can resynchronize at the beginning of the next character and thus limit the damage.
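These properties translate into a few bit tests. The sketch below shows how a reader might classify bytes and resynchronize after data loss; the function names are assumptions made for illustration.

```python
def is_continuation(byte: int) -> bool:
    """True for bytes of the form 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def sequence_length(first_byte: int) -> int:
    """Length of the sequence announced by a leading byte."""
    if first_byte < 0x80:
        return 1
    if (first_byte & 0xE0) == 0xC0:
        return 2
    if (first_byte & 0xF0) == 0xE0:
        return 3
    if (first_byte & 0xF8) == 0xF0:
        return 4
    raise ValueError("not a leading byte")


def resynchronize(data: bytes, pos: int = 0) -> int:
    """Index of the first character boundary at or after pos."""
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos


text = "a\u05d0b".encode("utf-8")   # bytes 61 D7 90 62
damaged = text[2:]                  # the stream now starts mid-character: 90 62
start = resynchronize(damaged)      # skip the orphaned continuation byte
recovered = damaged[start:].decode("utf-8")
print(start, recovered)             # 1 b

# Byte-wise substring search finds the two-byte character directly.
print(text.find("\u05d0".encode("utf-8")))  # 1
```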
Overlong forms, invalid input, and security considerations
The exact response of a decoder on invalid input is largely undefined. There are several ways a decoder can behave in the event of invalid input:
- Insert a replacement character (e.g. '?', '�')
- Skip the character
- Interpret the character as being from another charset (often Latin-1)
- Fail to notice the problem and decode the bytes as if they were some similar valid UTF-8 sequence
- Report an error
Decoders may of course behave in different ways for different types of invalid input.
All possibilities have their advantages and disadvantages, but care must be taken to avoid security issues if validation is performed before conversion from UTF-8.
Overlong forms (where a character is encoded in more bytes than needed but still following the forms above) are one of the most troublesome types of data. The current RFC says they must not be decoded, but older specifications for UTF-8 only gave a warning, and many simpler decoders will happily decode them. Overlong forms have been used to bypass security validations in high-profile products, including Microsoft's IIS web server.
To maintain security in the case of invalid input there are two options. The first is to decode the UTF-8 before doing any input validation checks. The second is to use a strict decoder that, in the event of invalid input, returns either an error or text that the application considers to be harmless.
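To make the risk concrete, the sketch below contrasts a permissive two-byte decoder with a strict one. Both functions are hypothetical illustrations written for this example; the byte pair 0xC0 0xAF is an overlong form of "/" (U+002F), chosen as a typical character that a path check might look for.

```python
def naive_decode_two_bytes(b0: int, b1: int) -> int:
    """Permissive decoder: extracts the payload bits without any checks.
    Deliberately unsafe; shown only to illustrate the problem."""
    return ((b0 & 0x1F) << 6) | (b1 & 0x3F)


def strict_decode_two_bytes(b0: int, b1: int) -> int:
    """Strict decoder: rejects malformed and overlong two-byte sequences."""
    if (b0 & 0xE0) != 0xC0 or (b1 & 0xC0) != 0x80:
        raise ValueError("not a two-byte sequence")
    code_point = ((b0 & 0x1F) << 6) | (b1 & 0x3F)
    if code_point < 0x80:
        raise ValueError("overlong form: fits in a single byte")
    return code_point


raw = b"..\xc0\xaf.."

# A byte-level check performed before decoding does not see a slash...
print(b"/" in raw)                                # False
# ...but the permissive decoder turns 0xC0 0xAF into one.
print(chr(naive_decode_two_bytes(0xC0, 0xAF)))    # /

# Python's built-in decoder is strict and reports an error instead.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)
```

Either of the two options described above closes this gap: validating after decoding sees the slash, and a strict decoder never produces it.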
Advantages
- The most notable advantage of any Unicode Transformation Format over legacy encodings is that it can encode any character.
- Some characters (including the basic Latin alphabet) take as little as one byte in UTF-8, although others may take up to four. So UTF-8 will generally save space compared to UTF-16 or UTF-32 in text where 7-bit ASCII characters are common.
- A byte sequence for one character never occurs as part of a longer sequence for another character as it did in older variable-length encodings like Shift-JIS (see the previous section on this).
- The first byte of a multi-byte sequence is enough to determine the length of the multi-byte sequence (just count the number of leading set bits). This makes it extremely simple to extract a substring from a given string without elaborate parsing.
- Most existing computer software (including operating systems) was not written with Unicode in mind, and using Unicode with it might create some compatibility issues. For example, the C standard library marks the end of a string with the single-byte character 0x00 (see null-terminated string). In big-endian UTF-16 the Latin letter A is coded as the two bytes 0x00 0x41. The library will consider the first byte, 0x00, as the end of the string and will ignore anything after it. With UTF-8, ASCII-valued bytes (0x00 to 0x7F) appear only when representing an ASCII character. Therefore a system designed for ASCII that searches a UTF-8 string for a null byte will find one only where there is an actual null character.
- Sorting of UTF-8 strings using standard byte-oriented sorting routines will produce the same results as sorting them based on Unicode code points, but this is unlikely to be considered a culturally acceptable sort order in most cases.
- UTF-8 and UTF-16 are the standard encodings for XML documents. All other encodings must be specified explicitly either externally or through a text declaration. [1] (http://www.w3.org/TR/REC-xml/#charencoding)
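Two of the points above, the sort-order property and the absence of stray null bytes, can be checked with a short sketch. The sample strings are arbitrary and chosen only for illustration; the check relies on Python comparing `str` values by code point and `bytes` values byte by byte.

```python
words = [
    "zebra",
    "\u00e9clair",            # starts with a two-byte character
    "\u05d0\u05dc\u05e3",     # Hebrew, two bytes per character
    "\u65e5\u672c",           # CJK, three bytes per character
    "\U0001F600",             # outside the BMP, four bytes
    "apple",
]

by_code_point = sorted(words)
by_utf8_bytes = sorted(words, key=lambda w: w.encode("utf-8"))
print(by_code_point == by_utf8_bytes)   # True

# Big-endian UTF-16 puts a zero byte in front of every ASCII letter,
# which a null-terminated C string would treat as the end of the text.
utf16_a = "A".encode("utf-16-be")
print(utf16_a)                          # b'\x00A'

# UTF-8 contains a zero byte only where the text has U+0000.
utf8_text = "A\u05d0\u65e5".encode("utf-8")
print(b"\x00" in utf8_text)             # False
```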
Disadvantages
- UTF-8 is variable-length: different characters take sequences of different lengths to encode. The impact of this can be reduced by creating an abstract interface for working with UTF-8 strings, making the variable length transparent to the user. UTF-16 is also variable-length, though many people do not know this (or do not care about code points outside the BMP).
- A badly-written (and not compliant with current versions of the standard) UTF-8 parser could accept a number of different pseudo-UTF-8 representations and convert them to the same Unicode output. This provides a way for information to leak past validation routines designed to process data in its 8-bit representation.
- Ideographs in the BMP use three bytes in UTF-8, but only two in UTF-16, so Chinese, Japanese, and Korean text takes up more space when represented in UTF-8. This also applies to a few other, less well-known groups of code points.
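The size trade-off is easy to measure. The following sketch compares encoded lengths for a few sample strings; the helper name `encoded_sizes` and the samples are assumptions made for illustration, and the big-endian codec variants are used so that no byte order mark is counted.

```python
def encoded_sizes(text: str) -> dict:
    """Byte counts of the same text in three Unicode encodings."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-be")),
        "utf-32": len(text.encode("utf-32-be")),
    }


print(encoded_sizes("hello"))            # {'utf-8': 5, 'utf-16': 10, 'utf-32': 20}
print(encoded_sizes("\u65e5\u672c\u8a9e"))  # {'utf-8': 9, 'utf-16': 6, 'utf-32': 12}
print(encoded_sizes("\U0001F600"))       # {'utf-8': 4, 'utf-16': 4, 'utf-32': 4}
```

ASCII text is half the size in UTF-8, three BMP ideographs are half again as large, and a code point outside the BMP takes four bytes in every encoding.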
History
UTF-8 was invented by Ken Thompson on September 2, 1992, on a placemat in a New Jersey diner with Rob Pike. The day after, Pike and Thompson implemented it and updated their Plan 9 operating system to use it throughout.
UTF-8 was first officially presented at the USENIX conference in San Diego, January 25-29, 1993.
External links
- Rob Pike tells the story of UTF-8's creation (http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt)
- Original UTF-8 paper (http://www.cs.bell-labs.com/sys/doc/utf.pdf)
- RFC 3629, the UTF-8 standard
- RFC 2277, IETF policy on character sets and languages
- UTF-8 and Unicode FAQ (http://www.cl.cam.ac.uk/~mgk25/unicode.html)
- UTF-8 (http://www.utf-8.com/)
- a UTF-8 test page (http://www.ccss.de/slovo/testuni.htm)
- another UTF-8 test page (http://www.unics.uni-hannover.de/nhtcapri/multilingual1.html)
- UTF-8 and Debian (http://www.melkor.dnp.fmph.uniba.sk/~garabik/debian-utf8/HOWTO/howto.html) and Linux UTF-8 How-To (http://www.linux.org/docs/ldp/howto/Unicode-HOWTO.html)