UTF-32
UTF-32 is a method of encoding Unicode characters that uses a fixed 32 bits for each character. It can be regarded as the simplest possible encoding, as all other Unicode Transformation Formats use a variable number of code units per character. However, a notable drawback of UTF-32 is that it requires two to four times the storage space of traditional encodings. UTF-32 is generally less efficient in memory usage and memory bandwidth than UTF-16 or UTF-8. This is why it is rarely used for external storage and is mostly used internally, where character handling needs to be as simple as possible.
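The trade-off described above can be illustrated with a short Python sketch. It relies on Python's built-in `utf-8`, `utf-16-be` and `utf-32-be` codecs; the helper names `encoded_sizes` and `code_point_at` are hypothetical and chosen only for this illustration.

```python
def encoded_sizes(text):
    """Return the byte length of `text` in three Unicode encodings."""
    # The "-be" codecs are used so that no byte order mark is prepended.
    names = ("utf-8", "utf-16-be", "utf-32-be")
    return {name: len(text.encode(name)) for name in names}


def code_point_at(data, index):
    """Read one code point from big-endian UTF-32 bytes.

    Because every character is exactly 4 bytes wide, character `index`
    always occupies bytes 4*index .. 4*index + 3; no scanning is needed.
    """
    return int.from_bytes(data[4 * index : 4 * index + 4], "big")


if __name__ == "__main__":
    print(encoded_sizes("hello"))
    data = "a\u20ac\U0001F600".encode("utf-32-be")
    print(hex(code_point_at(data, 2)))
```

For plain ASCII text the UTF-32 form is four times the size of the UTF-8 form, while indexing stays a constant-time offset calculation, which is the simplicity the paragraph above refers to.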
UCS-4
The original ISO 10646 standard defines a 31-bit encoding form called UCS-4, in which each encoded character in the Universal Character Set (UCS) is represented by a code value that fits in 32 bits, in the code space of integers between 0 and hexadecimal 7FFFFFFF.
UCS-4 is sufficient to represent all of the Unicode code space, which has 1114112 (= 2^20 + 2^16) code points and therefore requires only values up to hexadecimal 10FFFF. Some people consider it wasteful to reserve such a large code space for mapping a relatively small set of code points, so a new encoding form, UTF-32, was proposed. UTF-32 is a subset of UCS-4 that uses 32-bit code values only in the 0 to 10FFFF code space.
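The arithmetic and the two code spaces can be checked with a minimal Python sketch. The constant and function names here are hypothetical, and the functions test only the numeric range described in this section, not any further validity rules a full UTF-32 validator might apply.

```python
# Upper bounds of the two code spaces, as given in the text.
UCS4_MAX = 0x7FFFFFFF   # 31-bit UCS-4 code space
UTF32_MAX = 0x10FFFF    # Unicode code space used by UTF-32

# 2^20 + 2^16 code points, i.e. 17 planes of 65536 code points each.
CODE_POINT_COUNT = 2**20 + 2**16


def in_ucs4_range(value):
    """True if `value` lies in the UCS-4 code space (0 .. 7FFFFFFF)."""
    return 0 <= value <= UCS4_MAX


def in_utf32_range(value):
    """True if `value` lies in the UTF-32 code space (0 .. 10FFFF)."""
    return 0 <= value <= UTF32_MAX


if __name__ == "__main__":
    print(CODE_POINT_COUNT)            # number of code points
    print(in_ucs4_range(0x110000))     # inside UCS-4
    print(in_utf32_range(0x110000))    # outside UTF-32
```

The value 0x110000 is the first integer that UCS-4 could represent but UTF-32 excludes, which shows the subset relationship concretely.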
UTF-32 and UCS-4
UTF-32 was originally a subset of the UCS-4 standard, but the Principles and Procedures document of JTC1/SC2/WG2 states that all future assignments of characters will be constrained to the BMP or the first 14 supplementary planes and has removed former provisions for private-use code positions in groups 60 to 7F and in planes E0 to FF.
Accordingly, UCS-4 and UTF-32 can now be taken to be identical, except that the UTF-32 standard has additional Unicode semantics that must be observed.
External links
- The Unicode Standard 4.1, chapter 3 (http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf) - formally defines UTF-32 in §3.10, D43-D45
- Unicode Standard Annex #19 (http://www.unicode.org/reports/tr19/tr19-9.html) - formally defined UTF-32 for Unicode 3.x (March 2001; last updated March 2002)
- Registration of new charsets: UTF-32, UTF-32BE, UTF-32LE (http://mail.apps.ietf.org/ietf/charsets/msg01095.html) - announcement of UTF-32 being added to the IANA charset registry (April 2002)