Text encoding
A text encoding is a method of representing text as binary values in computer storage. Since text is a sequence of characters, defining an encoding involves three decisions: the repertoire of characters to be represented, the mapping from this repertoire to numeric values, and the binary representation of those numeric values.
Earlier generations of encoding schemes, for example US-ASCII and EBCDIC, were simple: characters were identified by 7- or 8-bit numbers, and these numbers were stored directly, one per byte. Prior to the advent of Unicode, there were many encoding schemes, none of which could represent the whole spectrum of human-readable texts. The ISO 2022 standard provided a mechanism for mixing encoding schemes in a single text by means of escape sequences and "shift-in" and "shift-out" control codes; however, this was awkward for programmers and error-prone.
The World Wide Web Consortium specifies that all HTML, XML, and XHTML documents should be clearly labeled with a "charset" [1] (http://www.w3.org/International/O-charset.html)[2] (http://www.w3.org/TR/REC-html40/charset.html#h-5.2.2) to indicate which of the many text encodings the document uses. Because a document cannot switch encodings partway through, a multilingual document must use a charset (such as UTF-8) that can represent every language it contains.
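The constraint on multilingual documents can be illustrated with a minimal Python sketch. The sample string and the helper name `can_encode` are hypothetical, chosen only for illustration; the codec names are the ones Python's standard library uses.

```python
# Hypothetical sample: one document mixing English, Greek, and Chinese.
text = "English, Ελληνικά, 中文"


def can_encode(s: str, charset: str) -> bool:
    """Return True if every character of s is representable in charset."""
    try:
        s.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


# A single-script charset fails on the mixed text; UTF-8 covers all of it.
for charset in ("ascii", "iso-8859-7", "utf-8"):
    print(charset, can_encode(text, charset))
```

Here ISO-8859-7 can represent the Greek word on its own but not the Chinese characters, so only UTF-8 succeeds for the whole string.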
The Unicode standard introduced a level of indirection. Each Unicode character has a numeric value known as a "code point". For example, the code point for the ampersand (&) is 38 (hexadecimal 26), and that for the well-known Han character 中 is 20,013 (hexadecimal 4E2D). A code point is only an abstract integer, however, and there are a variety of methods for representing integers in computer storage. Unicode defines special Unicode Transformation Formats for this purpose, the best known of which are UTF-8 and UTF-16.
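The difference between a code point and its stored bytes can be seen in a short Python sketch. The helper name `describe` is hypothetical; `ord`, `str.encode`, and `bytes.hex` are standard Python (the separator argument to `hex` assumes Python 3.8 or later). UTF-16 is shown in big-endian byte order without a byte order mark, which is just one of its possible serializations.

```python
def describe(ch: str) -> dict:
    """Show one character's code point and two byte-level encodings."""
    return {
        "code_point": ord(ch),                        # abstract integer
        "hex": f"U+{ord(ch):04X}",                    # conventional notation
        "utf8": ch.encode("utf-8").hex(" "),          # 1 to 4 bytes
        "utf16_be": ch.encode("utf-16-be").hex(" "),  # 2 or 4 bytes
    }


for ch in "&中":
    print(ch, describe(ch))
```

The ampersand occupies one byte in UTF-8 (`26`) and two in UTF-16 (`00 26`), while 中 occupies three bytes in UTF-8 (`e4 b8 ad`) and two in UTF-16 (`4e 2d`): the same code point, different binary representations.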
When computer professionals discuss "encodings" (particularly in the context of XML), they are quite likely to be referring to this integer-to-binary mapping. Non-specialists often use the term for data formats ranging from ASCII to HTML and RTF.