Big5
- For other uses, see big five (disambiguation).
Big-5 or Big5 is a character encoding method used in Taiwan (Republic of China) and Hong Kong for Traditional Chinese characters. Its counterpart in Mainland China, where Simplified Chinese characters are used, is GB.
Name
Big5's Chinese name, 五大碼 (pinyin: wǔdà mǎ), means "Big Five Encoding". The name refers either to the original design goal of supporting the five major software packages used in Taiwan at the time, or to the five leading computer companies in Taiwan that collaborated to develop the code: 宏碁 (hóng qí; Acer [1] (http://www.acer.com.tw)), 神通 (shén tōng; MiTAC [2] (http://www.mitac.com.tw)), 佳佳 (jiā jiā; ?), 零壹 (líng yī; Zero One [3] (http://www.zerone.com.tw)), and 大眾 (dà zhòng; FIC [4] (http://www.fic.com.tw)).
The English name of the encoding, "Big5", was subsequently (mistakenly) translated back to Chinese from English as 大五碼 (dàwǔ mǎ). Both Chinese names are now in use.
Organization
The original Big5 character set is sorted first by usage frequency, then by stroke count, and finally by Kangxi radical.
The original Big5 character set lacked many commonly used characters. To solve this problem, each vendor developed its own extension. The ETen extension became part of the current Big5 standard through popularity.
The structure of Big5 does not conform to the ISO 2022 standard, but rather bears a certain similarity to the Shift JIS encoding. It is a double-byte character set (DBCS) with the following structure:
First byte ("lead byte") | 0xa1 to 0xfe |
Second byte | 0x40 to 0x7e, 0xa1 to 0xfe |
Certain variants of the Big5 character set, for example HKSCS, use an expanded range for the lead byte, including values in the 0x80 to 0xa0 range (similar to Shift JIS).
If the second byte is not in the correct range, behaviour is undefined (i.e., varies from system to system).
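The byte ranges above can be checked mechanically. The following is a minimal sketch in Python covering only the original Big5 ranges (not the expanded lead-byte range used by variants such as HKSCS); the function names are hypothetical and chosen for illustration.

```python
def is_big5_lead(b: int) -> bool:
    """True if b lies in the original Big5 lead-byte range 0xa1..0xfe."""
    return 0xA1 <= b <= 0xFE


def is_big5_trail(b: int) -> bool:
    """True if b lies in one of the two second-byte ranges."""
    return 0x40 <= b <= 0x7E or 0xA1 <= b <= 0xFE


def is_big5_pair(lead: int, trail: int) -> bool:
    """True if (lead, trail) is structurally a Big5 double-byte code.

    This checks structure only; it does not say whether a character
    is actually assigned to that code.
    """
    return is_big5_lead(lead) and is_big5_trail(trail)
```

For instance, `is_big5_pair(0xA1, 0x40)` holds, while `is_big5_pair(0xA1, 0x7F)` does not, because 0x7f falls in the gap between the two second-byte ranges.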
The numerical values of individual Big5 codes are frequently given as 4-digit hexadecimal numbers, each describing the two bytes that make up the code as if they were the big-endian representation of a 16-bit number. For example, the Big5 code for a full-width space, which is the byte pair 0xa1 0x40, is usually written as 0xa140 or just A140.
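As an illustration of this notation, the sketch below converts between a two-byte sequence and its 4-digit hexadecimal form. The helper names `big5_code` and `big5_bytes` are hypothetical; the final function relies on Python's built-in `big5` codec, which is one implementation among several and may differ from other systems in a few mappings.

```python
def big5_code(pair: bytes) -> str:
    """Return the 4-digit hexadecimal notation for a two-byte Big5 code."""
    if len(pair) != 2:
        raise ValueError("a Big5 code is exactly two bytes")
    # Read the two bytes as a big-endian 16-bit number.
    return format(int.from_bytes(pair, "big"), "04X")


def big5_bytes(code: str) -> bytes:
    """Return the two bytes for a code written as 'A140' or '0xa140'."""
    return int(code, 16).to_bytes(2, "big")


def full_width_space() -> str:
    """Decode A140 with Python's built-in big5 codec."""
    return big5_bytes("A140").decode("big5")
```

Here `big5_code(b"\xa1\x40")` gives `"A140"`, and decoding those bytes with Python's codec yields U+3000 (ideographic space).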
Strictly speaking, the Big5 encoding contains only DBCS characters. However, in practice, the Big5 codes are always used together with ASCII (or some other 8-bit character set, such as code page 437 in early DOS-based Chinese systems), so that you will find a mix of DBCS and ASCII in Big5-encoded text. Bytes in the range 0x00 to 0x7f that are not part of a double-byte character are assumed to be ASCII.
The meaning of non-ASCII bytes outside the permitted values that are not part of a double-byte character varies from system to system. In old MS-DOS-based systems, they are likely to be displayed as 8-bit characters; in modern systems, they are likely to either give unpredictable results or generate an error.
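The mixing of ASCII and double-byte codes described above can be sketched as a left-to-right scan. This is a simplified illustration for the original Big5 ranges only; the function name `split_big5` and the labels `"ascii"`, `"dbcs"` and `"invalid"` are assumptions made for this example, and real systems differ in how they treat stray bytes.

```python
from typing import Iterator, Tuple


def split_big5(data: bytes) -> Iterator[Tuple[str, bytes]]:
    """Split Big5-encoded bytes into ASCII, double-byte, and stray units."""
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        if b <= 0x7F:
            # Not inside a double-byte character: treat as ASCII.
            yield ("ascii", data[i:i + 1])
            i += 1
        elif (0xA1 <= b <= 0xFE and i + 1 < n
              and (0x40 <= data[i + 1] <= 0x7E
                   or 0xA1 <= data[i + 1] <= 0xFE)):
            # Lead byte followed by a valid second byte.
            yield ("dbcs", data[i:i + 2])
            i += 2
        else:
            # Stray non-ASCII byte; its meaning is system-dependent.
            yield ("invalid", data[i:i + 1])
            i += 1
```

Because a second byte may fall in the 0x40 to 0x7e range, which overlaps ASCII, the scan has to start from a known character boundary; a byte such as 0x40 cannot be classified in isolation.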
A more detailed look at the organization
In the original Big5, the encoding is compartmentalized into different zones:
0xa140 to 0xa3bf | "Graphical characters" 圖形碼 |
0xa3c0 to 0xa3fe | Reserved for user-defined characters 造字 |
0xa440 to 0xc67e | Frequently used characters 常用字 |
0xc6a1 to 0xc8fe | Reserved for user-defined characters |
0xc940 to 0xf9d5 | Less frequently used characters 次常用字 |
0xf9d6 to 0xfefe | Reserved for user-defined characters |
The "graphical characters" actually comprise punctuation marks, partial punctuation marks (e.g., half of a dash, half of an ellipsis; see below), dingbats, foreign characters, and other special characters (e.g., presentational "full width" forms, digits for Suzhou numerals, zhuyin fuhao, etc.)
In most vendor extensions, extended characters are placed in the various zones reserved for user-defined characters, each of which is normally regarded as associated with the preceding zone. For example, additional "graphical characters" (e.g., punctuation marks) would be expected to be placed in the 0xa3c0–0xa3fe range, and additional ideograms would be placed in either the 0xc6a1–0xc8fe or the 0xf9d6–0xfefe range. Sometimes, this is not possible due to the large number of extended characters to be added; for example, Cyrillic letters and Japanese kana have been placed in the zone associated with "frequently-used characters".
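The zone layout of the original Big5 can be expressed as a small lookup. The sketch below is illustrative only: the `ZONES` list and the function `big5_zone` are hypothetical names, the zone labels are paraphrased from the table above, and vendor extensions are not modelled.

```python
from typing import Optional

# (first code, last code, zone name), taken from the original Big5 layout.
ZONES = [
    (0xA140, 0xA3BF, "graphical characters"),
    (0xA3C0, 0xA3FE, "user-defined (reserved)"),
    (0xA440, 0xC67E, "frequently used characters"),
    (0xC6A1, 0xC8FE, "user-defined (reserved)"),
    (0xC940, 0xF9D5, "less frequently used characters"),
    (0xF9D6, 0xFEFE, "user-defined (reserved)"),
]


def big5_zone(code: int) -> Optional[str]:
    """Return the zone name for a 16-bit Big5 code, or None if invalid."""
    lead, trail = code >> 8, code & 0xFF
    # A numeric range test alone would accept codes such as 0xa17f,
    # whose second byte is not permitted, so check the bytes first.
    if not 0xA1 <= lead <= 0xFE:
        return None
    if not (0x40 <= trail <= 0x7E or 0xA1 <= trail <= 0xFE):
        return None
    for first, last, name in ZONES:
        if first <= code <= last:
            return name
    return None
```

Under this sketch, a vendor code such as 0xc6a1 is reported as belonging to a reserved zone, which by the convention described above would be associated with the preceding "frequently used characters" zone.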
What a Big5 code actually encodes
Contrary to popular belief, an individual Big5 code does not always represent a complete semantic unit. The Big5 codes of ideograms are always ideograms, but codes in the "graphical characters" section are not always complete "graphical characters". What Big5 encodes are particular graphical representations of characters or parts of characters that happen to fit in the space taken by two monospaced ASCII characters. This is a property of double-byte character sets and is not a unique problem of Big5.
To illustrate this point, consider the Big5 code 0xa14b (…). To English speakers this looks like an ellipsis and the Unicode standard identifies it as such; however, in Chinese, the ellipsis consists of six dots that fit in the space of two Chinese characters (……), so in fact there is no Big5 code for the Chinese ellipsis, and the Big5 code 0xa14b just represents half of a Chinese ellipsis. It represents only half of an ellipsis because the whole ellipsis should take the space of two Chinese characters, and in many DBCS systems one DBCS character must take exactly the space of one Chinese character.
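The ellipsis case can be made concrete with a short sketch. It assumes Python's built-in `big5` codec, which maps 0xa14b to U+2026 (HORIZONTAL ELLIPSIS); other implementations may map a few codes differently. The function name `chinese_ellipsis_big5` is hypothetical.

```python
HALF_ELLIPSIS = b"\xa1\x4b"  # Big5 code 0xa14b: three dots, one cell wide


def chinese_ellipsis_big5() -> bytes:
    """Build a full six-dot Chinese ellipsis from two half-ellipsis codes."""
    return HALF_ELLIPSIS * 2


# The full ellipsis occupies two Big5 codes (four bytes) and decodes to
# two Unicode characters rather than a single one.
decoded = chinese_ellipsis_big5().decode("big5")
```

Any text-processing step that treats one Big5 code as one punctuation mark would therefore count a Chinese ellipsis as two units.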
Characters encoded in Big5 do not always represent things that can be readily used in plain text files. One example is the "citation mark" (0xa1ca, ﹋), which, when used, must be typeset under the title of a literary work. Another example is the Suzhou numerals, a form of scientific notation that requires the number to be laid out in a 2-D form consisting of at least two rows.
History
The Big5 encoding was defined by the Institute for Information Industry of Taiwan in 1984. According to some accounts, Big5 was popularized by its adoption in several commercial software packages, especially the ET Chinese system which ran on MS-DOS.
The Republic of China government declared it their standard in the mid-1980s, since Big5 was already the de facto standard by that time.
Hong Kong also adopted Big5 for character encoding. However, Cantonese uses many archaic Chinese characters that were not available in the normal Big5 character set. To solve this problem, the Hong Kong Government created two Big5 extensions: the "Government Chinese Character Set" in 1995 and the "Hong Kong Supplementary Character Set" in 1999. The Hong Kong extensions are commonly distributed as a patch.
External links
- Chinese character codes: an update (http://kura.hanazono.ac.jp/paper/codes.html) by Christian Wittern
- CNS 11643 official web site (http://www.cns11643.gov.tw) has information about the Big5e character set (an extended version of Big5) in the "Chinese Information Code" section
- Hong Kong Supplementary Character Set Info (http://www.info.gov.hk/digital21/eng/hkscs/)
References
- Lunde, Ken (1999). CJKV Information Processing. First Edition. O'Reilly and Associates, Inc. ISBN 1565922247.