Byte Order Mark
A Byte Order Mark (BOM) is the character at code point U+FEFF (ZERO WIDTH NO-BREAK SPACE) when that character is used to denote the endianness of a string of UCS/Unicode characters encoded in UTF-16 or UTF-32.
A BOM can also be used to indicate the encoding of unlabeled text in many Unicode encodings. In most encodings the BOM is a byte sequence that is unlikely to be seen in more conventional encodings or in other Unicode encodings (it usually looks like a sequence of obscure control codes). If a BOM is misinterpreted as an actual character within the text, it will generally be invisible because it is a ZERO WIDTH NO-BREAK SPACE. The "zero width no-break space" function of the U+FEFF character was deprecated in Unicode 3.2, allowing the character to be used solely with the semantics of a BOM.
In UTF-16, a BOM is expressed as the two-byte sequence FE FF at the beginning of the encoded string to indicate that the encoded characters that follow it use big-endian byte order, or as the byte sequence FF FE to indicate little-endian order.
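As a minimal sketch of the UTF-16 case described above, the following Python snippet inspects the first two bytes and chooses the byte order accordingly. The function name `decode_utf16` and the big-endian fallback for input without a BOM are assumptions made for illustration only.

```python
import codecs


def decode_utf16(data: bytes) -> str:
    """Decode UTF-16 bytes, using a leading BOM (if any) to pick the byte order."""
    if data.startswith(codecs.BOM_UTF16_BE):    # FE FF: big-endian follows
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):    # FF FE: little-endian follows
        return data[2:].decode("utf-16-le")
    # Assumption for this sketch: treat BOM-less input as big-endian.
    return data.decode("utf-16-be")


# U+FEFF serialised in each byte order yields the two BOM byte sequences.
bom_be = "\ufeff".encode("utf-16-be")
bom_le = "\ufeff".encode("utf-16-le")
```

Python's built-in `utf-16` codec performs the same BOM check on its own; the explicit version is shown only to make the byte comparison visible.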
Whilst UTF-8 does not have byte order issues, a BOM encoded in UTF-8 may be used to mark text as UTF-8. Quite a lot of Windows software adds one to UTF-8 files. However, on Unix-like systems (which make heavy use of text files for configuration) this practice is not recommended, because it interferes with the correct processing of significant leading codes such as the hash-bang (#!) at the start of a file. The UTF-8 representation of the BOM is the byte sequence EF BB BF.
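The hash-bang problem can be illustrated with a short Python sketch. It shows that a file beginning with EF BB BF no longer starts with the bytes `#!`, and how the BOM might be removed before further processing. The helper name `strip_utf8_bom` and the sample script contents are hypothetical.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# A hypothetical shell script saved by an editor that prepends a UTF-8 BOM.
script_with_bom = codecs.BOM_UTF8 + b"#!/bin/sh\necho hi\n"

# The first two bytes are now EF BB rather than "#!".
starts_with_hashbang = script_with_bom.startswith(b"#!")

# After stripping, the hash-bang is back at offset 0.
cleaned = strip_utf8_bom(script_with_bom)
```

When decoding to text rather than working with raw bytes, Python's `utf-8-sig` codec drops a leading BOM if one is present and otherwise behaves like `utf-8`.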
A BOM can also be used with UTF-32, although this encoding is almost never used for transmission.
Representations of Byte Order Marks by Encoding
- UTF-8: EF BB BF
- UTF-16 Big Endian: FE FF
- UTF-16 Little Endian: FF FE
- UTF-32 Big Endian: 00 00 FE FF
- UTF-32 Little Endian: FF FE 00 00
- SCSU: 0E FE FF
- UTF-7: 2B 2F 76 and one of the following byte sequences [ 38 | 39 | 2B | 2F | 38 2D ]
- UTF-EBCDIC: DD 73 66 73
- BOCU-1: FB EE 28
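The byte sequences listed above can be turned into a simple encoding sniffer. The Python sketch below is one possible arrangement, not a definitive implementation: the names `BOM_TABLE` and `sniff_bom` are hypothetical, and longer signatures are checked first because the UTF-32 little-endian BOM (FF FE 00 00) begins with the UTF-16 little-endian BOM (FF FE). That ordering is a heuristic, since FF FE 00 00 could also be UTF-16 little-endian text starting with a NUL character.

```python
# Signatures from the list above, longest first so that UTF-32 LE
# is not mistaken for UTF-16 LE.
BOM_TABLE = [
    (b"\x00\x00\xfe\xff", "UTF-32 Big Endian"),
    (b"\xff\xfe\x00\x00", "UTF-32 Little Endian"),
    (b"\xdd\x73\x66\x73", "UTF-EBCDIC"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\x0e\xfe\xff", "SCSU"),
    (b"\xfb\xee\x28", "BOCU-1"),
    (b"\xfe\xff", "UTF-16 Big Endian"),
    (b"\xff\xfe", "UTF-16 Little Endian"),
]

# UTF-7: 2B 2F 76 ("+/v") followed by 38, 39, 2B or 2F ("8", "9", "+", "/").
UTF7_PREFIX = b"\x2b\x2f\x76"
UTF7_FOURTH_BYTES = b"\x38\x39\x2b\x2f"


def sniff_bom(data: bytes):
    """Return the encoding name suggested by a leading BOM, or None."""
    for signature, name in BOM_TABLE:
        if data.startswith(signature):
            return name
    if (data.startswith(UTF7_PREFIX) and len(data) >= 4
            and data[3] in UTF7_FOURTH_BYTES):
        return "UTF-7"
    return None
```

A result of `None` only means that no BOM was found; it says nothing about the actual encoding of the data.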
External links
- The Unicode Standard, chapter 13 (PDF) (http://www.unicode.org/unicode/uni2book/ch13.pdf) (see 13.6 - Specials)
- FAQ - UTF and BOM (http://www.unicode.org/unicode/faq/utf_bom.html)