Re: 15 elementary truths about XML
Michael Kay scripsit:

> This raises the interesting if somewhat academic question of what XML
> would look like on a machine architecture using bytes or characters of
> a length other than 8 bits.

On the DEC PDP-10, words are 36 bits, but bytes can be any size from 1 to 36 bits. Bytes are always stored in big-endian order. The standard representation of ASCII used 7-bit bytes, five per word with one bit of wastage. Some kinds of text, like filenames, were stored in six 6-bit bytes by folding ASCII lower case to upper case and chopping off the high-order bit.

To bring the PDP-10 into the Unicode age, Mark Crispin designed two new Unicode encodings suited to its architecture. In brief, UTF-9 stores each successive octet of a Unicode scalar value in the 8 low-order bits of one to three nonets, using big-endian ordering. The top bit is 0 in the final nonet and 1 in non-final nonets. UTF-18 stores the low-order 16 bits of a Unicode scalar value in the low-order 16 bits of an 18-bit unit, and uses the top two bits to encode Plane 0, Plane 1, Plane 2, or Plane 14, the other planes being unrepresentable in this encoding. See RFC 4042 for details.

Ken Thompson once said that the reason Unix was never ported to the PDP-10 was that there are no 9-bit magtapes.

> As far as I can see, it would be entirely conformant to use an
> encoding in which each Unicode character is mapped to a sequence of
> one or more 13-bit bytes. The only slight problem is that an XML
> parser that understands this encoding would not be conformant unless
> it also understood UTF-8 and UTF-16; and it's not entirely clear to me
> how UTF-8 and UTF-16 would look when stored on a machine with a 13-bit
> byte length.

I agree, although on such a machine it would probably be best to just stick to octets and waste the other 5 bits. That's essentially what the RFC recommends when you must use UTF-8 or UTF-16 on non-8-bit architectures.
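For concreteness, the UTF-9 and UTF-18 rules summarized above can be sketched in a few lines of Python. This is only an illustration built from the description in this message, not the RFC's reference code; the function names utf9_encode and utf18_encode are made up for the example, and RFC 4042 remains the authority on edge cases.

```python
def utf9_encode(cp: int) -> list[int]:
    """Encode one Unicode scalar value as a list of 9-bit nonets (as ints)."""
    if not (0 <= cp <= 0x10FFFF) or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    # Split into octets, most significant first, dropping leading zero octets.
    # U+0000 would leave nothing, so it is encoded as a single zero nonet.
    octets = list(cp.to_bytes(3, "big").lstrip(b"\x00")) or [0]
    # Top bit (0x100) is 1 in every non-final nonet and 0 in the final one.
    return [o | 0x100 for o in octets[:-1]] + [octets[-1]]


# Planes representable in UTF-18, mapped to the value of the top two bits.
_UTF18_PLANES = {0: 0, 1: 1, 2: 2, 14: 3}


def utf18_encode(cp: int) -> int:
    """Encode one Unicode scalar value as a single 18-bit integer."""
    plane, low16 = cp >> 16, cp & 0xFFFF
    if plane not in _UTF18_PLANES or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not representable in UTF-18")
    # Low 16 bits are copied unchanged; the top two bits select the plane.
    return (_UTF18_PLANES[plane] << 16) | low16


if __name__ == "__main__":
    # Print nonets in octal, the customary notation on 36-bit machines.
    print([oct(n) for n in utf9_encode(0x20AC)])  # EURO SIGN, two nonets
    print(oct(utf18_encode(0xE0041)))             # a Plane 14 tag character
```

Four 9-bit nonets or two 18-bit units fill a 36-bit word exactly, which is what makes these sizes a natural fit for the PDP-10.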
-- 
How they ever reached any conclusion at all     <cowan@ccil.org>
is starkly unknowable to the human mind.        http://www.ccil.org/~cowan
        --"Backstage Lensman", Randall Garrett