Re: 15 elementary truths about XML

From: John Cowan <cowan@mercury.ccil.org>
To: Michael Kay <mike@saxonica.com>
Date: Mon, 31 Oct 2011 15:45:55 -0400

Michael Kay scripsit:

> This raises the interesting if somewhat academic question of what XML
> would look like on a machine architecture using bytes or characters of
> a length other than 8 bits.

On the DEC PDP-10, words are 36 bits, but bytes can be any size from 1
to 36 bits.  Bytes are always stored in big-endian order.  The standard
representation of ASCII used 7-bit bytes, five per word with one bit
of wastage.  Some kinds of text, like filenames, were stored in six
6-bit bytes by folding ASCII lower case to upper case and chopping off
the high-order bit.
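
As a rough illustration of the 7-bit packing described above, here is a minimal sketch that models a 36-bit word as a Python integer. The function name `pack_ascii7` is made up for this example, and placing the unused bit at the low-order end of the word, along with padding a short final word with NULs, are assumptions rather than something stated in this message.

```python
WORD_BITS = 36
CHARS_PER_WORD = 5   # 5 * 7 = 35 bits used, 1 bit left over


def pack_ascii7(text):
    """Pack ASCII text into 36-bit words, five 7-bit bytes per word.

    Bytes are laid out big-endian (first character in the highest bits).
    Assumption: the spare bit is the low-order bit of each word.
    """
    codes = text.encode("ascii")          # raises on non-ASCII input
    words = []
    for i in range(0, len(codes), CHARS_PER_WORD):
        chunk = codes[i:i + CHARS_PER_WORD].ljust(CHARS_PER_WORD, b"\0")
        word = 0
        for c in chunk:
            word = (word << 7) | c        # append the next 7-bit byte
        words.append(word << 1)           # leave the low-order bit unused
    return words


def unpack_ascii7(words):
    """Inverse of pack_ascii7; trailing NUL padding is stripped."""
    chars = []
    for word in words:
        word >>= 1                        # drop the unused bit
        for shift in range(28, -1, -7):   # highest 7-bit byte first
            chars.append((word >> shift) & 0x7F)
    return bytes(chars).rstrip(b"\0").decode("ascii")
```

For example, `pack_ascii7("HELLO!")` yields two words, the second holding `!` followed by four NUL bytes.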

To bring the PDP-10 into the Unicode age, Mark Crispin designed two new
Unicode encodings suited to its architecture.  In brief, UTF-9 stores
each successive octet of a Unicode scalar value in the 8 low-order bits
of one to three nonets, using big-endian ordering.  The top bit is 0 in
the final nonet and 1 in non-final nonets.  UTF-18 stores the low-order
16 bits of a Unicode scalar value in the low-order 16 bits of an 18-bit
value, and uses the top two bits to encode Plane 0, Plane 1, Plane 2,
or Plane 14, the other planes being unrepresentable in this encoding.
See RFC 4042 for details.
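
The two encodings as summarized above can be sketched as follows. This is only an illustration written from the description in this message, not the RFC's reference code; the function names are invented here, nonets and 18-bit values are modeled as plain Python integers, and the choice to map Plane 14 to the top-two-bits value 3 is my reading of RFC 4042.

```python
def utf9_encode(cp):
    """Encode one Unicode scalar value as a list of 9-bit nonets."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    # Split the value into its significant octets, big-endian.
    n_octets = max(1, (cp.bit_length() + 7) // 8)
    octets = cp.to_bytes(n_octets, "big")
    # Top (ninth) bit is 1 on every nonet except the last.
    return [0x100 | b for b in octets[:-1]] + [octets[-1]]


def utf18_encode(cp):
    """Encode one Unicode scalar value as a single 18-bit value."""
    plane = cp >> 16
    if plane in (0, 1, 2):
        return cp                        # planes 0-2 map to themselves
    if plane == 14:
        return 0x30000 | (cp & 0xFFFF)   # assumed: plane 14 uses top bits 11
    raise ValueError("plane not representable in UTF-18")


# U+0391 (Greek capital alpha) takes two nonets: octal 403, 221.
print([oct(n) for n in utf9_encode(0x0391)])
```

Note that ASCII characters come out as a single nonet equal to their code point, so plain ASCII text is unchanged apart from the wider byte.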

Ken Thompson once said that the reason Unix was never ported to the
PDP-10 was that there are no 9-bit magtapes.

> As far as I can see, it would be entirely conformant to use an
> encoding in which each Unicode character is mapped to a sequence of
> one or more 13-bit bytes. The only slight problem is that an XML
> parser that understands this encoding would not be conformant unless
> it also understood UTF-8 and UTF-16; and it's not entirely clear to me
> how UTF-8 and UTF-16 would look when stored on a machine with a 13-bit
> byte length.

I agree, although on such a machine it would probably be best to just
stick to octets and waste the other 5 bits.  That's essentially what the
RFC recommends when you must use UTF-8 or UTF-16 on non-8-bit architectures.
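
To make the "stick to octets and waste the other 5 bits" idea concrete, here is a small sketch that models a 13-bit byte as a Python integer and stores one UTF-8 octet in each. The names `BYTE_BITS` and `utf8_in_wide_bytes` are hypothetical, and putting the octet in the low-order bits is an assumption.

```python
BYTE_BITS = 13   # hypothetical machine byte width


def utf8_in_wide_bytes(text):
    """Store each UTF-8 octet in the low 8 bits of its own 13-bit byte.

    The upper BYTE_BITS - 8 bits of every byte are left as zero.
    """
    return [octet for octet in text.encode("utf-8")]


def wasted_fraction():
    """Fraction of storage left unused by this layout."""
    return (BYTE_BITS - 8) / BYTE_BITS


cells = utf8_in_wide_bytes("é")
print(cells, f"{wasted_fraction():.0%} of each byte unused")
```

The octet sequence is exactly what an 8-bit machine would produce, so a parser's UTF-8 decoding logic would not need to change; only the storage width differs.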

-- 
How they ever reached any  conclusion at all    <cowan@ccil.org>
is starkly unknowable to the human mind.        http://www.ccil.org/~cowan
        --"Backstage Lensman", Randall Garrett

References:
- 15 elementary truths about XML
  - From: "Costello, Roger L." <costello@mitre.org>
- Re: 15 elementary truths about XML
  - From: Michael Kay <mike@saxonica.com>
