
Re: 15 elementary truths about XML

  • From: John Cowan <cowan@mercury.ccil.org>
  • To: Michael Kay <mike@saxonica.com>
  • Date: Mon, 31 Oct 2011 15:45:55 -0400

Michael Kay scripsit:

> This raises the interesting if somewhat academic question of what XML
> would look like on a machine architecture using bytes or characters of
> a length other than 8 bits.

On the DEC PDP-10, words are 36 bits, but bytes can be any size from 1
to 36 bits.  Bytes are always stored in big-endian order.  The standard
representation of ASCII used 7-bit bytes, five per word with one bit
of wastage.  Some kinds of text, like filenames, were stored in six
6-bit bytes by folding ASCII lower case to upper case and chopping off
the high-order bit.
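For illustration, here is a minimal Python sketch of those two word layouts.  The function names are hypothetical.  I am assuming the usual left-justified layout in which the unused bit of a 7-bit ASCII word is the low-order one, and that the 6-bit form is DEC SIXBIT (upper-cased ASCII code minus octal 40), which is a slightly more specific rule than the description above; treat both as assumptions rather than a definitive account.

```python
WORD_BITS = 36  # PDP-10 word size


def pack_ascii7(text):
    """Pack up to five 7-bit ASCII characters into one 36-bit word.

    Assumption: characters are left-justified and the low-order bit
    of the word is the one left unused.
    """
    if len(text) > 5:
        raise ValueError("at most five 7-bit characters fit in one word")
    word = 0
    for ch in text.ljust(5, "\0"):
        code = ord(ch)
        if code > 0x7F:
            raise ValueError("not a 7-bit ASCII character: %r" % ch)
        word = (word << 7) | code
    return word << 1  # 5 * 7 = 35 bits used, 1 bit wasted


def unpack_ascii7(word):
    """Inverse of pack_ascii7; trailing NUL padding is dropped."""
    word >>= 1
    codes = [(word >> shift) & 0x7F for shift in (28, 21, 14, 7, 0)]
    return "".join(chr(c) for c in codes).rstrip("\0")


def pack_sixbit(text):
    """Pack up to six characters into one word as 6-bit codes.

    Assumption: DEC SIXBIT, i.e. fold to upper case and subtract
    octal 40, so only ASCII 040..0137 is representable.
    """
    text = text.upper()
    if len(text) > 6:
        raise ValueError("at most six 6-bit characters fit in one word")
    word = 0
    for ch in text.ljust(6):  # pad with spaces, which encode as 0
        code = ord(ch) - 0o40
        if not 0 <= code < 64:
            raise ValueError("not representable in six bits: %r" % ch)
        word = (word << 6) | code
    return word


def unpack_sixbit(word):
    """Inverse of pack_sixbit; always returns six characters."""
    shifts = (30, 24, 18, 12, 6, 0)
    return "".join(chr(((word >> s) & 0o77) + 0o40) for s in shifts)
```

Lower case does not survive a round trip through the 6-bit form, which is why it suited things like filenames rather than general text.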

To bring the PDP-10 into the Unicode age, Mark Crispin designed two new
Unicode encodings suited to its architecture.  In brief, UTF-9 stores
each successive octet of a Unicode scalar value in the 8 low-order bits
of one to three nonets, using big-endian ordering.  The top bit is 0 in
the final nonet and 1 in non-final nonets.  UTF-18 stores the low-order
16 bits of a Unicode scalar value in the low-order 16 bits of an 18-bit
halfword, and uses the top two bits to encode Plane 0, Plane 1, Plane 2,
or Plane 14, the other planes being unrepresentable in this encoding.
See RFC 4042 for details.
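The two encodings just described can be sketched in Python roughly as follows.  This is written from the prose summary above, not copied from the RFC, and the function names are hypothetical.  Nonets and 18-bit values are represented as plain Python integers.  The mapping of Plane 14 onto the top-bits value 3 is my reading of "Plane 0, Plane 1, Plane 2, or Plane 14" in order, so check it against RFC 4042 before relying on it.

```python
def _check_scalar(cp):
    """Reject anything that is not a Unicode scalar value."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value: %r" % (cp,))


def utf9_encode(cp):
    """Encode one scalar value as a list of one to three nonets."""
    _check_scalar(cp)
    n_octets = max(1, (cp.bit_length() + 7) // 8)
    octets = cp.to_bytes(n_octets, "big")
    # Top bit is 1 in every non-final nonet and 0 in the final one.
    return [0x100 | o for o in octets[:-1]] + [octets[-1]]


def utf9_decode(nonets):
    """Decode a sequence of nonets into a list of scalar values."""
    out = []
    cp = 0
    pending = False
    for nonet in nonets:
        cp = (cp << 8) | (nonet & 0xFF)
        if nonet & 0x100:      # continuation: more octets follow
            pending = True
        else:                  # final nonet of this character
            _check_scalar(cp)
            out.append(cp)
            cp = 0
            pending = False
    if pending:
        raise ValueError("truncated UTF-9 sequence")
    return out


def utf18_encode(cp):
    """Encode one scalar value as a single 18-bit integer."""
    _check_scalar(cp)
    plane = cp >> 16
    if plane in (0, 1, 2):
        return cp
    if plane == 14:
        # Assumption: Plane 14 takes the fourth top-bits value (3).
        return 0x30000 | (cp & 0xFFFF)
    raise ValueError("plane %d is not representable in UTF-18" % plane)


def utf18_decode(value):
    """Inverse of utf18_encode."""
    if not 0 <= value <= 0x3FFFF:
        raise ValueError("not an 18-bit value")
    top = value >> 16
    plane = 14 if top == 3 else top
    return (plane << 16) | (value & 0xFFFF)
```

Worked by hand from the description, U+0391 comes out as the two nonets 403 221 (octal), and U+10FFFD as 420 777 375.  In UTF-18, U+E0041 becomes 0x30041, while anything in Planes 3 through 13 or 15 through 16 raises an error.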

Ken Thompson once said that the reason Unix was never ported to the
PDP-10 was that there are no 9-bit magtapes.

> As far as I can see, it would be entirely conformant to use an
> encoding in which each Unicode character is mapped to a sequence of
> one or more 13-bit bytes. The only slight problem is that an XML
> parser that understands this encoding would not be conformant unless
> it also understood UTF-8 and UTF-16; and it's not entirely clear to me
> how UTF-8 and UTF-16 would look when stored on a machine with a 13-bit
> byte length.

I agree, although on such a machine it would probably be best to just
stick to octets and waste the other 5 bits.  That's essentially what the
RFC recommends when you must use UTF-8 or UTF-16 on non-8-bit architectures.
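A small Python sketch of the "one octet per byte, waste the rest" approach on a hypothetical 13-bit machine, with each native byte modelled as an integer.  The names and the strictness check are my own illustration, not anything taken from the RFC or from the XML Recommendation.

```python
CELL_BITS = 13  # width of one native byte on the hypothetical machine


def store_octets(text, encoding="utf-8"):
    """Encode text and place each octet in its own 13-bit cell.

    Only the low-order 8 bits of each cell are used; the upper
    CELL_BITS - 8 bits are left as zero.
    """
    return list(text.encode(encoding))


def load_octets(cells, encoding="utf-8"):
    """Recover text from cells, insisting that the wasted bits are zero."""
    for cell in cells:
        if not 0 <= cell < (1 << CELL_BITS):
            raise ValueError("value does not fit in a %d-bit cell" % CELL_BITS)
        if cell >> 8:
            raise ValueError("upper bits of a cell must be zero")
    return bytes(cells).decode(encoding)
```

With this layout UTF-8 and UTF-16 keep their ordinary octet-level definitions, at a cost of five unused bits per 13-bit byte.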

-- 
How they ever reached any  conclusion at all    <cowan@ccil.org>
is starkly unknowable to the human mind.        http://www.ccil.org/~cowan
        --"Backstage Lensman", Randall Garrett

