
RE: UTF-8+names




Tim Bray wrote:
> 
> 
> Alessandro Triglia wrote:
> 
> > As I understand, in UTF-8+name, an ampersand is represented as  &&;
> > which means that, if UTF-8+name is used for XML, "normal" entity
> > references will look like:
> > 
> > 	&&;myentity;
> > 
> > and numeric character references will look like:
> > 
> > 	&&;#12345;
> 
> No.  &&; represents an ampersand.  Normally it wouldn't be used in text
> you were going to feed to an XML processor because XML processors don't
> like that.  &myentity; represents just "&myentity;" because UTF-8+names
> doesn't assign a replacement.  &uuml; represents a single u+umlaut
> character, inherited from HTML.


If my understanding is correct, UTF-8+names is just another encoding of
Unicode, like UTF-8 or UTF-16.

What an encoding (of Unicode) should do is define a mapping between Unicode
characters (code points) and bit/byte patterns.  Your document implies that
AMPERSAND is encoded as the following sequence of 3 bytes:

	0x26 0x26 0x3B

(which, when interpreted as a UTF-8 encoding, looks like  & & ;)

and (for example) the character  NO-BREAK SPACE  (160) is encoded as the
following sequence of 6 bytes:

	0x26 0x6E 0x62 0x73 0x70 0x3B

(which, when interpreted as a UTF-8 encoding, looks like  & n b s p ;)

I don't see this as fundamentally different from what (say) UTF-8 does,
which encodes  AMPERSAND  as the single byte:

	0x26

and  NO-BREAK SPACE  as a sequence of two bytes:

	0xC2 0xA0
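
To make the comparison concrete, here is a small Python sketch that prints both byte sequences side by side. The UTF-8 bytes come from Python's built-in codec; the UTF-8+names bytes are simply the sequences quoted in this message, typed in by hand, not the output of a real UTF-8+names codec. The helper `hex_bytes` is a name made up for this illustration.

```python
# Characters discussed in the message.
AMPERSAND = "\u0026"
NO_BREAK_SPACE = "\u00a0"

# What plain UTF-8 produces (Python's built-in codec).
utf8 = {
    "AMPERSAND": AMPERSAND.encode("utf-8"),
    "NO-BREAK SPACE": NO_BREAK_SPACE.encode("utf-8"),
}

# The byte sequences this message attributes to UTF-8+names.
# Assumption: copied from the text above, not produced by a real codec.
utf8_names = {
    "AMPERSAND": bytes([0x26, 0x26, 0x3B]),
    "NO-BREAK SPACE": bytes([0x26, 0x6E, 0x62, 0x73, 0x70, 0x3B]),
}


def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated 0xNN values."""
    return " ".join(f"0x{b:02X}" for b in data)


for name in utf8:
    print(f"{name}: UTF-8 = {hex_bytes(utf8[name])}; "
          f"UTF-8+names = {hex_bytes(utf8_names[name])}")
```

Seen this way, both are tables from code points to byte sequences; only the lengths and contents of the sequences differ.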


Now, I see that in XML 1.0, an entity reference or numeric character
reference is introduced by an  AMPERSAND  character.  The actual bytes that
represent the  AMPERSAND  character depend on the encoding used, and may or
may not be a single 0x26 byte.

Since in UTF-8+names  AMPERSAND  is encoded as  0x26 0x26 0x3B , an entity
reference will be encoded as:

	0x26 0x26 0x3B  followed by the bytes encoding the characters of
	the name, plus a semicolon

which, when interpreted as a UTF-8 encoding, looks like

	& & ; m y e n t i t y ;


I have indeed noticed in the I-D that a sequence of bytes that looks like a
reference but is not recognized as a reference must be left as is by the
codec, byte by byte.  Therefore I will be able to use, as you say:

	& m y e n t i t y ;

as an alternative to the full form:

	& & ; m y e n t i t y ;

if and only if no replacement is defined for  & m y e n t i t y ;  in
UTF-8+names and I know this.

However, if a replacement is defined for   & m y e n t i t y ;  in
UTF-8+names, I need to use the full form    & & ; m y e n t i t y ;   to
prevent the codec from replacing my entity reference with its own
replacement.
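
The decoding rules as I read them can be sketched in Python as follows. This is only my reading, not the I-D's normative algorithm: the table `REPLACEMENTS` is a hypothetical two-entry subset, the name syntax in `_REF` is an assumption, and `decode_names` is a name invented for this example.

```python
import re

# Hypothetical subset of the name table; the real list is defined by the I-D.
REPLACEMENTS = {"nbsp": "\u00a0", "uuml": "\u00fc"}

# Matches "&&;" or "&name;". Assumption: names are ASCII letters and digits.
_REF = re.compile(r"&(&|[A-Za-z][A-Za-z0-9]*);")


def decode_names(data: bytes) -> str:
    """Decode UTF-8, then apply the name replacements described above."""
    text = data.decode("utf-8")

    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name == "&":
            return "&"  # "&&;" stands for a single ampersand
        # Unrecognized names are left exactly as they were written.
        return REPLACEMENTS.get(name, match.group(0))

    return _REF.sub(replace, text)


print(decode_names(b"&myentity;"))    # no replacement defined: unchanged
print(decode_names(b"&&;myentity;"))  # full form: also yields &myentity;
print(decode_names(b"&&;nbsp;"))      # full form protects the XML reference
print(repr(decode_names(b"&nbsp;")))  # short form is replaced by the codec
```

The last two calls show the collision: only the full form keeps an XML reference to `nbsp` intact, because the short form is consumed by the codec before the XML processor sees it.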


What would be the recommended behavior of a program generating a UTF-8+names
encoding from a string of Unicode characters?  Whenever it encounters an
AMPERSAND  character in the string, what byte(s) should it generate for it?
Should it look at the (XML 1.0) context to see whether this ampersand is the
first character of an XML entity reference or numeric character reference,
and then choose between the single byte  0x26  and the three bytes
0x26 0x26 0x3B ?  That choice would depend on the context, on whether the XML
entity name is identical to a name that has a replacement, and on whether the
definition of that XML entity is identical to the replacement.

This also means that the rules to be followed by the codec on encoding would
depend on its knowledge of XML 1.0 (one layer above it), which I don't see
as a desirable property of a codec.

Would you recommend this complex behavior, or the simple and safe behavior
of encoding all  AMPERSANDs  as  0x26 0x26 0x3B?
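
For illustration, the simple behavior could look like the Python sketch below. It is written under assumptions: `NAMES` is a hypothetical two-entry subset of the name table, `encode_names` is a made-up function name, and whether an encoder must use the named form for a character such as NO-BREAK SPACE (rather than its plain UTF-8 bytes) is for the I-D to say; the sketch simply assumes it does.

```python
# Hypothetical subset of the name table; the real list is defined by the I-D.
NAMES = {"\u00a0": "nbsp", "\u00fc": "uuml"}


def encode_names(text: str) -> bytes:
    """Encode every AMPERSAND as "&&;" without looking at any XML context."""
    out = []
    for ch in text:
        if ch == "&":
            out.append("&&;")            # always the three bytes 0x26 0x26 0x3B
        elif ch in NAMES:
            out.append(f"&{NAMES[ch]};")  # assumed: named form is used
        else:
            out.append(ch)
    return "".join(out).encode("utf-8")


print(encode_names("&myentity;"))  # b'&&;myentity;'
print(encode_names("&#12345;"))    # b'&&;#12345;'
print(encode_names("x\u00a0y"))    # b'x&nbsp;y'
```

The encoder needs no knowledge of XML 1.0: it treats each character in isolation, at the cost of two extra bytes per ampersand.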

Alessandro


> 
> -- 
> Cheers, Tim Bray (http://www.tbray.org/ongoing/)
> 
> 
> 
> -----------------------------------------------------------------
> The xml-dev list is sponsored by XML.org 
> <http://www.xml.org>, an initiative of OASIS 
<http://www.oasis-open.org>

The list archives are at http://lists.xml.org/archives/xml-dev/

To subscribe or unsubscribe from this list use the subscription
manager: <http://lists.xml.org/ob/adm.pl>


