RE: UTF-8+names
Tim Bray wrote:
>
> Alessandro Triglia wrote:
>
> > As I understand, in UTF-8+name, an ampersand is represented as &&;
> > which means that, if UTF-8+name is used for XML, "normal" entity
> > references will look like:
> >
> >    &&;myentity;
> >
> > and numeric character references will look like:
> >
> >    &&;#12345;
>
> No.  &&; represents an ampersand.  Normally it wouldn't be used in text
> you were going to feed to an XML processor because XML processors don't
> like that.  &amp; represents just "&amp;" because UTF-8+names doesn't
> assign a replacement.  &uuml; represents a single u+umlaut character,
> inherited from HTML.

If my understanding is correct, UTF-8+names is just another encoding of Unicode, like UTF-8 or UTF-16.

What an encoding (of Unicode) should do is define a mapping between Unicode characters (code points) and bit/byte patterns.

Your document implies that AMPERSAND is encoded as the following sequence of 3 bytes:

    0x26 0x26 0x3B

(which, when interpreted as a UTF-8 encoding, looks like & & ;)

and (for example) the character NO-BREAK SPACE (160) is encoded as the following sequence of 6 bytes:

    0x26 0x6E 0x62 0x73 0x70 0x3B

(which, when interpreted as a UTF-8 encoding, looks like & n b s p ;)

I don't see this as fundamentally different from what (say) UTF-8 does, which encodes AMPERSAND as the single byte:

    0x26

and NO-BREAK SPACE as a sequence of two bytes:

    first-byte second-byte (didn't spend time to determine them)

Now, I see that in XML 1.0, an entity reference or numeric character reference is introduced by an AMPERSAND character. The actual bytes that represent the AMPERSAND character depend on the encoding used, and may or may not be a single 0x26 byte.
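For concreteness, the byte sequences mentioned above can be checked with a few lines of Python. This is only an illustration using Python's standard UTF-8 codec. The variable names are illustrative, and the UTF-8+names sequences are simply the bytes quoted above, not the output of a real UTF-8+names codec.

```python
# Plain UTF-8 encodings, produced by Python's built-in codec.
amp_utf8 = "&".encode("utf-8")
nbsp_utf8 = "\u00a0".encode("utf-8")   # NO-BREAK SPACE, code point 160

# UTF-8+names byte sequences as quoted in the message (typed in by hand).
amp_names = bytes([0x26, 0x26, 0x3B])
nbsp_names = bytes([0x26, 0x6E, 0x62, 0x73, 0x70, 0x3B])

print(amp_utf8.hex(" "))            # 26
print(nbsp_utf8.hex(" "))           # c2 a0
print(amp_names.decode("ascii"))    # &&;
print(nbsp_names.decode("ascii"))   # &nbsp;
```

So the two UTF-8 bytes left undetermined above are 0xC2 0xA0.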
Since in UTF-8+names AMPERSAND is encoded as 0x26 0x26 0x3B, an entity reference will be encoded as:

    0x26 0x26 0x3B followed by the bytes encoding the characters of the name plus a semicolon

which, when interpreted as a UTF-8 encoding, looks like

    & & ; m y e n t i t y ;

I have indeed noticed in the I-D that a sequence of bytes that looks like a reference but is not recognized as a reference must be left as is by the codec, byte by byte. Therefore I will be able to use, as you say:

    & m y e n t i t y ;

as an alternative to the full form:

    & & ; m y e n t i t y ;

if and only if no replacement is defined for & m y e n t i t y ; in UTF-8+names and I know this.

However, if a replacement is defined for & m y e n t i t y ; in UTF-8+names, I need to use the full form & & ; m y e n t i t y ; to prevent the codec from replacing my entity reference with its own replacement.

What would be the recommended behavior of a program generating a UTF-8+names encoding from a string of Unicode characters? Whenever it encounters an AMPERSAND character in the string, what byte(s) should it generate for it?

Should it look at the (XML 1.0) context to see if this ampersand is the first character of an XML entity reference or numeric character reference, and then generate a single 0x26 byte or the three bytes 0x26 0x26 0x3B depending on the context and depending on whether it has encountered an XML entity name that is identical to a replacement, and depending on whether the definition of that XML entity is identical to the replacement?

This also means that the rules to be followed by the codec on encoding would depend on its knowledge of XML 1.0 (one layer above it), which I don't see as a desirable property of a codec.

Would you recommend this complex behavior, or the simple and safe behavior of encoding all AMPERSANDs as 0x26 0x26 0x3B?
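The "simple and safe" option can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the codec from the I-D: `NAMES` is a hypothetical two-entry table standing in for the real name table, the reference syntax is simplified to ASCII letters and digits, and `decode` and `encode_simple` are illustrative names.

```python
import re

# Hypothetical two-entry name table; the real table is defined by the I-D.
NAMES = {"nbsp": "\u00a0", "uuml": "\u00fc"}

# Matches "&&;" or "&name;" (simplified name syntax, an assumption).
_REF = re.compile(r"&(&|[A-Za-z][A-Za-z0-9]*);")

def decode(data: bytes) -> str:
    """Decode bytes, replacing only the references the table recognizes."""
    def replace(match):
        name = match.group(1)
        if name == "&":
            return "&"                      # "&&;" stands for AMPERSAND
        # Unrecognized references are left exactly as they were.
        return NAMES.get(name, match.group(0))
    return _REF.sub(replace, data.decode("utf-8"))

def encode_simple(text: str) -> bytes:
    """Encode every AMPERSAND as 0x26 0x26 0x3B, with no XML knowledge."""
    return text.replace("&", "&&;").encode("utf-8")

xml_text = "&nbsp; &myentity; A&B"
encoded = encode_simple(xml_text)
print(encoded)                        # b'&&;nbsp; &&;myentity; A&&;B'
print(decode(encoded) == xml_text)    # True
print(decode(b"&myentity;"))          # &myentity;  (no replacement defined)
print(decode(b"&nbsp;") == "\u00a0")  # True (replaced by the codec)
```

With this table, a bare `&myentity;` survives decoding unchanged, while a bare `&nbsp;` is replaced by NO-BREAK SPACE. That is the collision case described above, and `encode_simple` avoids it without looking at the XML layer.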
Alessandro

> --
> Cheers, Tim Bray (http://www.tbray.org/ongoing/)
>
> -----------------------------------------------------------------
> The xml-dev list is sponsored by XML.org <http://www.xml.org>, an
> initiative of OASIS <http://www.oasis-open.org>
>
> The list archives are at http://lists.xml.org/archives/xml-dev/
>
> To subscribe or unsubscribe from this list use the subscription
> manager: <http://lists.xml.org/ob/adm.pl>