Re: XML-1.1 -- just ignore it
First I should say - just in case there's any doubt about it - that
though I am a member of the XML Core WG, this is just my personal view.

I don't really care about any of this stuff. I have little use for
anything but ASCII in XML names. Control characters rarely cause me
trouble. I don't use IBM mainframes. I have no wish to ever normalize
Unicode text. But the Core WG has been asked to consider all these
issues, which for most of us is just another chore. There's only a very
small group of people who find character set issues interesting, and
I'm not one of them.

In article <004301c1847b$8f696f20$01b7c0d8@A...> Rick writes:

>By allowing any character in names, it means that we can have WF XML 1.1
>documents which merely opening in a text editor (even an editor for the
>document encoding) will corrupt with a well-formedness error: if people use
>characters in names which may be split at by automated line-wrapping. A
>markup language which safe practise is to *never* open an entity in a text
>editor? Excellent advance!

Evidently people don't want to be stuck with Unicode 2.1 for XML Names.
Now XML could either move to some newer version of Unicode, or have some
automatic mapping from Unicode character classes to name characters, or
just allow (almost) everything and say that it's not the business of XML
to deal with this sort of detail. The first two both require parsers to
change as Unicode changes; the third is a once-and-for-all change - I
think that's the main reason why it is what has been proposed.

The issue of line-wrapping is not one I was aware of, and one purpose of
publishing Working Drafts is precisely to get feedback from experts like
Rick on such things. On the other hand, I would never use an editor that
line-wrapped *any* of my files without an explicit command. Do you really
use such things on XML files? (I did once inadvertently do fill-region
on a Python program, which took an hour to fix.)
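To make the trade-off between the second and third options concrete, here is a rough Python sketch. Both rules are hypothetical simplifications invented for illustration; neither is the XML 1.0 production nor the one in the XML 1.1 draft. The class-based rule asks the Unicode tables bundled with the running Python, so its answers can change when those tables are updated, while the permissive rule is fixed arithmetic on code points.

```python
import unicodedata

# Hypothetical, simplified rules for illustration only; neither is the
# real XML 1.0 or XML 1.1 name production.
LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}


def is_name_start_by_class(ch: str) -> bool:
    """Class-based rule: depends on the Unicode tables in this Python build."""
    return ch in "_:" or unicodedata.category(ch) in LETTER_CATEGORIES


def is_name_start_permissive(ch: str) -> bool:
    """Fixed rule: ASCII letters, '_' and ':', plus every code point
    from U+00A0 upwards (an assumed cut-off that skips the C1 controls)."""
    cp = ord(ch)
    if cp < 0x80:
        return ch in "_:" or "A" <= ch <= "Z" or "a" <= ch <= "z"
    return cp >= 0xA0


if __name__ == "__main__":
    for ch in ["A", "1", "\u00e9", "\u2028"]:
        print(f"U+{ord(ch):04X}",
              is_name_start_by_class(ch),
              is_name_start_permissive(ch))
```

Under these toy rules the two agree on ordinary letters and digits, but the permissive rule also accepts U+2028 (LINE SEPARATOR), which is the kind of character an editor might treat as a place to break a line.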
>I would guess that putting in Issue 18 and Issue 21 (should control
>characters
>be allowed? should 0x00 be allowed?) are just sacrificial lambs, put in to
>be removed later but not serious suggestions.

Not really. They were both seriously proposed. I think the case of nul
is certain to be rejected; it was left in for completeness. But there
are many applications where it would be useful to include control
characters - you have only to look back at the archives of this mailing
list and comp.text.xml to see people asking for advice on the matter.

>A markup language which was unsafe to store in files

I don't know what you mean by that.

>or to transmit on serial lines

Really? I have often transmitted 8-bit data over serial lines.

>or as text/*?

If people want to include control characters in their data, they will
have that problem regardless of whether they mark it up in XML.

One possibility I have heard of is that control characters could be
allowed, but only as character references.

Your comment about "sacrificial lambs" seems to suggest that someone is
trying to push this through against public opposition. As far as I am
concerned, we are only doing this because people want it. In fact, yours
is the only such strong opposition I have heard.

>It would be interesting to speculate what principle causes characters to be
>considered whitespace: certainly it is not that all visible space should be
>whitespace (one sensible rule) or that only ASCII should be space.

>Why is not just mapping NEL to #A on input enough to satisfy the IBM
>requirement?

Anything mapped to #A on input should also be a whitespace character, so
that it behaves consistently if it appears as a character reference in
an internal entity. U+2028 has been added because it has Unicode backing
as a universal line separator.

>This gives us a markup language in which all markup a WF document could look
>by inspection as if every character is ASCII but could not be serialized out
>to ASCII because of NELs or LS characters.

Unicode is full of characters that look like ASCII but aren't. Most of
the Greek capitals, for example.

>Another great joke is to "simplify" the naming rules to free a parser from
>having to worry about future upgrades to Unicode, but then requiring
>Normalized data (and suggesting it should be an error): surely this just
>ties the parser to having to know a particular version of Unicode to know
>which normalization rules to use!

I think most parser writers don't want to have to check for
normalization; it is the i18n people who are pressing for this. I agree
that this seems either to argue against normalization or to weaken the
case for allowing all characters in names. Or is it intended that future
Unicode additions won't change the normalization rules?

-- Richard
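For readers who want to poke at the characters discussed above, a small Python sketch follows. It uses only the standard library; the helper name `can_encode_ascii` is made up for this example, and nothing here is taken from the XML 1.1 draft. It shows that NEL and LS act as line ends yet cannot be written as ASCII, that a Greek capital alpha is not the letter "A", and that a normalization check is tied to whatever Unicode data the runtime ships.

```python
import unicodedata

NEL = "\u0085"  # NEXT LINE, the line end used on IBM mainframes
LS = "\u2028"   # LINE SEPARATOR


def can_encode_ascii(text: str) -> bool:
    """Return True if every character in text is representable in ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


# Python's str.splitlines() treats both NEL and LS as line boundaries.
lines = f"a{NEL}b{LS}c".splitlines()

# GREEK CAPITAL LETTER ALPHA renders like "A" but is a different character.
greek_alpha = "\u0391"

# "e" followed by COMBINING ACUTE ACCENT composes to a single code point
# under NFC, using the Unicode tables bundled with this Python build.
composed = unicodedata.normalize("NFC", "e\u0301")

if __name__ == "__main__":
    print(lines)
    print(can_encode_ascii(NEL), can_encode_ascii(LS))
    print(greek_alpha == "A", can_encode_ascii(greek_alpha))
    print(composed == "\u00e9")
    print("Unicode data version:", unicodedata.unidata_version)
```

The last line prints the Unicode version behind `unicodedata`, which is the dependency Rick's objection is about: a parser that checks normalization has to carry tables of this kind.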