This Blueberry issue is not a slam-dunk either way; it's a genuinely hard issue. Actually, it's several genuinely hard issues in an unattractive package, namely:

 - the NEL character as a line separator
 - the proper relationship to Unicode
 - how to version XML

At the moment, based on a certain amount of introductory thought, but not by an overwhelming margin, I lean to doing nothing, simply because the cost of the Blueberry action seems to outweigh its benefits.

Benefits first: I think that the specific Blueberry suggestions (NEL and the new Name characters) are probably technically correct from any sensible reading of Unicode/ISO 10646. Per Unicode, NEL is a first-class line delimiter, at least equal in status to CR and LF, and arguably superior, since it's a single character with a clear semantic, not a holdover from archaic typewriter-cylinder control characters.

Secondly, the benefit is significant. The IBM mainframe people followed all the rules in adopting NEL as a line separator in their mainstream software libraries, and if XML doesn't change, it means you can either use standard text-handling tools OR you can use XML, but not both. OS/MVS and its successors are hardly hip or fashionable, but they serve as the stewards of remarkably huge amounts of high-quality data, and any move to enable the XMLification of this stuff is praiseworthy.

But the costs seem pretty darn high to me. If Blueberry is adopted and is given a new <?xml version number, the mass of already-deployed XML software will correctly throw such data on the ground, at some considerable cost to interoperability. If there is no new <?xml version number, then such software will try to read it but then unpredictably throw it on the ground upon encountering the first NEL that appears inside a tag - or the first element type or attribute name using one of the non-XML-1.0 name characters.
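To make the NEL point concrete, here's a minimal sketch (in Python, my own illustration, not anything from the proposal itself) of what a Blueberry-style line-end rule would add on top of XML 1.0's CR/CRLF normalization:

```python
# XML 1.0 folds CRLF and lone CR to LF; a Blueberry-style rule would
# additionally fold NEL (U+0085) -- and CR+NEL, which EBCDIC-derived
# data may contain -- to LF. The exact rule here is my guess.

def normalize_10(text: str) -> str:
    # XML 1.0 behavior: CRLF and lone CR become LF; NEL is untouched
    return text.replace("\r\n", "\n").replace("\r", "\n")

def normalize_blueberry(text: str) -> str:
    # Hypothetical Blueberry behavior: NEL is a line end too
    text = text.replace("\r\x85", "\n")  # fold CR+NEL first
    return normalize_10(text).replace("\x85", "\n")

sample = "line1\x85line2\r\nline3"
print(repr(normalize_10(sample)))        # 'line1\x85line2\nline3' - NEL survives
print(repr(normalize_blueberry(sample))) # 'line1\nline2\nline3'
```

This is the nub of the interoperability hazard: to an XML 1.0 parser, the NEL in the sample is just an ordinary character, so the two sides disagree about where the lines are.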
Of these two, the second problem seems more damaging, so I'd argue pretty strongly for signaling Blueberry documents with some value of <?xml version="X" ?> where X is not "1.0". And the cost of this is very high: at the moment, XML 1.0 is pretty effectively one thing and just one thing, and if I claim to ship out XML and you claim to be able to read it, we can usually interoperate, especially since we're both probably using expat or xerces or msxml. Introducing Blueberry will impair this admirable simplicity.

A subsidiary issue, by the way: if you add NEL to the set of line-end characters, there are a bunch of other Unicode space and space-like characters that, to be fair, you're going to have to consider adding to the production for "S".

And a possibly minor point: at the moment, all the "syntax" characters in XML (<, >, /, =, &, ;, [, ], ', and ") are in the one-byte Unicode range 0-127, which does enable some sneaky parser-construction tricks - probably not a big deal, though.

Then another potential problem: if you decide to push XML past version 1.0, why not take the opportunity to pour in namespaces? And fix the white-space handling? And... well, probably nobody is willing to step off this cliff, so maybe I'm raising a red herring.

The final issue - and I'm not sure whether it's a problem or an opportunity - is the nature of the relationship between XML and Unicode. In XML 1.0, 1st and 2nd editions, it is clear that all Unicode characters (except for a few low-valued control characters, sigh) are legal in XML text, and then there's this exhaustive enumeration of the characters that are legal in XML names. This causes problems in two areas: how to keep up with changing versions of Unicode, and how to justify XML's private-label collection of Name characters?
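For the curious, here's the flavor of the "sneaky parser construction tricks": since every syntax character is below 128, and every byte of a multi-byte UTF-8 sequence is 0x80 or above, a parser can hunt for markup in the raw byte stream without decoding at all. A sketch of my own, not lifted from any particular parser:

```python
# All XML syntax characters are < 0x80, and UTF-8 lead/continuation
# bytes for non-ASCII characters are all >= 0x80, so a byte like 0x3C
# ("<") can only ever mean "<". Scan raw bytes, never decode.

def find_tags(data: bytes):
    """Yield (start, end) byte offsets of tag-like spans in raw UTF-8."""
    i = 0
    while True:
        start = data.find(b"<", i)   # safe: 0x3C can't occur inside a
        if start == -1:              # multi-byte UTF-8 sequence
            return
        end = data.find(b">", start)
        if end == -1:
            return
        yield (start, end + 1)
        i = end + 1

doc = "<p>caf\u00e9 na\u00efve</p>".encode("utf-8")
print([doc[s:e] for s, e in find_tags(doc)])  # [b'<p>', b'</p>']
```

Nothing in Blueberry as proposed breaks this particular trick, since the new characters are name and line-end characters rather than syntax characters, but it illustrates how much deployed code quietly depends on details of the 1.0 character story.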
One could just outsource the problem to Unicode and say "a name character is what Unicode says", but XML 1.0 decided not to do this, after exhaustive consideration (most of which I've thankfully forgotten), and I've never heard a really powerful (either in conviction or logic) argument that this choice was wrong. There's a coherent but rather ad-hoc (and non-normative) explanation of the choice of XML name characters in one of the appendices. I believe John Cowan has suggested that he has a better algorithm/heuristic; please share. Having a cleaner, simpler relationship between XML and Unicode is arguably a good thing.

To summarize: the inability to support standard mainframe software and certain language groups' characters in markup, while regrettable, is a problem whose cost is a judgement call. It is possible and reasonable to compare this cost with the cost described above of bifurcating XML three years into its life (another judgement call), and make a third judgement call as to the relative magnitude of those costs.

To cast it in the starkest possible light: is it a reasonable trade-off to say that we will live with an incorrect interpretation of Unicode in certain specific areas, with the consequences of complicating the lives of mainframe users and impoverishing the tools available to worthy users of certain minority languages, to achieve the benefit of keeping XML monolithic and unitary? Yes, it's reasonable. I might be convinced that it's wrong, but it's a reasonable argument that needs to be addressed. Corollary: it's not enough simply to say "Blueberry is more correct per Unicode, thus we have to do it, end of debate."

So I think it would be appropriate, in this discussion, to have some people in the mainframe trenches give us a briefing on the scale and the difficulty of the problems they face, and for some of our i18n gurus to highlight the problems faced by an XML language designer who wants to use one of the newly-added languages.
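For concreteness, the "outsource it to Unicode" option might look roughly like this - deriving name characters from Unicode general categories instead of XML 1.0's enumerated ranges. The particular category choices below are my illustrative guesses, not anything any WG has blessed:

```python
import unicodedata

# Sketch of "a name character is what Unicode says": classify by
# Unicode general category rather than an enumerated character table.
# The category sets here are illustrative, not normative.

START_CATS = {"Lu", "Ll", "Lt", "Lo", "Nl"}          # letters
EXTRA_CATS = START_CATS | {"Mn", "Mc", "Nd", "Pc"}   # plus marks, digits, connectors

def is_name(s: str) -> bool:
    if not s:
        return False
    first, rest = s[0], s[1:]
    if not (first in "_:" or unicodedata.category(first) in START_CATS):
        return False
    return all(c in "_:-." or unicodedata.category(c) in EXTRA_CATS
               for c in rest)

print(is_name("\u00e9l\u00e9ment"))  # True - no private-label table needed
print(is_name("2bad"))               # False - digits can't start a name
```

The attraction is that new Unicode versions come along for free; the worry, of course, is that "for free" means name-legality can silently change between Unicode versions, which is part of why XML 1.0 froze its own table.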
On the other side, we should consider the practicalities and costs of upgrading (or not) the installed base in the face of the deployment of data encoded in XML Blueberry. I.e., let's keep this pragmatic.

Pardon the length; I was sitting in SFO with an hour to kill.

 -Tim



