Recent XML WG decisions
While it is not our usual policy to post decisions of the XML Working Group to xml-dev, the last three WG meetings have seen a number of issues decided that bear directly on current experimental XML implementations. Following are reports prepared by C. M. Sperberg-McQueen and Tim Bray detailing recent decisions that will be incorporated into the next working draft.

Jon

----------------------------------------------------------------------
Jon Bosak, Online Information Technology Architect, Sun Microsystems
901 San Antonio Road, MPK17-101, Palo Alto, California 94303
----------------------------------------------------------------------
ISO/IEC JTC1/SC18/WG8::NCITS V1::Davenport::SGML Open::W3C XML WG
It is earlier than we think. -- Vannevar Bush
----------------------------------------------------------------------

From: "C. M. Sperberg-McQueen" <cmsmcq@h...>
Subject: XML WG decisions of 27 August 1997

The XML Work Group discussed the following questions, and made the decisions indicated, in the meeting of 27 August 1997.

Present: Jon Bosak, James Clark, Steve DeRose, Eliot Kimber, Eve Maler, Makoto Murata, Peter Sharpe, C. M. Sperberg-McQueen.

1. A decision on case folding was postponed.

Background: The current draft XML spec requires that most names (i.e. generic identifiers, attribute names, IDs, IDREFs, name tokens in attribute values, PI targets, notation names, and document type names) be case-folded, while entity names are case sensitive. It has been repeatedly urged that this be changed and that all names be case-sensitive.

The arguments are familiar:

For case folding: since the reference concrete syntax requires case folding, many current users of SGML and HTML are familiar with and have come to expect this behavior.

For case sensitivity: since SGML parsers are required to fold up, rather than down, the XML spec is inconsistent with recommended Unicode practice.
(Unicode recommends folding down rather than up since there are slightly fewer unpleasant surprises and inconsistencies that way.) There is *no* rule for case folding which works in the culturally expected manner for all speakers of all alphabetic languages: a lower-case e with acute accent is (correctly) uppercased one way in Quebec and a different way in metropolitan France. Lowercase I (with a dot) is uppercased one way in Turkish and another way in other languages using the Latin alphabet.

A strong majority of those participating felt that we should make XML case sensitive and drop case folding, but in view of the sensitive nature of the decision, it was decided to postpone the decision until a larger fraction of the work group was present.

2. XML characters range from #x0 to #x10FFFF.

Decision: Legal XML characters are those representable in UTF-16 / Unicode 2.0, i.e. those in the first seventeen planes of ISO/IEC 10646. Unanimous.

Rationale: The current spec says that XML characters may include any character defined by ISO/IEC 10646. Currently, that standard defines characters only within the Basic Multilingual Plane, each of which can be represented by a string of 16 bits; in principle, however, ISO/IEC 10646 defines a 31-bit character space, and production 2 accordingly defines Character as covering the range #x0 to #x7FFFFFFF, with some gaps for forbidden characters. XML processors, however, are not required to support the flat 32-bit character encoding UCS-4, only the 16- and 8-bit encodings of UCS-2 and UTF-8. (The latter can represent all the characters of the 31-bit character space, but UCS-2 cannot.)

In many places, the XML spec suggests, or at least allows incautious readers to believe, that XML characters are only 16 bits wide. Either way, it's important to eliminate the ambiguity in the spec.

In favor of restricting XML characters to 16 bits: it simplifies life for users of Java and other tools.
It seems clear that the full 31-bit space of 10646 will not be needed, even for extremely specialized applications, in the foreseeable future.

In favor of defining XML characters to be 31 bits wide: 16 bits is manifestly too few for anyone working with historical texts in Han characters. Politically, it would be unwise to give the impression that only the Basic Multilingual Plane is of importance. The surrogate method, while clever, is clearly a hack which demonstrates that the original Unicode claim (16 bits is enough to build an absolutely flat character space which will last for all time) has fallen apart under the pressure of fact; the surrogate method abandons the flat character space which is one of the most important advantages of Unicode.

The compromise (BMP plus the next 16 planes) appears
- well understood
- compatible with Java and other tools which assume 16-bit characters
- sufficient for realistic expectations (even the most extensive of known collections of historical Chinese characters is unlikely to take much more than one of the additional planes; even the user area is sufficiently large, with 131,072 character positions)

3. Processors must support UTF-16, not just UCS-2.

Background: the current draft spec says (4.3.3): "All XML processors must be able to read entities in either UTF-8 or UCS-2." It has been proposed to change this to require support for UTF-8 and UTF-16 (which is UCS-2 plus support for the surrogate-character mechanism by which characters outside the Basic Multilingual Plane may be encoded).

Decision:
(i) XML processors must support 16-bit data streams (i.e. UTF-16) for input.
(ii) They must not corrupt surrogate characters.
(iii) If the processor uses a 16-bit buffer or a 16-bit interface to the downstream application, it must correctly represent numeric character references to non-BMP characters as pairs of surrogate characters.
Unanimous.
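Clause (iii) of the decision above can be illustrated with a short sketch. The Python below is only an illustration under stated assumptions: the names `to_utf16_units` and `expand_char_ref` are hypothetical and come neither from the draft spec nor from any existing XML processor. The arithmetic is the standard UTF-16 surrogate mapping; the sketch does not check XML's other forbidden characters.

```python
import re

# Hypothetical helper names; nothing here is taken from the XML draft itself.
_CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def to_utf16_units(code_point):
    """Return the UTF-16 code units (a list of ints) for one code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the range reachable through UTF-16")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are not characters")
    if code_point < 0x10000:
        return [code_point]              # BMP: a single 16-bit unit
    offset = code_point - 0x10000        # 20 bits split across two units
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return [high, low]


def expand_char_ref(ref):
    """Expand a reference such as '&#x1D11E;' into UTF-16 code units."""
    match = _CHAR_REF.fullmatch(ref)
    if match is None:
        raise ValueError("not a numeric character reference: %r" % ref)
    hex_digits, dec_digits = match.groups()
    code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
    return to_utf16_units(code_point)
```

For example, `expand_char_ref("&#x1D11E;")` yields the pair `[0xD834, 0xDD1E]`, while a BMP reference such as `"&#65;"` yields the single unit `[0x41]`.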
Rationale: since all name characters in XML are in the Basic Multilingual Plane, characters outside the BMP can only appear in XML documents as data. Since an XML processor is required to do nothing more to data than store it and pass it to the downstream application without corrupting it, no special handling is required for surrogate characters. The only new requirement is that processors understand the surrogate-character mechanism for characters outside the BMP, and use it, when necessary, to handle numeric character references correctly.

4. XML will refer to Unicode 2.0 and ISO/IEC 10646 with Am. 1-7.

The current draft spec refers to Unicode 2.0 and ISO/IEC 10646 with Amendments 1 through 5. It has been suggested (a) that XML should refer *only* to Unicode, and (b) that the reference should be to "the current version" of Unicode, so that as Unicode is revised, XML automatically accepts the revisions.

Decision: refer to 10646 with Amendments 1 through 7, but otherwise retain the current reference. I.e. do not drop the reference to ISO/IEC 10646, and do not phrase the reference so as to incorporate changes to Unicode automatically. Unanimous.

Rationale: the agreement between ISO/IEC JTC1/SC2 and the Unicode Consortium to keep Unicode and 10646 synchronized is extremely important to all users. A joint reference to both standards makes clear to both parties that we, as users, wish them to honor that agreement. A reference solely to Unicode would imply clearly that XML would follow Unicode even if Unicode were to diverge from ISO/IEC 10646. The joint reference makes clear our intent: if the Unicode Consortium and SC2 fail to keep the two standards in synch, then XML is not guaranteed to follow either of them. Reference to as yet unpublished standards (which is what reference to "the most recent version" amounts to) is unwise because there is and can be no guarantee that revisions in Unicode and 10646 will not require corresponding revisions to the XML spec.

5. Encoding of external text entities is kept as is.

It has been suggested that by allowing external entities to be in different character encodings, XML is incompatible with ISO 8879, which does not allow this. The WG unanimously reaffirmed its belief that the current draft spec is in fact compatible with ISO 8879 under what is sometimes called the 'new' character model. SGML documents must have a single document character set declaration and thus a single document character set, but this reflects the output from, not the input to, the entity manager, and is thus independent of the character encoding encountered in the actual data stream of the external text entity.

6. Ideographic space is not white space.

Decision (unanimous): ideographic space (#x3000) will be removed from the non-terminals S and PubidCharacter.

Rationale: Ideographic space corresponds more closely to the no-break space (#xA0) than to the standard space character (#x20). #xA0 is not allowed in S, and neither should ideographic space be. It is unlikely, with current standard input methods for kanji, that any operator would unintentionally or accidentally insert an ideographic (#x3000) rather than a Latin (#x20) space within a tag.

7. Binding sources of information for character encodings will be specified.

The current draft spec says nothing about the priority of various sources of information regarding character encodings. Some participants (notably Gavin Nicol and Makoto Murata) have argued that this should be specified.

Decision: The spec should include wording to the following effect:

If an XML document or entity is in a file, the Byte-Order Mark and encoding-declaration PI are used (if present) to determine the character encoding. All other heuristics and sources of information are solely for error recovery.
If an XML document is delivered via the HTTP protocol with a MIME type of text/xml, then the HTTP header determines the character encoding method; all other heuristics and sources of information are solely for error recovery.

If an XML document is delivered via the HTTP protocol with a MIME type of application/xml, then the Byte-Order Mark and encoding-declaration PI are used (if present) to determine the character encoding. All other heuristics and sources of information are solely for error recovery.

-C. M. Sperberg-McQueen

From: "C. M. Sperberg-McQueen" <cmsmcq@h...>
Subject: XML WG decisions of 3 September 1997

The XML Work Group met today (3 Sept 1997) and made the decisions described below. Present were Jon Bosak (JB), Tim Bray (TB), James Clark (JC), Dan Connolly (DC), Steve DeRose (SJD), Paul Grosso (PG), Dave Hollander (DH), Eliot Kimber (EK), Murray Maloney (MMa), Makoto Murata (MMu), Joel Nava (JN), Jean Paoli (JP), Peter Sharpe (PS), and Michael Sperberg-McQueen (MSM).

1. Procedures for determination of character encoding to be described in an appendix.

Background: last week's report of decisions (31 August, posting from U35395@U...), included as item 7 a decision regarding "Binding sources of information for character encodings". The WG revisited the issue, noted that in fact no formal vote on it had been taken (error in the report), and discussed whether such rules belong in the XML language spec or not.

Against inclusion: the rules really apply to the delivery of XML in very specific protocol environments, and should be included in the specification of the protocol. XML will be delivered by many protocols, some of them not yet invented; the language spec should not have to be revised every time a new protocol is deployed or invented.

For inclusion: such conventions are important for encouraging interoperability of XML software.
Conforming processors reading the same material in the same environment should make the same decisions about the character encoding.

Decision: The rules for locating binding information about the character encoding of XML entities (reported last week) will be described in an appendix. They will be accompanied by a note making clear that the rules about http service properly belong in the RFC defining the Mime types text/xml and application/xml, and that when those RFCs are available their text will supersede the recommendations of the appendix. The wording given in the posting of 31 August will be changed by replacing the phrases 'XML document or entity' and 'XML document' with the phrase 'XML entity'. (It has been argued that the term 'entity' is not currently well defined in the XML spec; if the usage of the term is later revised, this occurrence may be changed.)

In favor: all present.

2. A decision on case-folding was postponed again.

A summary of the issues and a request for discussion by the SIG will be posted shortly.

3. XML processors to normalize CR, LF, and CRLF to LF.

Background: the current draft XML spec says nothing about whether or how XML processors or applications should normalize the common line-break sequences CR, LF, and CRLF.

For normalization: since the three sequences are intended, in practice, to have the same meaning, they can be normalized without loss of useful information. If the XML processor does not normalize these sequences, every single downstream XML application will be forced to do so; experience shows that relying on them to do so will result in broken applications and inconsistent behavior.

Against normalization: right now the spec has no concept of line or line break; there is no need to introduce one, so for the sake of economy (and clarity) none should be introduced.

For normalizing to LF: thanks to C's standard IO model, it's what most program libraries provide, and thus what most programs and most programmers expect.
For normalizing to CRLF: it's more consistent with the specifications governing the Web. Last time anybody looked at the ASCII spec, CRLF was the preferred form of this information.

Against CRLF: specifications? On the Web?

Decision: When an XML processor encounters any of the character sequences CR (UTF-16 x000D), LF (UTF-16 x000A), or CR LF (UTF-16 x000D x000A), the processor must pass a single LF character to the downstream application.

(Note: this formulation of the decision presupposes that the set of information which XML processors may or must make visible to downstream applications will be described more fully than it is in the current draft spec. If the WG decides against such a description, this substantive decision will need to be expressed in some other form. If the processor disappears from the XML language specification, as has been proposed, this decision may be expressed as a constraint on whether the differences among line-break sequences in the input stream are 'visible' or 'significant'.)

-C. M. Sperberg-McQueen
University of Illinois at Chicago
tei@u...

From: Tim Bray <tbray@t...>
Subject: XML WG decisions of Wed. Sep. 10

The XML WG met on Wed. Sep. 10th.

Present: Bosak, Kimber, Murata, Clark, Sperberg-McQueen, Wood, Nava, Bos, Maler, Bray, Tigue, Maloney, Paoli, DeRose.

Errors in discussion summaries are, as usual, mine.

1. Discussion of case sensitivity

Few new arguments arose in the discussion of case sensitivity, aside from Steve DeRose's observation that disallowing case folding will, by removing the possibility that attribute values are case-folded, reduce the number of instances where the results of parsing can be affected by the presence/absence of a DTD. (Note that the handling of white space can still be affected in the case where attribute values are known to be tokenized, so the problem hasn't entirely gone away).
This is a summary of points made in a brief last-chance-to-speak-your-mind go-around:

For Case Sensitivity:
- XML will rarely be created by hand and when it happens, it'll be by experts.
- This is a chance to do the right thing early in XML's history and avoid living with a compromise forever.
- Case folding is very easy to specify and to understand.
- It would be nice to be able to map case-sensitive objects, for example DSSSL flow objects, to element types.
- Internationalization experts are unanimously against folding.
- Pleasant experiences with case-sensitive programming languages.
- Casefolding problems are truly vile.
- It will be easy to make XML processors recognize typical user errors and provide helpful error messages.

For Case Folding:
- It would be the right thing to do if we were starting from scratch, but it's too late now.
- There will be serious difficulties dealing with the XML-in-HTML scenario.
- It will make it impossible for HTML ever to be specified as an application of XML as opposed to SGML.
- The XML spec has been out for nine months now; it's late in the game to be making this change.

The Question: Modify the XML specification to achieve the effect of NAMECASE GENERAL NO in SGML.

Yes: Bosak Kimber Murata Clark Sperberg-McQueen Nava Bos Bray Tigue Maloney Paoli DeRose
No: Wood
Abstain: Maler

So XML is now case-sensitive.

1a: Since XML is case sensitive, we must specify the case of our keywords, i.e. <!ELEMENT or <!element. Names not recorded, vote was

Upper: 7
Lower: 3
Abstain: 4

(In this vote, some of the abstains should be taken as don't-cares).

2. Chris Maden's suggestion that NOTATION System Identifiers should be mime types.

The WG liked the idea, but declined to modify the spec to achieve this effect; among other things, URLs and mime types are not syntactically distinguishable. It was the feeling of the group that it would be desirable that a new URL scheme be created to allow a URL to locate a mime type.

3. Discussion of the proposition that the XML spec should say more about what the processor passes the App.

John Tigue has volunteered to write an XML Grove Plan; while there is little sentiment that this should be made normative, it might serve usefully as either a separate application note or an appendix. The WG agreed that the editors should enrich the language of the spec sufficiently to make it clear (as it does with PIs and comments) what a processor may and must make available to an application.

Cheers, Tim Bray
tbray@t... http://www.textuality.com/

PS: For your amusement, I attach the output produced by a moments-ago-updated Lark when asked to process the XML spec:

Loading Testing: Lark V0.92 Copyright (c) 1997 Tim Bray. All rights reserved; the right to use these class files for any purpose is hereby granted to everyone.
Parsing...
Syntax error at line 127:57: Start/End tags differ only in case: p/P
Syntax error at line 367:23: Start/End tags differ only in case: ITEM/item
Syntax error at line 369:51: Start/End tags differ only in case: ITEM/item
Syntax error at line 370:69: Start/End tags differ only in case: item/ITEM
Syntax error at line 454:4: Start/End tags differ only in case: P/p
Syntax error at line 457:50: Start/End tags differ only in case: p/P
Syntax error at line 750:50: Start/End tags differ only in case: termdef/TERMDEF
Syntax error at line 752:34: Start/End tags differ only in case: lhs/LHS
Syntax error at line 755:71: Start/End tags differ only in case: prod/PROD
Syntax error at line 955:43: Start/End tags differ only in case: P/p
Syntax error at line 956:7: Start/End tags differ only in case: ITEM/item
Syntax error at line 959:19: Start/End tags differ only in case: p/P
Syntax error at line 959:26: Start/End tags differ only in case: item/ITEM
Syntax error at line 991:7: Start/End tags differ only in case: list/LIST
Syntax error at line 1031:22: Start/End tags differ only in case: P/p
Syntax error at line 1039:4: Start/End tags differ only in case: p/P
Syntax error at line 1062:4: Start/End tags differ only in case: P/p
Syntax error at line 1137:31: Start/End tags differ only in case: p/P
Syntax error at line 1140:4: Start/End tags differ only in case: p/P
Syntax error at line 1207:4: Start/End tags differ only in case: P/p
Syntax error at line 1278:4: Start/End tags differ only in case: P/p
Syntax error at line 1289:60: Start/End tags differ only in case: p/P
Syntax error at line 1453:7: Start/End tags differ only in case: DIV2/div2
Syntax error at line 1544:4: Start/End tags differ only in case: P/p
Syntax error at line 1586:4: Start/End tags differ only in case: P/p
Syntax error at line 1652:14: Start/End tags differ only in case: P/p
Syntax error at line 1655:19: Start/End tags differ only in case: p/P
Syntax error at line 1675:4: Start/End tags differ only in case: P/p
Syntax error at line 1706:22: Start/End tags differ only in case: P/p
Syntax error at line 1721:36: Start/End tags differ only in case: p/P
Syntax error at line 1726:45: Start/End tags differ only in case: P/p
Syntax error at line 1935:40: Start/End tags differ only in case: P/p
Syntax error at line 2072:4: Start/End tags differ only in case: P/p
Syntax error at line 2376:8: Start/End tags differ only in case: SCRAP/scrap
Syntax error at line 2377:4: Start/End tags differ only in case: P/p
Syntax error at line 2438:8: Start/End tags differ only in case: SCRAP/scrap
Syntax error at line 2530:7: Start/End tags differ only in case: div3/DIV3
Syntax error at line 2595:8: Start/End tags differ only in case: SCRAP/scrap
Syntax error at line 2665:10: Start/End tags differ only in case: p/P
Syntax error at line 2858:7: Start/End tags differ only in case: DIV2/div2
Syntax error at line 3650:19: Start/End tags differ only in case: p/P
Done.

xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To unsubscribe, send to majordomo@i... the following message;
unsubscribe xml-dev
List coordinator, Henry Rzepa (rzepa@i...)
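
The line-break decision in the 3 September report above (item 3: CR, LF, and CR LF are each passed downstream as a single LF) can be sketched in a few lines. This is a minimal Python illustration, not text from the spec: the name `normalize_line_breaks` is hypothetical, and the sketch assumes the processor already holds the decoded entity text as a string.

```python
def normalize_line_breaks(text):
    """Map each CR LF pair, lone CR, or LF to a single LF."""
    # Replace the two-character sequence first, so that CR LF yields one LF
    # rather than two; any CR that remains afterwards is a lone CR.
    return text.replace("\r\n", "\n").replace("\r", "\n")
```

For example, `normalize_line_breaks("a\r\nb\rc\nd")` returns `"a\nb\nc\nd"`. Handling the pair before the lone CR is the only ordering that matters here.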