XML-DEV Mailing List Archive
Subject: FW: FW: Partyin' like it's 1999
Forwarded for Rick Jelliffe:

From: Rick Jelliffe [mailto:ricko@a...]

Derek Denny-Brown wrote:
> XML 1.0 had surrogate pairs. The Unicode 2.0 code-point space was
> 32-bit.

But Unicode 2.0 and Unicode 3.0 did not define any characters at those code points, so the presence of any surrogate characters in names was an error and did not need to be checked by any special mechanism.

For data, there are no standard mappings from user-defined character points in East Asian regional encodings to the Unicode non-BMP PUA, so any use of those in data would be unreliable except by luck within the same platform. (I believe MS has its own mappings to the BMP PUA, but I am not aware that this involves non-BMP code points and therefore, potentially, surrogates.)

XML 1.0 only mentions surrogates to say that they are not characters (i.e. that a character in XML is a Unicode character, not a UTF-16 code unit) and does not require that surrogate pairing be checked. Note that XML 1.0 says "It is a fatal error if an XML entity is determined (via default, encoding declaration, or higher-level protocol) to be in a certain encoding but contains byte sequences that are not legal in that encoding." But, for UTF-16, that just requires going from bytes to surrogates, not from surrogates to non-BMP characters, nor checking that entities don't start with combining characters.

> I over simplified my description of what characters are allowed in
> names, agreed. I have been told by some of our customer reps that the
> allowed character range for names has been a blocking issue for some
> Asian customers. It may be that it is only one key character which is
> causing the problem, or it may be an entire class of characters. I only
> know that it is blocking customer adoption.

The only thing I can think of is the Unicode presentation forms, from U+F900 to U+FFEE. This notably includes full-width Latin and halfwidth katakana.
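[Editor's note: a minimal Java sketch of the distinction Rick draws between byte-level encoding validity (which XML 1.0's fatal-error clause requires) and surrogate-pairing checks (which it does not). The class and method names are illustrative only, not taken from any parser; the surrogate tests use the standard `java.lang.Character` API.]

```java
public class SurrogateCheck {
    // Returns true if every high surrogate is immediately followed by a
    // low surrogate, and no low surrogate appears without a preceding
    // high surrogate. This is the pairing check that goes beyond the
    // byte-to-code-unit decoding XML 1.0 requires for UTF-16.
    static boolean wellPaired(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
                    return false;  // high surrogate with no trailing low
                }
                i++;               // skip the low half of the valid pair
            } else if (Character.isLowSurrogate(c)) {
                return false;      // low surrogate with no leading high
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(wellPaired("plain BMP text"));  // true
        System.out.println(wellPaired("\uD835\uDD4F"));    // true: a valid pair (U+1D54F)
        System.out.println(wellPaired("\uD835"));          // false: lone high surrogate
    }
}
```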
However, note the JIS technical note http://www.w3.org/TR/2000/NOTE-japanese-xml-20000414/ :

"Those digits, Latin characters, and special characters of [JIS X 0208] which are also specified by [JIS X 0201] are deprecated. Likewise, Halfwidth Katakana of [JIS X 0201] are deprecated."

<geek>There is debate over whether the half- and full-width forms really represent different characters (in the context of data that can be marked up); and, if they are different, whether they cause problems and should be deprecated anyway, or whether the analogy is with upper and lower case, where it is the user's problem to figure out when one is used and not the other. At the time, there was a comment that because different UIMs defaulted to full- or half-width forms differently, allowing them into names would cause more incompatibility than excluding them. Unicode 2.0 is clear that the presentation forms are provided to allow round-tripping of texts, even though in Unicode terms they are the same characters.

Note that the latest version of W3C's Character Model, http://www.w3.org/TR/2004/WD-charmod-20040225/#sec-Compatibility , says "Specifications SHOULD exclude compatibility characters in the syntactic elements (markup, delimiters, identifiers) of the formats they define."

Other useful specs include http://www.w3.org/TR/unicode-xml/ which says how to treat characters in data (it says "in markup", but it really means "in marked-up data content" rather than, for example, "in tags"). XML 1.1 removes the checking of those characters, adding them to the characters that should be avoided in names (because they have canonical decompositions).</geek>

All that being said, it is difficult to see how this issue could "block" anyone's use of XML. And, as can be seen, it is part of a larger issue that even JIS has been involved in.

John Cowan wrote:
> Rick Jelliffe wrote:
>> As another matter, Derek mentions spurious whitespace nodes. But if
>> using a DTD (and validating parser) these nodes will not be generated.
> They'll still be generated, because conforming XML parsers have to return
> all character content. But they will be marked as element content
> whitespace.
> For example, the SAX callback "ignorableWhitespace" (which means ignorable
> by applications that wish to ignore it, not ignorable by parsers) is the
> SAX way to signal that the returned character content is element content
> whitespace.

I mean nodes in the DOM: use setIgnoringElementContentWhitespace(true) if this is an issue. It clouds the issue to say this is an XML problem when it is the responsibility of the API (and hence under the programmer's control).

I know that people are talking about "XML" as including APIs, architectures and products, as a milieu, not just the wording of the XML REC. But it is a fallacy to list problems of "XML" considered broadly and then switch to "XML" considered as a limited spec: "I had a problem with API XXX; API XXX uses XML; therefore XML is bad."

If the problem is that there are too many SAX and DOM properties, features and options, then that is not XML's problem: maybe APIs should just have one big switch, DATA|DOCUMENT|ROUND_TRIP, to keep things simpler for programmers:

  DATA        strip out all insignificant whitespace and non-leaf text
              nodes containing only whitespace
  DOCUMENT    strip out insignificant whitespace
  ROUND_TRIP  preserve everything

Cheers
Rick Jelliffe
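[Editor's note: Rick's DOM remedy can be shown concretely with standard JAXP. The sketch below is illustrative (the sample document and class name are invented): it parses a small document with an internal DTD, validating so the parser can identify element content whitespace, and compares the DOM with and without setIgnoringElementContentWhitespace(true).]

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class WhitespaceDemo {
    // Parses a small valid document and returns how many child nodes the
    // root element has in the resulting DOM.
    static int rootChildCount(boolean ignoreElementContentWhitespace) throws Exception {
        String xml =
            "<!DOCTYPE order [\n" +
            "  <!ELEMENT order (item, item)>\n" +
            "  <!ELEMENT item (#PCDATA)>\n" +
            "]>\n" +
            "<order>\n" +
            "  <item>a</item>\n" +
            "  <item>b</item>\n" +
            "</order>";

        DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
        // Element content whitespace can only be identified by a validating
        // parse, because the DTD supplies the content models.
        f.setValidating(true);
        f.setIgnoringElementContentWhitespace(ignoreElementContentWhitespace);
        Document doc = f.newDocumentBuilder()
                        .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        return doc.getDocumentElement().getChildNodes().getLength();
    }

    public static void main(String[] args) throws Exception {
        // With the flag set, only the two <item> elements survive;
        // without it, three whitespace-only text nodes are interleaved.
        System.out.println(rootChildCount(true));   // 2
        System.out.println(rootChildCount(false));  // 5
    }
}
```

This is exactly the switch Rick refers to: the "spurious" nodes are an API-level choice, not something the XML Recommendation forces on the application.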