URI references and UTF-8 based escaping
Section 4.2.2 of the XML 1.0 Recommendation (2nd Ed.) states that:

1. a SystemLiteral is a URI reference
2. a URI reference is defined by RFCs 2396 and 2732

It then goes on to provide informative notes about URI syntax, and mentions UTF-8 based escaping of non-ASCII characters.

However, RFCs 2396 and 2732 do not mandate UTF-8 based escaping. In fact, the decision about how to handle non-ASCII characters and how to communicate that information is left to the scheme specifications (ref: RFC 2396 sec 2.1, toward the end of that section). For example, to find out how to handle non-ASCII characters in URIs that use the http: scheme, consult the HTTP specification.

The URN spec mandates UTF-8 for urn: schemes, but this is not applicable to URIs in general. The HTTP spec does not address the issue at all, nor does HTML. Consequently, you'll find URL-encoding that is based on non-Unicode encodings, particularly in submissions of HTML form data from the major browsers.

XML 1.0 (2nd Ed.) Errata E4 says:

  Replace the last sentence of the paragraph beginning with "URI references require encoding and escaping of certain characters." with the following: "The XML processor must escape disallowed characters as follows:"

This clarifies that UTF-8 based escaping is required for the processing of SystemLiterals by XML parsers, and thus a SystemLiteral is a URI reference that always uses UTF-8 based escaping, rather than whatever the appropriate scheme spec may mandate or implicitly allow.

Here is a scenario that illustrates how the assumption of UTF-8 based escaping could conflict with the URI spec's deference to the scheme specs:

  <?xml version="1.0" encoding="utf-8"?>
  <!DOCTYPE mydoc [
    <!ELEMENT mydoc (#PCDATA)>
    <!ENTITY greeting SYSTEM "http://somewhere/getgreeting?lang=es&name=C%C3%A9sar">
  ]>
  <mydoc>&greeting;</mydoc>

The name César is represented here as C%C3%A9sar in the UTF-8 based escaping.
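
As a rough illustration of that escaping (a Python sketch using the standard urllib.parse module, not anything taken from the XML or URI specs), the same name escapes differently depending on the character encoding chosen, and a receiver that assumes a different encoding recovers different characters. The variable names are only for illustration.

```python
from urllib.parse import quote, unquote

name = "César"

# UTF-8 based escaping: "é" becomes the two bytes C3 A9.
utf8_escaped = quote(name, encoding="utf-8")
print(utf8_escaped)    # C%C3%A9sar

# iso-8859-1 based escaping: "é" becomes the single byte E9.
latin1_escaped = quote(name, encoding="iso-8859-1")
print(latin1_escaped)  # C%E9sar

# A receiver that assumes iso-8859-1 reads the UTF-8 escapes
# as two separate characters, U+00C3 and U+00A9.
misread = unquote(utf8_escaped, encoding="iso-8859-1")
print(misread)         # CÃ©sar
```

Both escaped forms are syntactically valid in a URI; nothing in the escaped string itself says which encoding was used.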

But the getgreeting resource at http://somewhere/getgreeting is iso-8859-1 centric (as it is allowed to be) and expects to interpret the escaped characters as iso-8859-1, not UTF-8 (since HTTP doesn't care). It returns an entity containing a localized greeting phrase, having interpreted the %C3%A9 as U+00C3 U+00A9:

  <?xml version="1.0" encoding="iso-8859-1"?>
  ¡Hola, CÃ©sar!

...and thus you end up with the contents of the mydoc element having César's name misspelled.

In practice, I don't think it's a major issue, but it's something to be aware of. As always, please tell me if I'm full of crap. Thanks.

   - Mike
____________________________________________________________________
Mike J. Brown, software engineer at webb.net in Denver, Colorado, USA
My XML/XSL resources: http://skew.org/xml/