Re: Unicode and attribute URI values?

  • To: xml-dev@l...
  • Subject: Re: Unicode and attribute URI values?
  • From: "Felix Sasaki" <fsasaki@w...>
  • Date: Fri, 16 Sep 2005 15:58:18 +0900
  • Organization: W3C
  • User-agent: Opera M2/8.0 (Win32, build 7561)

Hi,

This is a reply to the following message:

> As part of designing a digital publication open standard (OpenReader),
> we're now discussing the issue of allowed characters within URI
> attribute values in UTF-8 encoded XML documents.

> Reading XML 1.0 and RFC 3986, it is not at all clear (at least to me)
> what is allowed, or how much leeway exists. Specifically, when the
> attribute URI value includes non-ASCII characters (e.g., Greek
> characters), must these non-ASCII characters be percent-encoded in the
> attribute value (effectively "ascii-zing" the attribute value), or can
> the characters be kept natively encoded in the attribute value per the
> text encoding of the document?

> I guess this issue comes under the moniker "International URIs".

> Thanks.

> Jon Noring

Do you know RFC 3987? It defines "Internationalized Resource
Identifiers" (IRIs) and may address many of your problems.

http://www.ietf.org/rfc/rfc3987.txt

Section 6.3 of RFC 3987 says:

    Document formats that transport URIs may have to be upgraded to allow
    the transport of IRIs.  In cases where the document as a whole has a
    native character encoding, IRIs MUST also be encoded in this
    character encoding and converted accordingly by a parser or
    interpreter.  IRI characters not expressible in the native character
    encoding SHOULD be escaped by using the escaping conventions of the
    document format if such conventions are available. Alternatively,
    they MAY be percent-encoded according to section 3.1. For example, in
    HTML or XML, numeric character references SHOULD be used.  If a
    document as a whole has a native character encoding and that
    character encoding is not UTF-8, then IRIs MUST NOT be placed into
    the document in the UTF-8 character encoding.

    Note: Some formats already accommodate IRIs, although they use
    different terminology.  HTML 4.0 [HTML4] defines the conversion from
    IRIs to URIs as error-avoiding behavior.  XML 1.0 [XML1], XLink
    [XLink], XML Schema [XMLSchema], and specifications based upon them
    allow IRIs.  Also, it is expected that all relevant new W3C formats
    and protocols will be required to handle IRIs [CharMod].
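
To illustrate the numeric character reference option mentioned in the
quoted passage, here is a minimal Python sketch. It assumes a document
whose encoding (ISO-8859-1 in the usage note) cannot express Greek
letters. The function name attribute_bytes and the example.org address
are made up for illustration; this is not taken from any specification.

```python
from xml.sax.saxutils import escape


def attribute_bytes(iri: str, encoding: str) -> bytes:
    """Serialize an IRI for use inside a double-quoted XML attribute.

    Hypothetical helper: markup-significant characters are escaped first,
    then any character the target encoding cannot express is written as a
    decimal numeric character reference (&#NNN;).
    """
    # Escape &, <, > and the double quote so the value is safe in an attribute.
    escaped = escape(iri, {'"': "&quot;"})
    # "xmlcharrefreplace" substitutes &#NNN; for unencodable characters.
    return escaped.encode(encoding, errors="xmlcharrefreplace")
```

With these assumptions, attribute_bytes("http://example.org/αβγ",
"iso-8859-1") yields the ASCII-only bytes
http://example.org/&#945;&#946;&#947; whereas with "utf-8" as the
encoding the Greek letters are kept natively as UTF-8 bytes.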

So, to answer your question ("it is not at all clear (at least to me)
what is allowed, or how much leeway exists"): what is allowed, and
whether e.g. escaping is necessary, depends on the specific XML
application. Some applications rely on the escaping rules described
in section 3.1 of RFC 3987.
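
As a rough illustration of the percent-encoding step in section 3.1,
here is a minimal Python sketch. It only covers the step that converts
each non-ASCII character to UTF-8 and percent-encodes the resulting
bytes; it assumes the input is already a Unicode string and leaves out
normalization and any special handling of internationalized domain
names. The function name iri_to_uri and the example.org address are my
own, chosen for illustration.

```python
def iri_to_uri(iri: str) -> str:
    """Percent-encode the non-ASCII characters of an IRI (simplified sketch).

    Assumption: every character above U+007F is encoded; ASCII characters,
    including existing %XX escapes, are passed through unchanged.
    """
    out = []
    for ch in iri:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            # One %XX triplet per byte of the character's UTF-8 encoding.
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
    return "".join(out)
```

Under these assumptions, iri_to_uri("http://example.org/αβγ") gives
http://example.org/%CE%B1%CE%B2%CE%B3, which is the "ASCII-only" form
Jon asked about, while the IRI itself can stay in the document's own
encoding.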

Hope that helps. Best,

Felix
