Re: Allowed characters for NCName

  • From: Rick Jelliffe <rjelliffe@a...>
  • To: Desmond Kirrane <desmond.kirrane@g...>
  • Date: Fri, 14 Dec 2007 01:27:27 +0000

XML allows "native language markup", in which element and attribute names can, by and large, use the typical graphical characters used for words in any native language. So Chinese documents can have element names written entirely in ideographic characters, and so on. (The only proviso is that spaces and apostrophes are not possible.)
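
As a rough illustration of native language markup, here is a small Python sketch that feeds a few names to the expat-based parser in the standard library. The element and attribute names are invented for the example, and the helper `is_well_formed` is hypothetical.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Invented names: an ideographic element name and a Turkish attribute name.
native_names = '<文書 kayıt="1"/>'
# A space ends the element name, leaving "name" as an attribute with no value.
with_space = '<my name/>'
# An apostrophe is not a name character.
with_apostrophe = "<it's/>"

for doc in (native_names, with_space, with_apostrophe):
    print(doc, is_well_formed(doc))
```

Only the first document should be accepted; the other two fail on the space and the apostrophe.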

This idea was adopted into XML following the ERCS principles of the CJK DOCP group, an ISO-liaison expert group made up of standards people, industry and academics from East Asia, in the mid-1990s. (I wrote it.) After XML, the principles have been consolidated by W3C and Unicode in a joint technical report concerning characters suitable for use in markup.

So Turkish dotless i is certainly allowed as an XML name character. (In fact, it is also the main reason why XML is case-sensitive, IIRC: it means there is no nation-neutral case-mapping strategy for A-Z.)
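
The case-mapping problem can be seen with Python's default, locale-independent string methods. This is only a sketch of the general issue, not of how any XML processor handles names; the helper `round_trips` is hypothetical.

```python
dotless_i = "\u0131"      # ı LATIN SMALL LETTER DOTLESS I
dotted_cap_i = "\u0130"   # İ LATIN CAPITAL LETTER I WITH DOT ABOVE

def round_trips(ch):
    """True if upper-casing and then lower-casing gives back the same string."""
    return ch.upper().lower() == ch

# Under the default mapping, dotless i upper-cases to plain "I", which
# lower-cases to dotted "i", so the Turkish letter is lost on the way back.
print(repr(dotless_i.upper()))          # 'I'
print(round_trips("i"))                 # True
print(round_trips(dotless_i))           # False

# Capital dotted I lower-cases to two code points: "i" plus a combining dot.
print(len(dotted_cap_i.lower()))        # 2
```

A Turkish-aware mapping would pair I with ı and İ with i instead, so the "right" answer for A-Z depends on the language of the text.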

One possible reason there may be a complaint about that character is that you are using the wrong encoding declaration. Your document should be using UTF-8, or 8859-9 (or 8859-3, or CP1254, etc.). Many character sets do not have enough redundant code points to allow incorrect labeling to be detected (for example, between the 8859-n character sets). So the strict naming rules of XML 1.0 serve as a backup to detect when code comes through that is not allowed as part of a name: it is a sign that there has been a bug or data corruption, and it prevents further infection.
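
A minimal Python sketch of both halves of that argument follows. It assumes the standard library's expat-based parser; the element name `kayıt` is invented for the example and `parses` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

def parses(doc_bytes):
    """Return True if the parser accepts the byte string as XML."""
    try:
        ET.fromstring(doc_bytes)
        return True
    except ET.ParseError:
        return False

# The same byte means different letters in two 8859-n character sets.
raw = bytes([0xFD])
as_turkish = raw.decode("iso-8859-9")   # dotless i
as_latin1 = raw.decode("iso-8859-1")    # y with acute accent

# Both are legal name characters, so this mislabeling goes undetected:
# an 8859-9 document labeled as 8859-1 parses, but with the wrong name.
undetected = b'<?xml version="1.0" encoding="ISO-8859-1"?><kay\xfdt/>'
print(ET.fromstring(undetected).tag)

# UTF-8 bytes labeled as 8859-1 are caught: the second byte of the dotless i
# (0xB1) reads as a plus-minus sign, which is not allowed in a name.
name_utf8 = "kay\u0131t".encode("utf-8")
correct = b'<?xml version="1.0" encoding="UTF-8"?><' + name_utf8 + b'/>'
mislabeled = b'<?xml version="1.0" encoding="ISO-8859-1"?><' + name_utf8 + b'/>'
print(parses(correct), parses(mislabeled))
```

The first case shows why mislabeling between the 8859-n sets can slip through; the second shows the naming rules acting as the backup check.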

When looking at character encoding, the golden rule is USE A HEX EDITOR. Don't open a file in some vanilla text editor unless you are really clear what encodings it reads, how it handles fonts, and what input mappings it may perform.
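
If no hex editor is to hand, a few lines of Python give a similar view of the raw bytes. This is a sketch only: `hex_view` is a hypothetical helper, and the sample attribute text is invented.

```python
def hex_view(data, width=16):
    """Format bytes as lines of offset, hex values and printable ASCII."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Show printable ASCII as-is and everything else as a dot.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part.ljust(width * 3)} {text_part}")
    return lines

# Invented sample: an attribute whose value contains a dotless i.
sample = 'id="kay\u0131t"'

for encoding in ("utf-8", "iso-8859-9"):
    print(encoding)
    print("\n".join(hex_view(sample.encode(encoding))))
```

In UTF-8 the dotless i shows up as the two bytes c4 b1; in 8859-9 it is the single byte fd. For a real file, read it in binary mode (`open(path, "rb")`) so that no decoding happens before you look at it.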

Cheers
Rick Jelliffe


Desmond Kirrane wrote:
Hi,

In my XML docs I have an attribute of type xs:NCName.

When validating the XML against a schema, the Turkish lower case i character: ı is not allowed in the attribute.

From the XML Schema recommendation here http://www.w3.org/TR/xmlschema-2/#NCName
I know that:

NCName          ::=     (Letter | '_') (NCNameChar)*
NCNameChar     ::=     Letter | Digit | '.' | '-' | '_' | CombiningChar | Extender

My questions are:
1. Is Letter = any letter in the English Alphabet (of any case)?
2. What are the CombiningChar(s)?
3. What are the Extender(s)?
4. Obviously Digit = numbers (0-9).


