NameChar (was: Editing text)

  • From: David Megginson <ak117@f...>
  • To: xml-dev@i...
  • Date: Fri, 28 Nov 1997 07:30:19 -0500

Peter Murray-Rust writes:

 > I am writing an editor for JUMBO where I expect most of the characters like
 > '"<>& to have been converted into entities (e.g. &apos;, etc.). [I do not
 > expect any raw <![CDATA[ sections in the text - they will have been
 > transformed by the parser.] On the other hand there may be other entities
 > which have not been expanded (e.g. &foo;).
 > My understanding of the spec [71] is that an entity is a Name and that Names
 > [4], [5] and [6] are constructed from letters, digits and numbers. In
 > determining whether something is an entity, I have to look for a string of
 > the form: '&'(Letter | '_' | ':') (NameChar)* ';'
 > NameChars are Digits, MiscNames and Letters.
 > Appendix B lists six and a half pages of potential NameChars for which
 > JUMBO has to test - is this correct? If so I have code of the form:
 > public boolean isNameChar(char ch) {
 >     return <six pages of conditionals>;
 > }
 > I assume there is no short cut...

I have not checked it against the XML character classes for alignment,
but there is a good chance that you could use Java's built-in
java.lang.Character.isLetterOrDigit() predicate to eliminate most of
those conditionals, something like this:

  public boolean isNameChar (char ch) {
    return java.lang.Character.isLetterOrDigit(ch) || isMiscChar(ch);
  }

  public boolean isMiscChar (char ch) {
    switch(ch) {
    case '.':
    case '-':
    case '_':
    case ':':
      return true;
    default:
      return isCombining(ch) || isIgnorable(ch) || isExtender(ch);
    }
  }

  public boolean isIgnorable (char ch) {
    int c = (int)ch;
    return ((c >= 0x200c && c <= 0x200f) ||
            (c >= 0x202a && c <= 0x202e) ||
            (c >= 0x206a && c <= 0x206f));
  }

  public boolean isExtender (char ch) {
    int c = (int)ch;
    switch (c) {
    case 0x00b7:
    case 0x02d0:
    case 0x02d1:
    case 0x0387:
    case 0x0640:
    case 0x0e46:
    case 0x0ec6:
    case 0x3005:
      return true;
    default:
      return ((c >= 0x3031 && c <= 0x3035) ||
              (c >= 0x309b && c <= 0x309e) ||
              (c >= 0x30fc && c <= 0x30fe));
    }
  }

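  public boolean isCombining (char ch) {
    // lots of stuff: the combining-character ranges still have to
    // be filled in here
    return false; // placeholder only, so that the code compiles
  }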

The only long one left is isCombining(), which I haven't bothered to
fill in.  Before anyone uses these, please check them against both the
XML spec and the Java Language Spec, to see if isLetterOrDigit()
really aligns properly.
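
As a rough illustration of how such predicates could sit underneath the
'&' Name ';' test that Peter describes, here is a minimal sketch. Everything
in it is an assumption made for the example: the class and method names are
invented, the stand-in for isCombining() simply accepts the Unicode mark
categories reported by the Java runtime (which is not the XML spec's own
list), and the ignorable and extender ranges are left out for brevity. It has
not been checked against the spec tables.

```java
// Hypothetical sketch, not from the original post. All names are made up.
final class EntityRefSketch {

    // Stand-in for isCombining(): accepts the Unicode mark categories as
    // reported by the Java runtime, which is NOT the XML spec's own list.
    static boolean isCombiningApprox(char ch) {
        switch (Character.getType(ch)) {
        case Character.NON_SPACING_MARK:        // Mn
        case Character.COMBINING_SPACING_MARK:  // Mc
        case Character.ENCLOSING_MARK:          // Me
            return true;
        default:
            return false;
        }
    }

    // Name-start test from the quoted pattern: Letter | '_' | ':'.
    static boolean isNameStart(char ch) {
        return Character.isLetter(ch) || ch == '_' || ch == ':';
    }

    // Approximate NameChar test. The ignorable and extender ranges from
    // the post are left out to keep the sketch short.
    static boolean isNameCharApprox(char ch) {
        return Character.isLetterOrDigit(ch)
            || ch == '.' || ch == '-' || ch == '_' || ch == ':'
            || isCombiningApprox(ch);
    }

    // If an entity reference of the form '&' Name ';' starts at pos,
    // returns the index just past the ';'. Otherwise returns -1.
    static int scanEntityRef(String text, int pos) {
        int n = text.length();
        if (pos < 0 || pos + 2 >= n) return -1;       // shortest form is "&x;"
        if (text.charAt(pos) != '&') return -1;
        if (!isNameStart(text.charAt(pos + 1))) return -1;
        int i = pos + 2;
        while (i < n && isNameCharApprox(text.charAt(i))) i++;
        return (i < n && text.charAt(i) == ';') ? i + 1 : -1;
    }

    public static void main(String[] args) {
        System.out.println(scanEntityRef("a &foo; b", 2)); // 7
        System.out.println(scanEntityRef("&1foo;", 0));    // -1
        System.out.println(scanEntityRef("&foo bar;", 0)); // -1
    }
}
```

Returning the end index rather than a boolean lets an editor skip straight
past a recognised reference while scanning a longer run of text.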

 > I applaud the work of the WG on the Internationalisation and I don't want
 > to detract from it. What I would suggest is that because of the extreme
 > likelihood of error if individuals do try to hack their own isNameChar(),
 > and because if ever this list is revised software will be invalidated, that
 > the WG, or W3C or whoever, maintain an isNameChar() routine in the common
 > languages (C, C++, Java) so that we know we shall all be working with the
 > same one.

Not a bad idea, but it is unlikely that everyone would want to use the
same one.  The fastest solution would be to maintain a static 65,536
(or at least 32,768) entry array, with bit flags for different
character properties.  That would be fine for big programs, but it
would kill Java applets and other size-sensitive applications unless
it were already built into the Java environment.
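
The table idea above might look roughly like the following sketch: one byte
of bit flags per char value, filled in once and then consulted with a single
array lookup. The class name, flag names, and the choice to populate the
table from Java's own Character predicates are all assumptions made for
illustration; a real table would be generated from the spec's character
classes. Only a handful of the extender code points from isExtender() are
copied in, to keep the example short.

```java
// Hypothetical sketch of the table-lookup idea; all names are made up.
final class CharFlagTable {
    static final int LETTER   = 1;
    static final int DIGIT    = 2;
    static final int MISC     = 4;  // '.', '-', '_', ':'
    static final int EXTENDER = 8;

    // One byte of flags for each of the 65,536 possible char values.
    private static final byte[] FLAGS = new byte[65536];

    static {
        for (int c = 0; c < FLAGS.length; c++) {
            int f = 0;
            // Assumption: Java's predicates stand in for the spec's tables.
            if (Character.isLetter((char) c)) f |= LETTER;
            if (Character.isDigit((char) c))  f |= DIGIT;
            FLAGS[c] = (byte) f;
        }
        for (char c : new char[] {'.', '-', '_', ':'}) {
            FLAGS[c] |= MISC;
        }
        // A few extender code points copied from isExtender() in the post.
        for (int c : new int[] {0x00b7, 0x02d0, 0x02d1, 0x0387}) {
            FLAGS[c] |= EXTENDER;
        }
    }

    // True if ch carries at least one of the properties in mask.
    static boolean has(char ch, int mask) {
        return (FLAGS[ch] & mask) != 0;
    }

    // NameChar test as a single lookup against the combined mask.
    static boolean isNameChar(char ch) {
        return has(ch, LETTER | DIGIT | MISC | EXTENDER);
    }

    public static void main(String[] args) {
        System.out.println(isNameChar('a'));      // true
        System.out.println(has('7', DIGIT));      // true
        System.out.println(isNameChar('!'));      // false
    }
}
```

The array costs 64 KB at run time however it is filled, which is the size
concern raised above for applets.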

All the best,


David Megginson                 ak117@f...
Microstar Software Ltd.         dmeggins@m...

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@i...
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To (un)subscribe, mailto:majordomo@i... the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo@i... the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa@i...)

