Storing illegal XML 1.0 characters in the Unicode Private Use Area

  • From: "Costello, Roger L." <costello@mitre.org>
  • To: "xml-dev@lists.xml.org" <xml-dev@lists.xml.org>
  • Date: Wed, 31 Oct 2012 18:04:33 +0000

Hi Folks,

Here are the hex values for the Unicode characters that are permitted in XML 1.0 documents:

Char ::=  #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

Notice that the hex values from E000 to F8FF are legal XML characters.

Interestingly, the code points from E000 to F8FF have no characters assigned to them by the Unicode Standard. That region is called the Private Use Area (PUA), and its meaning is left to private agreement between applications.

Also notice that the hex values 0-8, B-C, and E-1F are not legal XML 1.0 characters.
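
As a quick check of the ranges above, here is a minimal Python sketch that tests a code point against the Char production and lists the code points below hex 20 that it excludes. The names is_xml10_char and illegal_c0 are my own, chosen only for illustration.

```python
def is_xml10_char(cp: int) -> bool:
    """Return True if code point cp matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )

# Code points below hex 20 that the Char production excludes.
illegal_c0 = [cp for cp in range(0x20) if not is_xml10_char(cp)]
print([f"{cp:X}" for cp in illegal_c0])
```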

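Suppose you are dealing with an application that emits text and some of the text contains characters that are illegal in XML 1.0. If you were to blindly wrap that text in markup and hand it to an XML parser, the parser would give an error saying that the document contains illegal characters. Writing them as character references (for example, &#x2;) does not help, because in XML 1.0 a character reference must also refer to a legal character.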

So what do you do?

One approach is to move any illegal characters into the Private Use Area: for each illegal character add hex E000. Thus,

    map hex 0 to E000
    map hex 1 to E001
    map hex 2 to E002
    map hex 3 to E003
    ...
    map hex 1F to E01F

So this text (where 2 denotes the control character hex 2 and 3 denotes the control character hex 3):

    2Hello World3

is converted to this XML:

   <text>&#xE002;Hello World&#xE003;</text>
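
The conversion above could be sketched in Python roughly as follows. This is only an illustration under my own assumptions: PUA_OFFSET, is_illegal_control, and to_xml_text are hypothetical names, only the control characters below hex 20 are handled, and the usual markup characters are escaped with the standard library's xml.sax.saxutils.escape.

```python
from xml.sax.saxutils import escape

PUA_OFFSET = 0xE000  # offset added to each illegal control character

def is_illegal_control(cp: int) -> bool:
    """True for code points below hex 20 other than tab, LF, and CR."""
    return cp < 0x20 and cp not in (0x9, 0xA, 0xD)

def to_xml_text(text: str) -> str:
    """Emit illegal control characters as PUA character references."""
    out = []
    for ch in text:
        cp = ord(ch)
        if is_illegal_control(cp):
            out.append(f"&#x{cp + PUA_OFFSET:X};")
        else:
            out.append(escape(ch))  # escapes &, <, and >
    return "".join(out)

xml = f"<text>{to_xml_text(chr(2) + 'Hello World' + chr(3))}</text>"
print(xml)  # <text>&#xE002;Hello World&#xE003;</text>
```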

Applications that process the XML document must be smart enough to subtract E000 from every character reference that falls in the mapped part of the Private Use Area (E000 to E01F).
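
On the receiving side, the reverse step might look like the sketch below. It assumes the document is parsed with Python's xml.etree.ElementTree, which resolves the character references, and that only E000 to E01F is used for the mapping. from_pua, MAPPED_FIRST, and MAPPED_LAST are names I made up for the example.

```python
import xml.etree.ElementTree as ET

PUA_OFFSET = 0xE000
MAPPED_FIRST, MAPPED_LAST = 0xE000, 0xE01F  # assumed mapped range

def from_pua(text: str) -> str:
    """Subtract the offset from characters in the mapped PUA range."""
    return "".join(
        chr(ord(ch) - PUA_OFFSET)
        if MAPPED_FIRST <= ord(ch) <= MAPPED_LAST
        else ch
        for ch in text
    )

# The parser resolves &#xE002; and &#xE003; to PUA characters.
elem = ET.fromstring("<text>&#xE002;Hello World&#xE003;</text>")
decoded = from_pua(elem.text)
print(repr(decoded))  # '\x02Hello World\x03'
```

One limitation to keep in mind: if the original text already contains a real Private Use Area character in the range E000 to E01F, this reverse step cannot tell it apart from a mapped control character.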

Interestingly, the Microsoft Visio application uses the approach described above [1].

    Any other ASCII control character between 
    ASCII 0 and ASCII 31 (excluding ASCII 9, 10, 
    and 13) is considered an illegal Unicode 
    character by some XML parsers. As a result, 
    these characters are translated into special 
    character values in the Unicode Private Use 
    Area. The Private Use Area begins at 0xE000. 
    ASCII control characters are offset by the 
    value 0xE000 when emitted to XML for Visio. 
    Therefore, if a Visio shape's text contained 
    the character ASCII 11 (Hex 0x0B), it is 
    emitted as 0xE00B.

/Roger

[1] http://msdn.microsoft.com/en-us/library/office/aa218415%28v=office.10%29.aspx

