
Re: Storing illegal XML 1.0 characters in the Unicode PrivateU

  • From: Julian Reschke <julian.reschke@gmx.de>
  • To: "Costello, Roger L." <costello@mitre.org>
  • Date: Fri, 02 Nov 2012 15:39:18 +0100

On 2012-10-31 19:04, Costello, Roger L. wrote:
> Hi Folks,
>
> Here are the hex values for the Unicode characters that are permitted in XML 1.0 documents:
>
> Char ::=  #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
>
> Notice that the hex values from E000 to F8FF are legal XML characters.
>
> Interestingly, the hex values from E000 to F8FF have no characters assigned to them. That region is called the Private Use Area (PUA).
>
> Also notice that the hex values 0-8, B-C, and E-1F are not legal XML 1.0 characters.
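
The Char production quoted above can be checked mechanically. Here is a minimal Python sketch, assuming a helper named is_xml10_char (a hypothetical name, not from any library), that tests a code point against the production and lists the code points below hex 20 that it excludes:

```python
def is_xml10_char(cp: int) -> bool:
    """Return True if code point cp matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF        # excludes the surrogate block
        or 0xE000 <= cp <= 0xFFFD      # includes the BMP Private Use Area
        or 0x10000 <= cp <= 0x10FFFF   # supplementary planes
    )


# Control characters below hex 20 that the production does not allow.
illegal_c0 = [cp for cp in range(0x20) if not is_xml10_char(cp)]
print([hex(cp) for cp in illegal_c0])
```
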
>
> Suppose you are dealing with an application that emits text and some of the text contains characters that are illegal in XML 1.0. If you were to blindly wrap that text in markup and hand it to an XML parser, the parser would give an error saying that the document contains illegal characters.
>
> So what do you do?
>
> One approach is to move any illegal characters into the Private Use Area: for each illegal character add hex E000. Thus,
>
>      map hex 0 to E000
>      map hex 1 to E001
>      map hex 2 to E002
>      map hex 3 to E003
>      ...
>      map hex 1F to E01F
>
> So this text (2 denotes hex two, 3 denotes hex three):
>
>      2Hello World3
>
> is converted to this XML:
>
>     <text>&#xE002;Hello World&#xE003;</text>
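
The encoding step might look like the following Python sketch. The names PUA_OFFSET, shift_illegal_controls and to_text_element are hypothetical, and writing the shifted characters as hexadecimal character references is an assumption taken from the example above, not a requirement of the scheme:

```python
from xml.sax.saxutils import escape

PUA_OFFSET = 0xE000  # offset described in the post


def shift_illegal_controls(text: str) -> str:
    """Move control characters that XML 1.0 forbids into the Private Use Area."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x20 and cp not in (0x9, 0xA, 0xD):
            out.append(chr(cp + PUA_OFFSET))
        else:
            out.append(ch)
    return "".join(out)


def to_text_element(text: str) -> str:
    """Wrap text in a <text> element, emitting E000-E01F as character references."""
    shifted = escape(shift_illegal_controls(text))  # escapes &, < and >
    # Caveat: a genuine PUA character in E000-E01F that was already in the
    # input is indistinguishable from a shifted control character.
    body = "".join(
        f"&#x{ord(ch):X};" if 0xE000 <= ord(ch) <= 0xE01F else ch
        for ch in shifted
    )
    return f"<text>{body}</text>"


print(to_text_element("\x02Hello World\x03"))
```
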
>
> Applications that process the XML document must be smart enough to subtract E000 from each character reference that falls in the range E000 to E01F of the Private Use Area.
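
On the consuming side, the parser has already turned each character reference into a character by the time the application sees the text, so the subtraction is applied to characters. A minimal Python sketch, assuming the standard xml.etree.ElementTree parser and a hypothetical helper named unshift_controls:

```python
import xml.etree.ElementTree as ET

PUA_OFFSET = 0xE000  # offset described in the post


def unshift_controls(text: str) -> str:
    """Map characters in E000-E01F back to the control characters 00-1F."""
    # Assumption: the document uses E000-E01F only for shifted control
    # characters; a genuine PUA character in that range would be altered too.
    return "".join(
        chr(ord(ch) - PUA_OFFSET) if 0xE000 <= ord(ch) <= 0xE01F else ch
        for ch in text
    )


doc = "<text>&#xE002;Hello World&#xE003;</text>"
root = ET.fromstring(doc)            # character references are resolved here
original = unshift_controls(root.text)
print(repr(original))
```
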
>
> Interestingly, the Microsoft Visio application uses the approach described above [1].
>
>      Any other ASCII control character between
>      ASCII 0 and ASCII 31 (excluding ASCII 9, 10,
>      and 13) is considered an illegal Unicode
>      character by some XML parsers. As a result,
>      these characters are translated into special
>      character values in the Unicode Private Use
>      Area. The Private Use Area begins at 0xE000.
>      ASCII control characters are offset by the
>      value 0xE000 when emitted to XML for Visio.
>      Therefore, if a Visio shape's text contained
>      the character ASCII 11 (Hex 0x0B), it is
>      emitted as 0xE00B.
>
> /Roger
>
> [1] http://msdn.microsoft.com/en-us/library/office/aa218415%28v=office.10%29.aspx
> ...

Also used in 
<http://www.day.com/specs/jcr/2.0/3_Repository_Model.html#3.2.5.4%20Exposing%20Non-JCR%20Names>.

Best regards, Julian

