
Re: An XML document is not well-formed if encoding="..." does

  • From: Hermann Stamm-Wilbrandt <STAMMW@de.ibm.com>
  • To: "Costello, Roger L." <costello@mitre.org>
  • Date: Sat, 29 Dec 2012 03:13:21 +0100

Roger,

Running the modified file through an identity transform produces the
error you were looking for; see below. The reason is that 0x70 is not a
valid second byte of a multi-byte UTF-8 sequence: continuation bytes
have the form "10xxxxxx", whereas 0x70 is "01110000".
http://en.wikipedia.org/wiki/Utf-8#Description
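
As a rough illustration of that bit-pattern rule, here is a small Python
sketch. It is not part of the original experiment, and the helper names
are made up for this example:

```python
def is_continuation_byte(byte: int) -> bool:
    """True if the byte has the UTF-8 continuation form 10xxxxxx."""
    return (byte & 0b11000000) == 0b10000000


def sequence_length(lead: int) -> int:
    """Sequence length announced by a lead byte, judged by bit pattern only.

    Returns 0 for a continuation byte or an unknown pattern. This sketch
    does not reject overlong or out-of-range lead bytes.
    """
    if lead & 0b10000000 == 0:
        return 1  # 0xxxxxxx: ASCII
    if lead & 0b11100000 == 0b11000000:
        return 2  # 110xxxxx
    if lead & 0b11110000 == 0b11100000:
        return 3  # 1110xxxx
    if lead & 0b11111000 == 0b11110000:
        return 4  # 11110xxx
    return 0


# Bytes of the name as stored in the ISO-8859-1 file: 4c f3 70 65 7a
data = b"L\xf3pez"

# 0xF3 = 11110011 announces a four-byte sequence ...
print(sequence_length(0xF3))
# ... but the next byte, 0x70 = 01110000, is not of the form 10xxxxxx.
print(is_continuation_byte(0x70))

try:
    data.decode("utf-8")
except UnicodeDecodeError as err:
    print("decode failed:", err.reason)
```

Python's own UTF-8 decoder rejects the byte sequence for the same reason
the parsers below do.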

But there is no guarantee that the failure happens.
Take for example the two-character sequence "Ã¤": it is "C3 A4" when
encoded in ISO-8859-1. If you now repeat your "utf-8" encoding
modification experiment, these two bytes will be interpreted as the
valid two-byte UTF-8 encoding of the single character "ä".
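
A minimal Python sketch of that silent misinterpretation (an
illustration only, the variable names are arbitrary):

```python
# The two characters U+00C3 ("A" with tilde) and U+00A4 (currency sign),
# stored as ISO-8859-1: one byte per character.
raw = "\u00c3\u00a4".encode("iso-8859-1")
print(raw.hex())  # c3a4

# The same two bytes read as UTF-8 form one valid two-byte sequence,
# so no decoder or parser has any reason to complain.
as_utf8 = raw.decode("utf-8")
print(len(as_utf8), hex(ord(as_utf8)))  # 1 0xe4  (U+00E4, "a" with diaeresis)
```

The bytes are identical in both readings; only the declared encoding
decides whether they mean two characters or one.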


$ od -Ax -tcx1 Lopez.modified.xml
000000   <   ?   x   m   l       v   e   r   s   i   o   n   =   "   1
        3c  3f  78  6d  6c  20  76  65  72  73  69  6f  6e  3d  22  31
000010   .   0   "       e   n   c   o   d   i   n   g   =   "   u   t
        2e  30  22  20  65  6e  63  6f  64  69  6e  67  3d  22  75  74
000020   f   -   8   "                       ?   >  \n   <   N   a   m
        66  2d  38  22  20  20  20  20  20  3f  3e  0a  3c  4e  61  6d
000030   e   >   L 363   p   e   z   <   /   N   a   m   e   >  \n
        65  3e  4c  f3  70  65  7a  3c  2f  4e  61  6d  65  3e  0a
00003f
$


$ xsltproc identity.xsl Lopez.modified.xml
Lopez.modified.xml:2: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xF3 0x70 0x65 0x7A
<Name>L�pez</Name>
       ^
unable to parse Lopez.modified.xml
$
$ saxon-6.5.5 Lopez.modified.xml identity.xsl
Error at byte 10 of file:/home/stammw/Lopez/Lopez.modified.xml:
  Error reported by XML parser: bad continuation of multi-byte UTF-8 sequence (code: 0x70)
Transformation failed: Run-time errors were reported
$
$ xalan identity.xsl -IN Lopez.modified.xml

(Location of error unknown)XSLT Error (javax.xml.transform.TransformerException): com.ibm.xtq.common.utils.WrappedRuntimeException: An invalid XML character (Unicode: 0xffffffff) was found in the element content of the document.
Exception in thread "main" java.lang.RuntimeException: com.ibm.xtq.common.utils.WrappedRuntimeException: An invalid XML character (Unicode: 0xffffffff) was found in the element content of the document.
	at org.apache.xalan.xslt.Process.doExit(Unknown Source)
	at org.apache.xalan.xslt.Process.main(Unknown Source)
$
$ cat identity.xsl
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>
  <xsl:output method="xml"/>

  <xsl:template match="/">
    <xsl:copy-of select="."/>
  </xsl:template>

</xsl:stylesheet>
$
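
For readers without an XSLT processor at hand, both cases can be
approximated with Python's standard-library parser. This is a sketch
under assumptions: it uses expat via xml.etree.ElementTree, which is
none of the three processors shown above, so the error text differs,
and the check() helper is hypothetical.

```python
import xml.etree.ElementTree as ET


def check(declared: str, body: bytes):
    """Parse <Name>body</Name> with the given declared encoding.

    Returns (True, text) if the document is well-formed,
    or (False, error message) if the parser rejects it.
    """
    doc = (
        b'<?xml version="1.0" encoding="' + declared.encode("ascii") + b'"?>\n'
        b"<Name>" + body + b"</Name>\n"
    )
    try:
        return True, ET.fromstring(doc).text
    except ET.ParseError as err:
        return False, str(err)


# Original file: ISO-8859-1 bytes, ISO-8859-1 declared. Well-formed.
print(check("iso-8859-1", b"L\xf3pez"))

# Modified file: same bytes, "utf-8" declared. Rejected.
print(check("utf-8", b"L\xf3pez"))

# Bytes C3 A4: accepted under either declaration, with different meaning.
print(check("iso-8859-1", b"\xc3\xa4"))
print(check("utf-8", b"\xc3\xa4"))
```

The last two calls show why a mismatch between declared and actual
encoding cannot always be detected by the parser.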


Best wishes,

Hermann Stamm-Wilbrandt
Level 3 support for XML Compiler team and Fixpack team lead
WebSphere DataPower SOA Appliances
https://www.ibm.com/developerworks/mydeveloperworks/blogs/HermannSW/
https://twitter.com/HermannSW/
----------------------------------------------------------------------
IBM Deutschland Research & Development GmbH
Chair of the Supervisory Board: Martina Koederitz
Management: Dirk Wittkopp
Registered office: Boeblingen
Court of registration: Amtsgericht Stuttgart, HRB 243294


From:    "Costello, Roger L." <costello@mitre.org>
To:      "xml-dev@lists.xml.org" <xml-dev@lists.xml.org>
Date:    12/28/2012 09:39 PM
Subject: An XML document is not well-formed if encoding="..." does not
         match the actual encoding of the characters in the document,
         right?





Thanks, Chris, for pointing us to that article: "XML on the Web has Failed"

I am making my way through it.

This statement in the article piqued my interest:

    ... determining the actual character encoding of an
    XML document is a prerequisite for determining its
    well-formedness ...

I decided to do an experiment.

I created this XML document, encoded every character in it using the
iso-8859-1 encoding, and declared iso-8859-1 in the encoding="..."
of the XML declaration:

<?xml version="1.0" encoding="iso-8859-1"?>
<Name>López</Name>

I checked the document for well-formedness and the XML parser said it is
well-formed.

Good.

Then I changed encoding="iso-8859-1" to encoding="utf-8":

<?xml version="1.0" encoding="utf-8"?>
<Name>López</Name>

I checked it for well-formedness and the parser said it is still
well-formed.

Huh?

Shouldn't I have gotten a well-formedness error?

I did my experiment using the latest version of Oxygen XML. I think that it
uses the Xerces XML Parser, right?

Is this a bug in Xerces?

/Roger






