
RE: BOM and encodings questions

  • From: "Shlomo Yona" <S.Yona@F...>
  • To: "Philippe Poulard" <Philippe.Poulard@s...>
  • Date: Thu, 8 Mar 2007 09:41:19 -0800

Hello,

Why is there a contradiction between a BOM and the UTF-8 encoding in the same XML document? Appendix E.1 of the XML 1.1 standard explains how to "guess" the encoding from the BOM.
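
To make the "guessing" concrete, here is a minimal sketch of BOM-based detection in Python. It covers only the BOM rows of the autodetection table (not the rows for documents without a BOM), and the function name and the returned labels are my own choices, not anything defined by the standard.

```python
import codecs
from typing import Optional

# BOM signatures, longest first, so that UTF-32 LE (FF FE 00 00)
# is not mistaken for UTF-16 LE (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),  # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "UTF-32LE"),  # FF FE 00 00
    (codecs.BOM_UTF8, "UTF-8"),         # EF BB BF
    (codecs.BOM_UTF16_BE, "UTF-16BE"),  # FE FF
    (codecs.BOM_UTF16_LE, "UTF-16LE"),  # FF FE
]

def guess_encoding_from_bom(data: bytes) -> Optional[str]:
    """Return a label for the BOM at the start of data, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None

print(guess_encoding_from_bom(b"\xff\xfe<\x00?\x00"))
```

A result of None only means "no BOM"; a parser would then fall back on the byte pattern of "<?xml" and on the encoding declaration.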

I also didn't find any case other than external entities, but I can understand how someone would create an XML document in encoding X in which the data of some element <foo> is in encoding Y, because it is an excerpt from a text file in some other encoding. It is fairly easy to implement a parser that can handle alternating encodings and so support such cases, but I couldn't find this mentioned anywhere in the standard(s). I get to see a lot of XML documents that contain alternating encodings. Are they not well-formed? If so, then well-formedness is probably very much misunderstood when it comes to character encodings... in my opinion.
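
As a quick check of how one common parser treats such a document, here is a minimal sketch using Python's standard xml.etree.ElementTree (which is built on expat). The sample documents and the helper name is_well_formed are made up for illustration, and other parsers may report the problem differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: declared as UTF-8, but the content of <foo>
# ends with the ISO-8859-1 byte for "e acute" (0xE9), which is not a
# valid UTF-8 sequence in this position.
mixed = b'<?xml version="1.0" encoding="UTF-8"?><foo>caf\xe9</foo>'

# The same document with the content really encoded as UTF-8.
consistent = '<?xml version="1.0" encoding="UTF-8"?><foo>caf\u00e9</foo>'.encode("utf-8")

def is_well_formed(data: bytes) -> bool:
    """Return True if ElementTree can parse data, False on a parse error."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False

print(is_well_formed(mixed), is_well_formed(consistent))
```

The parser rejects the first document at the stray byte and accepts the second.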

Shlomo.


-----Original Message-----
From: Philippe Poulard [mailto:Philippe.Poulard@s...] 
Sent: Thu 08 Mar 2007 19:22
To: Shlomo Yona
Cc: xml-dev@l...
Subject: Re:  BOM and encodings questions

Shlomo Yona wrote:
> .1.
> 
> If an XML document starts with the FF FE BOM (UTF-16, little endian) but 
> the encoding is set to “UTF-8” in the prolog, what is the expected 
> behavior of the Parser?
> 
> I think that the parser should respect the BOM, read the prolog assuming 
> it is encoded in UTF-16 little endian and then process the rest of 
> the XML document in UTF-8 as the prolog says.
> 
> Is this correct?

I'm not sure, but an FF FE BOM can't be used with UTF-8, so the parser should 
fail to decode the prolog: it expects UTF-16 encoded characters, so the 
6 bytes of a UTF-8 "<?xml " would be interpreted as 3 characters.
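
The misreading can be reproduced with a small sketch in Python (the variable names are only for illustration): the 6 bytes of a UTF-8 "<?xml " are decoded as UTF-16 little endian, the encoding announced by the FF FE BOM.

```python
# The start of the prolog as a UTF-8 (here pure ASCII) byte sequence.
utf8_bytes = "<?xml ".encode("utf-8")

# Decode the same bytes as UTF-16 little endian.
# Each pair of bytes becomes one 16-bit code unit.
misread = utf8_bytes.decode("utf-16-le")

print(len(utf8_bytes), len(misread))
print([hex(ord(ch)) for ch in misread])
```

None of the three resulting characters is "<", so the parser never sees the start of an XML declaration.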

> 
> .2.
> 
> Is an XML parser expected to process a document in alternating 
> encodings? I mean, is there a way to signal the parser that from a 
> certain point on the encoding changes to some other encoding? If so, how?

The only case I know of is external entities: each can have its own 
encoding, which may differ from the document's.

> 
> .3.
> 
> Is there a way to express the expected encoding of the XML document in 
> the XML Schema? If so, how?

Too late: XML Schema works at the logical level, after the parser has already decoded the document.

I don't know why you want to force an incoming document to use a 
given encoding; let the parser do the job and fail normally if the 
encoding is not supported.

However, a SAX parser can supply information about the encoding of a 
document, so you can write a filter like this:

if encoding != THE_ENCODING
then fail_for_an_obscure_reason()
endif
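
The pseudocode above could be sketched in Python roughly as follows. This is an illustration under assumptions, not a SAX filter: it uses the expat binding in the standard library, whose XmlDeclHandler reports the encoding named in the XML declaration. THE_ENCODING, UnexpectedEncoding, declared_encoding and require_encoding are hypothetical names, and treating a missing declaration as UTF-8 is a simplification (a document with a UTF-16 BOM may omit the declaration).

```python
import xml.parsers.expat
from typing import Optional

THE_ENCODING = "UTF-8"  # hypothetical required encoding

class UnexpectedEncoding(ValueError):
    """Raised when the declared encoding is not the required one."""

def declared_encoding(data: bytes) -> Optional[str]:
    """Return the encoding named in the XML declaration, or None if absent."""
    found = []
    parser = xml.parsers.expat.ParserCreate()
    # expat calls this handler with (version, encoding, standalone).
    parser.XmlDeclHandler = lambda version, encoding, standalone: found.append(encoding)
    parser.Parse(data, True)
    return found[0] if found else None

def require_encoding(data: bytes, required: str = THE_ENCODING) -> None:
    """Raise UnexpectedEncoding unless data declares the required encoding."""
    # Simplifying assumption: no declared encoding means UTF-8.
    declared = declared_encoding(data) or "UTF-8"
    if declared.upper() != required.upper():
        raise UnexpectedEncoding(f"declared {declared!r}, expected {required!r}")

require_encoding(b'<?xml version="1.0" encoding="utf-8"?><a/>')  # passes silently
```

Note that this checks only what the document declares; whether the bytes really are in that encoding is still decided by the parser when it decodes them.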

-- 
Cordialement,

               ///
              (. .)
  --------ooO--(_)--Ooo--------
|      Philippe Poulard       |
  -----------------------------
  http://reflex.gforge.inria.fr/
        Have the RefleX !

