XML-DEV Mailing List Archive
RE: BOM and encodings questions
Hello,

Why is there a contradiction between a BOM and UTF-8 encoding in the same XML document? Appendix E.1 of the XML 1.1 standard explains how to "guess" the encoding using the BOM.

I also didn't find any case other than external entities, but I can understand how someone might create an XML document in encoding X while the data of some element <foo> is in encoding Y, because it is an excerpt from a text file in some other encoding. It is fairly easy to implement a parser that handles alternating encodings and supports such cases, but I couldn't find this mentioned anywhere in the standard(s).

I get to see a lot of XML documents that contain alternating encodings -- are they not well-formed? If so, then well-formedness is probably very much misunderstood when it comes to character encodings... in my opinion.

Shlomo.

-----Original Message-----
From: Philippe Poulard [mailto:Philippe.Poulard@s...]
Sent: Thursday, March 08, 2007 19:22
To: Shlomo Yona
Cc: xml-dev@l...
Subject: Re: BOM and encodings questions

Shlomo Yona wrote:
> .1.
>
> If an XML document starts with the FF FE BOM (UTF-16, little endian) but
> the encoding is set to UTF-8 in the prolog, what is the expected
> behavior of the Parser?
>
> I think that the parser should respect the BOM, read the prolog assuming
> it is encoded in UTF-16 little endian and then process the remainder of
> the XML document in UTF-8 as the prolog says.
>
> Is this correct?

I'm not sure, but a BOM can't be used with UTF-8, so the parser should fail to decode the prolog, as the expected characters should be UTF-16 encoded: "<?xml " would be interpreted as 3 characters.

> .2.
>
> Is an XML parser expected to process a document in alternating
> encodings? I mean, is there a way to signal the parser that from a
> certain point on the encoding changes to some other encoding? If so, how?

The only case I know of is external entities: each can have its own encoding, which may differ from the document's.

> .3.
>
> Is there a way to express the expected encoding of the XML document in
> the XML Schema? If so, how?

Too late: XML Schema works at the logical level.

I don't know why you are trying to force an incoming document to be encoded in a given encoding; let the parser do the job and fail normally if the encoding is not supported.

However, a SAX parser can supply information about the encoding of a document, so you can write a filter like this:

    if encoding != THE_ENCODING then
        fail_for_an_obscure_reason()
    endif

-- 
Regards,

               ///
              (. .)
 --------ooO--(_)--Ooo--------
|      Philippe Poulard       |
 -----------------------------
 http://reflex.gforge.inria.fr/
       Have the RefleX !
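
The checks discussed in this thread (sniffing a BOM, reading the encoding named in the XML declaration, and rejecting a document whose encoding is not the one you want) can be sketched in Python using only the standard library. This is a rough illustration, not a conforming XML processor: the names `sniff_bom`, `declared_encoding` and `require_encoding` are hypothetical, only the BOM cases are handled (not the BOM-less byte patterns that Appendix E.1 also lists), and treating a BOM/declaration mismatch as an error is an assumption of the sketch.

```python
import codecs
import re

# BOM signatures. UTF-32 comes first because the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# Matches the encoding pseudo-attribute of an XML declaration at the very start.
_DECL = re.compile(r'''^<\?xml\s[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']''')


def sniff_bom(data):
    """Return (codec_name, bom_length), or (None, 0) when there is no BOM."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def declared_encoding(data):
    """Return the encoding named in the XML declaration, or None."""
    bom_name, skip = sniff_bom(data)
    # Without a BOM, assume an ASCII-compatible encoding just to read the declaration.
    head = data[skip:skip + 400].decode(bom_name or "ascii", errors="replace")
    match = _DECL.match(head)
    return match.group(1) if match else None


def _family(name):
    """Normalize an encoding name and drop any byte-order suffix."""
    try:
        name = codecs.lookup(name).name
    except LookupError:
        name = name.lower()
    for suffix in ("-le", "-be"):
        if name.endswith(suffix):
            return name[:-len(suffix)]
    return name


def require_encoding(data, expected):
    """Return the document's encoding, or raise ValueError if it is not `expected`."""
    bom_name, _ = sniff_bom(data)
    declared = declared_encoding(data)
    if bom_name and declared and _family(bom_name) != _family(declared):
        raise ValueError(f"BOM indicates {bom_name} but declaration says {declared}")
    actual = declared or bom_name or "utf-8"  # UTF-8 is the default when nothing is stated
    if _family(actual) != _family(expected):
        raise ValueError(f"document is {actual}, expected {expected}")
    return actual


# The case from question .1.: an FF FE BOM with encoding="UTF-8" in the prolog.
doc = codecs.BOM_UTF16_LE + '<?xml version="1.0" encoding="UTF-8"?><a/>'.encode("utf-16-le")
try:
    require_encoding(doc, "UTF-8")
except ValueError as err:
    print("rejected:", err)
```

Running the check on the raw bytes before handing them to a parser keeps the policy decision (which encodings to accept) separate from parsing itself, which is the spirit of the filter suggested above.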