
  • From: Lars Marius Garshol <larsga@g...>
  • To: "'xml-dev@x...'" <xml-dev@x...>
  • Date: Mon, 17 Jul 2000 14:16:52 +0200


* Tim Crook
|
| I was looking around to see if there might have been a particular
| reason why expat was implemented such that no leading white space is
| allowed before the standard <?xml version="1.0" ?> line. 

The reason is that the XML recommendation requires it. :-)
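
A quick way to see this for yourself is a small sketch using Python's
bundled expat binding (xml.parsers.expat). The helper name `parses` is
made up for this illustration and is not part of expat:

```python
import xml.parsers.expat


def parses(data: bytes) -> bool:
    """Return True if expat accepts `data` as a complete document."""
    # An expat parser object can only be used for one document,
    # so a fresh one is created on every call.
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(data, True)  # True marks this as the final chunk
        return True
    except xml.parsers.expat.ExpatError:
        return False


# Declaration at the very start of the entity: accepted.
print(parses(b'<?xml version="1.0"?><doc/>'))
# A single leading space before the declaration: rejected.
print(parses(b' <?xml version="1.0"?><doc/>'))
```

Leading white space is only a problem when the declaration is present;
a document with no XML declaration at all may start with white space.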

| From my understanding of things, the Byte Order Mark is what allows
| an XML parser to determine which character set is in use.

Not really. It allows a parser to determine whether UTF-16 was used,
and if so which variety of UTF-16 (BE or LE). However, if UTF-16 is
not used then the encoding can basically be anything.

| (see Appendix F, Autodetection of Character Encodings in
| http://www.w3.org/TR/REC-xml) If the Byte Order Mark is not found,
| shouldn't the starting content of the data stream be discarded until
| the Byte Order Mark is located?

If the BOM is not at the beginning of the data stream then there most
likely isn't one, for example because iso-8859-1 was used. This is
what makes it so handy that the XML declaration must appear first in
the document if it appears at all.

The rules then become something like:

 a) does the stream begin with a BOM? if yes, assume UTF-16
 b) does the stream begin with an XML declaration (in some encoding
    that the parser is able to figure out)? if yes, see what the
    encoding pseudo-attribute says.
 c) assume UTF-8
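
Roughly, those three rules could be sketched like this in Python. This
is only an illustration under simplifying assumptions, not what expat
actually does: `guess_encoding` is a made-up name, and step b) only
recognizes a declaration written in an ASCII-compatible encoding
(Appendix F also covers EBCDIC, UCS-4 and BOM-less UTF-16, which are
ignored here).

```python
import codecs
import re

# Matches an XML declaration at the very start of the data and captures
# the value of its encoding pseudo-attribute, if there is one.
_DECL_ENCODING = re.compile(
    rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def guess_encoding(data: bytes) -> str:
    """Guess the encoding of an XML byte stream from its first bytes."""
    # a) Does the stream begin with a BOM? If yes, assume UTF-16.
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    # b) Does the stream begin with an XML declaration? If yes, use
    #    whatever the encoding pseudo-attribute says.
    match = _DECL_ENCODING.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # c) Otherwise assume UTF-8.
    return "utf-8"


print(guess_encoding(b'<?xml version="1.0" encoding="iso-8859-1"?><doc/>'))
print(guess_encoding(b'<doc/>'))
```

Note that the pattern is anchored at the first byte, so a declaration
preceded by white space is simply not seen and the guess falls through
to UTF-8.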

--Lars M.

