Character encoding questions
> I was struck by the following sentence in the Microsoft XML White Paper:
>
>    XML supports a range of encodings...subject only to the restriction
>    that an entire document must share the same encoding.
>
> My immediate reaction was that that wasn't correct, although the
> definition of "document" above isn't obvious to me (for example, are
> external entities part of a document?). However, when checking into the
> XML April specification, I got in over my head. I am hoping that someone
> here will help me out of my hole.
>
> If my XML document is a simple Unicode text file then I begin it like
> the following
>
>    a Byte Order Mark
>    <?XML version="1.0" encoding="ISO-10646-UCS-2"?>
>    ...
>
> with the Byte Order Mark being required even though an EncodingDecl is
> used? (I would have said "yes" until I got to Appendix E "Autodetection
> of Character Sets," which worries about detecting UCS-2 when there
> is no Byte Order Mark.) Is the EncodingDecl necessary if the file
> starts with a Byte Order Mark?
>
> Where can I have an EncodingPI? Section 4.3.3 talks about their being
> "at the beginning of a system entity, before any other character data or
> markup" but doesn't define "system entity" (perhaps one that has an
> ExternalID that contains "SYSTEM"?). If my document references an
> external entity, then I believe that the external entity must start
> with an EncodingPI (see Appendix E "Autodetection of Character Sets")
> if it isn't in UTF-8 or start with a Byte Order Mark.
>

In classical SGML this information is contained in the system declaration, where one or more character sets can be declared together with the control characters used to switch between them, using ISO 2022 and related standards. The system declaration is read in before the DTD. However, if I understand the XML proposals correctly, they do not envisage a system declaration. The best sources of information on system declarations are a white paper from OmniMark and an article in TAG by Wayne Wohler.
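The autodetection question in the quoted message can be made concrete with a small sketch. It is only an illustration under stated assumptions: the function name `sniff_xml_encoding` and its return labels are hypothetical, UCS-4 and EBCDIC are left out, and the byte patterns rely only on a Byte Order Mark or on the leading `<?` of the declaration, so the check does not depend on whether the declaration is written `<?XML` or `<?xml`. It is not a transcription of the April draft's Appendix E.

```python
import codecs


def sniff_xml_encoding(head: bytes) -> str:
    """Guess the encoding family of an XML entity from its first bytes.

    Hypothetical helper for illustration; returns a coarse label only.
    """
    # A Byte Order Mark identifies a 16-bit encoding and its byte order.
    if head.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "utf-16-be"
    if head.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "utf-16-le"
    # No BOM: "<?" stored as two 16-bit code units still reveals the order.
    if head.startswith(b"\x00<\x00?"):
        return "utf-16-be"
    if head.startswith(b"<\x00?\x00"):
        return "utf-16-le"
    # "<?" as single bytes: the declaration can be read as ASCII, and its
    # encoding="..." value then names the actual encoding.
    if head.startswith(b"<?"):
        return "ascii-compatible"
    # No BOM and no declaration: fall back to the UTF-8 default.
    return "utf-8"
```

In this sketch a Byte Order Mark alone is enough to pick a 16-bit decoder, and an encoding declaration alone is enough when the entity starts with `<?` in a recognisable form. Whether the specification requires both is a separate question that the sketch does not settle.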
On character sets, you might have a look at my article in CHUM a couple of years ago. I have a preprint in PostScript available by ftp if you want to see it. It does not include the character set tables, for which ISO claims the copyright. With the implementation of Unicode/UCS we no longer need those control-character mechanisms, which are too susceptible to corruption. All the characters you need (or almost all, in my case) are in the new character set.

Another option in classical SGML is to use a subdoc. As far as I can remember, it can contain its own DTD, but I don't think it can have a system declaration. My docs are at the office.

Harry Gaylord
former chair, TEI committee on character sets
member, ISO SC2 and NNI shadow committee

xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To unsubscribe, send to majordomo@i... the following message;
unsubscribe xml-dev
List coordinator, Henry Rzepa (rzepa@i...)