Thanks for taking the time to think about it.

Amelia A Lewis wrote:
> On Thu, Mar 20, 2003 at 11:01:26AM -0800, Paul Prescod wrote:
>
> I don't quite understand how this is to work.
>
> The algorithm describing how one can understand the xml declaration *before*
> the encoding is known (decoding both the character encoding and the fact
> that this is XML at the same time) depends upon the magic-ness (as in
> /etc/magic) of the string "<xml", which must appear at position 0, unless it
> is preceded by one of 0xFEFF 0xFFEE.

There is still a "known prefix". It is "<?". My back-of-the-envelope thinking says that this is enough. My logic goes like this. Start here:

http://www.w3.org/TR/REC-xml#sec-guessing

Most of the encodings discover the "base" encoding (ASCII-based, EBCDIC-based, two-byte, four-byte, big or little endian) before they get to the "xm" part of "<?xml". The ones that go the whole distance do so just to use 4-bytes as the rest do. As soon as you see 3C 3F ("<?") you know that you're working with something ASCII based.

That said, I don't claim to know anything about EBCDIC or any really "out-there" encodings. But if I can handle all of the UCS's, UTF's and ASCII-pluses I think I've hit the 95/5 point easily.

> The ability to figure out the encoding is dependent upon the restriction of
> the identifier to a known set. XML parsers are, then, simply *verifying*
> that this is XML as they discover the encoding, not solving for two
> variables at once.

Similarly, XDH processors are verifying that they are dealing with XDH data. XDH's first four bytes are _almost_ as regular as XML's. And, I believe, regular enough.
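To make the back-of-the-envelope logic concrete, here is a rough sketch of guessing the "base" encoding from the first few bytes using only a BOM or the "<?" prefix. It is an illustration, not code from the XML spec or from any XDH implementation: the name guess_family and the family labels are made up for this example, and the unusual UCS-4 byte orders are left out.

```python
import codecs

# Hypothetical sketch: guess the "base" encoding family from the first bytes,
# relying only on a BOM or on the known prefix "<?" (never on "xml").

# The 4-byte BOMs must be tried before the 2-byte ones, because
# FF FE 00 00 also starts with FF FE.
BOMS = [
    (codecs.BOM_UTF32_BE, "four-byte, big-endian"),
    (codecs.BOM_UTF32_LE, "four-byte, little-endian"),
    (codecs.BOM_UTF8, "ASCII-based (UTF-8)"),
    (codecs.BOM_UTF16_BE, "two-byte, big-endian"),
    (codecs.BOM_UTF16_LE, "two-byte, little-endian"),
]

# Without a BOM, look at how "<" (3C) and "?" (3F) are laid out.
NO_BOM = [
    (b"\x00\x00\x00\x3c", "four-byte, big-endian"),
    (b"\x3c\x00\x00\x00", "four-byte, little-endian"),
    (b"\x00\x3c\x00\x3f", "two-byte, big-endian"),
    (b"\x3c\x00\x3f\x00", "two-byte, little-endian"),
    (b"\x3c\x3f", "ASCII-based"),
    (b"\x4c\x6f", "EBCDIC-based"),  # "<?" in EBCDIC code pages such as cp037
]


def guess_family(data: bytes) -> str:
    """Return a label for the encoding family, or "unknown"."""
    for prefix, family in BOMS + NO_BOM:
        if data.startswith(prefix):
            return family
    return "unknown"


if __name__ == "__main__":
    print(guess_family(b'<?xml version="1.0"?>'))      # ASCII-based
    print(guess_family("<?xml".encode("utf-16-be")))   # two-byte, big-endian
    print(guess_family("<?xml".encode("cp037")))       # EBCDIC-based
```

Nothing after the second character is consulted, so the same table would apply to any declaration that opens with "<?". A real processor would still have to read the declared encoding name afterwards to pick the exact encoding within the family.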
> It seems to me that if you don't have a magic sequence, you have a much more
> difficult problem; you can't figure out whether this: <?kzy irefvba="1.0"
> rapbqvat="ebg13" ?> is XML or the "kzy" media type unless you already know
> the encoding; you can't learn the encoding unless you know what the media
> type is (so you can figure out that it's been rotated, in this case).

I don't think my solution will be able to handle truly bizarre encodings (like rotated text) but I don't think XML does either. I could define a Unicode encoding that makes "<?xml" look like EBCDIC or ASCII and yet is not EBCDIC or ASCII.

<?xml version="1.0" encoding="funkazoid"?>
<QDOCTYPE ...>
<Q-- In funkazoid, "Q" and "!" are swapped. -->

The underlying question, which is worth struggling with, is whether to restrict the set of encodings to ones I know I can deal with or just let the market handle weird ones. XML's gotten by surprisingly well with being liberal. There is really not a big constituency out there for idiosyncratic encodings.

> You might be able to make something out of the existence of the "/" in the
> media type, but I have some doubts, because the length of the type
> designation is variable. You might be able to specify that the media types
> have to be ASCII, but that's awkward for the EBCDIC crowd, and quite
> possibly for others as well (quite a few encodings *do* use ASCII as the
> bottom 7 bits, after all, so perhaps it would be okay, as long as we don't
> mind marginalizing (further) the ones that *don't*).

I believe that the first two bytes will reliably be "4C 6F" in EBCDIC. And we can use the BOM for the various UTF's and UCS's...and even handle documents without the BOM as XML does.

As an aside, given that the mapping from bytes to characters is completely undefined, and not even required to be a proper N-bytes->char mapping, a cynic could make a case that a GIF is XML in a very compressed encoding (without even resorting to saying "well really it's an infoset").
As long as there could be a program that translates the bits into Unicode characters, it's "an encoding." But that really isn't very interesting in practice. AFAIK, an encoding is just a function and some functions are very complicated.

 Paul Prescod