Re: xml over http - RFC 3023
Andrew Welch wrote:
> Hi Rick,
>
>> The out-of-band signalling of character encoding is a fundamentally broken
>> idea, because there are no mechanisms for programs which generate data to
>> memoize the character encoding used that can then feed the rest of the
>> food-chain.
>
> How about the BOM - that's one way isn't it? I wonder if a similar
> ignorable byte sequence could be added to the start of all byte
> sequences to indicate the encoding of what's coming.

There is: it looks like this

<?xml version="1.0" encoding="...

This has the added advantage of being visible in text editors, unlike the BOM (usually).

>>> At the moment it all seems pretty complicated...
>
>> It is not complicated. Use application/xml
>>
>> If you do find intermediate web systems that implement the ASCII default or
>> the ISO 8859-1 default as anything other than 8-bit clean for text/xml submit
>> a bug report.
>
> So this is a real test of XML on the web. The complicated part I was
> referring to is reading the bytes from the http input stream in the
> right encoding:
>
> - extract the encoding from the contenttype
> - if it's not there read the first few bytes of stream in us-ascii and
>   then extract the encoding from the prolog
> - if it's not there use utf-8
> - hope that actual encoding of the file and the encoding you've discovered match
>
> ...and that's not even completely correct as far as I understand.

For application/xml you ignore the first step and go straight to the document. If your data is usually in UTF-8 or ASCII, you could perhaps decode the first block from bytes to characters as UTF-8 and (if the transcoder has not generated an exception) confirm either that there is no XML encoding declaration or BOM, or that the XML encoding declaration names "utf-8". In that case you don't need to do anything more complicated.

If your data is text/xml, you are indeed in a sea of complication, which is why text/xml has been discouraged for so long.
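The application/xml case above (ignore the header, look at the document itself) can be sketched roughly as follows in Python. This is only a simplified illustration, not a full implementation of the autodetection in appendix F of the XML spec: it checks for a BOM, then for an encoding declaration readable as ASCII, and otherwise falls back to UTF-8. It does not handle BOM-less UTF-16/UTF-32 or EBCDIC documents. The function names `sniff_xml_encoding` and `decode_xml` are made up for this example.

```python
import codecs
import re

# BOMs are checked with UTF-32 first, because the UTF-32 LE BOM
# begins with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_BE, "utf-16"),
    (codecs.BOM_UTF16_LE, "utf-16"),
]

# Matches an XML declaration at the very start of an ASCII-compatible
# byte stream and captures the value of its encoding pseudo-attribute.
_DECL = re.compile(
    rb"""^<\?xml\s[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def sniff_xml_encoding(head: bytes) -> str:
    """Guess the encoding from the first bytes of an XML document.

    Hypothetical helper: BOM first, then the encoding declaration,
    then the XML default of UTF-8.
    """
    for bom, name in _BOMS:
        if head.startswith(bom):
            # These Python codecs consume the BOM while decoding.
            return name
    match = _DECL.match(head)
    if match:
        return match.group(1).decode("ascii").lower()
    return "utf-8"


def decode_xml(body: bytes) -> str:
    """Decode a whole document using the sniffed encoding.

    Raises LookupError for an encoding name Python does not know, and
    UnicodeDecodeError if the bytes do not match the label.
    """
    return body.decode(sniff_xml_encoding(body[:1024]))
```

A decode error here corresponds to the "hope that the actual encoding and the discovered encoding match" step: the label was wrong, and the recipient should report that rather than guess.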
The detection method is specified in appendix F of the XML spec. I have implemented it a couple of times. Many other people have implemented it. There is lots of code floating about.

> So when you say:
>
> "It is not complicated. Use application/xml"
>
> I don't get it, what am I missing?
>
> I would've thought the webserver would be aware that it was serving
> xml and take care of it - it could extract the encoding from the xml prolog
> and ensure the file was served with that (maintaining it however it
> liked)... it seems odd that the client should go through this process
> every time.

Maybe, but the mechanism for this to occur, for Apache at least, is for someone to write it, contribute it, champion it and maintain it. The reason webservers typically don't, as I understand it, is that they are too busy to transcode: they need to transfer bits as fast as they can.

This is the problem: typically the webserver is not set up to generate the header correctly, and in any case, how does the server know what the encoding of a particular file is?

But the basic XML contract is that the encoding must be explicitly labelled by the sender (the creator of the document), and the recipient should not guess but use the label. If this is too much for naive users, then XML is simply not the technology for them, and XML should not be blamed for not working in a situation it was explicitly designed to avoid.

It is just like a calculator: someone who does not know what + means cannot use one, and that is not an indictment of mathematics. Character encoding is just as fundamental to computer programming as knowing the difference between floats and ints, for example. It is sad that Western computer science and IT courses have guaranteed the ignorance of their students in this.

In any case, I thought most people had written off RSS as unprocessable by generic XML tools, because so much RSS was not well-formed?
I thought one reason for Atom was that the creators of the early RSS systems messed up their XML and RSS never recovered. With RSS, you are not experiencing the failure of XML on the web; you may be experiencing the failure of non-well-formed XML (and the potential complexity of figuring out text/xml).

Cheers
Rick Jelliffe