Re: Postel's law, exceptions
Tim Bray wrote:
> On Jan 13, 2004, at 6:01 PM, Joe English wrote:
> > True, but the aggregator as a whole might still accept it --
> > possibly by noticing that the encoding is mislabelled and
> > munging the data into proper UTF-8 before passing it to the
> > parser.
>
> Wow, is there any software that actually does this? I hadn't
> encountered it.

My toy aggregator does, after a fashion.

> > In fact any aggregator that doesn't do something like this
> > is doomed to fail -- *nobody's* feed has the encoding labelled
> > properly. (Well, maybe not "nobody", but certainly not very many.)
>
> On the contrary; the vast majority of them are correct. -Tim

That's not been my experience. In a small sample of the ~50 feeds
I'm subscribed to, I find:

  5 with Content-Type: text/xml, no charset parameter [*],
    XML declaration claims to be UTF-8;
  5 with Content-Type: text/xml, no charset parameter,
    XML declaration claims to be ISO-8859-1;
  3 with Content-Type: text/html (!); charset="iso-8859-1",
    XML declaration claims to be "UTF-8";
  1 with Content-Type: text/html; charset="utf-8";
    XML declaration at least agrees about the "utf-8" part
  1 RSS feed with content-type: "text/html", no XML declaration
  1 with Content-Type: text/plain, no charset parameter [*],
    XML declaration says UTF-8;
  1 with Content-type: text/plain, no charset parameter,
    XML declaration says ISO-8859-1
  2 with Content-type: text/plain, no charset parameter,
    XML declaration says nothing (these might actually be
    correct, but probably only by accident since they
    happen to contain only 7-bit characters).
  1 with Content-Type: httpd/unix-directory (?!?)

[*] Which means either "US-ASCII" or "ISO-8859-1", depending
on which RFC you take as authoritative.

In 4 cases, the HTTP header and XML declaration agree on utf-8,
in 2 they agree on ISO-8859-1, and the rest are all "application/*".
Of the ones that agree, I can't say for sure if they're accurate,
since the feed itself happens to contain only 7-bit data at the moment.

I'm surprised at the large number of feeds that don't even get
the *media type* right; I have serious doubts about the accuracy
of the charset parameter.

--Joe English
  jenglish@f...
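
The kind of check described above (compare the HTTP charset parameter with the XML declaration, then munge mislabelled data into something a parser will accept) can be sketched roughly as follows. This is only an illustration, not the code from the toy aggregator mentioned in the message, whose method is not shown. The function names http_label, xml_label and decode_feed are hypothetical, and the fallback order (strict UTF-8 first, then the declared encodings, then ISO-8859-1) is an assumed heuristic.

```python
import re
from email.message import Message

# Matches the encoding pseudo-attribute of an XML declaration at the start
# of the document. Leading whitespace is tolerated here for leniency.
XML_DECL_ENCODING = re.compile(
    rb'^\s*<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def http_label(content_type):
    """Split a Content-Type header value into (media type, charset or None)."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_type(), msg.get_content_charset()


def xml_label(body):
    """Return the encoding named in the XML declaration, lowercased, or None."""
    match = XML_DECL_ENCODING.match(body)
    return match.group(1).decode("ascii").lower() if match else None


def decode_feed(body, content_type):
    """Guess an encoding for the raw feed bytes and return the decoded text.

    Assumed heuristic: non-ASCII data rarely passes a strict UTF-8 decode
    by accident, so UTF-8 is tried first regardless of the labels.
    ISO-8859-1 accepts any byte sequence, so it is the last resort.
    """
    _, http_charset = http_label(content_type)
    for encoding in ("utf-8", xml_label(body), http_charset):
        if not encoding:
            continue
        try:
            return body.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong label, or an encoding name Python does not know
    return body.decode("iso-8859-1")
```

For example, a feed served as text/xml whose declaration says UTF-8 but whose body holds ISO-8859-1 bytes fails the strict UTF-8 decode and falls through to ISO-8859-1. The decoded text would still carry its original XML declaration, so an aggregator re-encoding it as UTF-8 would also need to fix or drop that declaration before handing it to a parser.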