
Re: Postel's law, exceptions



Tim Bray wrote:
> On Jan 13, 2004, at 6:01 PM, Joe English wrote:
> > True, but the aggregator as a whole might still accept it --
> > possibly by noticing that the encoding is mislabelled and
> > munging the data into proper UTF-8 before passing it to the
> > parser.
>
> Wow, is there any software that actually does this?  I hadn't 
> encountered it.

My toy aggregator does, after a fashion.
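A minimal sketch of what such munging might look like, in Python. This is not the toy aggregator's actual code: `munge_to_utf8` and its arguments are made up for illustration, and the sketch ignores byte-order marks and UTF-16 entirely. The idea is to try strict UTF-8 first (arbitrary 8-bit text is rarely valid UTF-8 by accident), then whatever the labels claim, and only then fall back to ISO-8859-1.

```python
import re
from typing import Optional

# Matches the encoding pseudo-attribute in a leading XML declaration.
_DECL = re.compile(rb'^(<\?xml[^>]*?encoding=)(["\'])([A-Za-z0-9._-]+)\2')

def munge_to_utf8(raw: bytes, http_charset: Optional[str] = None) -> bytes:
    """Return the feed bytes re-encoded as UTF-8 (hypothetical helper)."""
    m = _DECL.match(raw)
    declared = m.group(3).decode("ascii") if m else None

    text = None
    # Candidates in order of trust: strict UTF-8, then the XML
    # declaration's claim, then the HTTP charset parameter.
    for enc in ("utf-8", declared, http_charset):
        if not enc:
            continue
        try:
            text = raw.decode(enc)
            break
        except (UnicodeDecodeError, LookupError):
            continue
    if text is None:
        # Last resort: ISO-8859-1 maps every byte, so this cannot fail.
        text = raw.decode("iso-8859-1")

    out = text.encode("utf-8")
    # Relabel the declaration so the parser sees a consistent claim.
    return _DECL.sub(rb'\g<1>\g<2>UTF-8\g<2>', out, count=1)
```

The output can then be handed to a conforming XML parser, which never sees the original mislabelling.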


> > In fact any aggregator that doesn't do something like this
> > is doomed to fail -- *nobody's* feed has the encoding labelled
> > properly.  (Well, maybe not "nobody", but certainly not very many.)
>
> On the contrary; the vast majority of them are correct. -Tim

That's not been my experience.  In a small sample of
the ~50 feeds I'm subscribed to, I find:

    5   with Content-Type: text/xml, no charset parameter [*],
        XML declaration claims to be UTF-8;

    5   with Content-Type: text/xml, no charset parameter,
        XML declaration claims to be ISO-8859-1;

    3   with Content-Type: text/html (!); charset="iso-8859-1",
        XML declaration claims to be "UTF-8";

    1   with Content-Type: text/html; charset="utf-8";
        XML declaration at least agrees about the "utf-8" part

    1   RSS feed with content-type: "text/html", no XML declaration

    1   with Content-Type: text/plain, no charset parameter [*],
        XML declaration says UTF-8;

    1   with Content-type: text/plain, no charset parameter,
        XML declaration says ISO-8859-1

    2   with Content-type: text/plain, no charset parameter,
        XML declaration says nothing (these might actually
        be correct, but probably only by accident since they
        happen to contain only 7-bit characters).

    1   with Content-Type: httpd/unix-directory (?!?)


[*] Which means either "US-ASCII" or "ISO-8859-1", depending
on which RFC you take as authoritative.
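
For illustration, the two readings of a missing charset parameter can be written down as follows. This sketch assumes the two RFCs in question are RFC 3023 (text/xml defaults to US-ASCII) and RFC 2616 (text/* received over HTTP defaults to ISO-8859-1); `header_charset` and `labels_agree` are hypothetical names, not part of any aggregator.

```python
from email.message import Message
from typing import Optional

# Default charset for text/* when the header carries no charset
# parameter, keyed by which RFC is taken as authoritative (assumed).
TEXT_DEFAULTS = {"rfc3023": "us-ascii", "rfc2616": "iso-8859-1"}

def header_charset(content_type: str, rule: str = "rfc3023") -> Optional[str]:
    """Charset implied by a Content-Type header value, lowercased."""
    msg = Message()
    msg["Content-Type"] = content_type
    explicit = msg.get_content_charset()
    if explicit:
        return explicit
    if msg.get_content_maintype() == "text":
        return TEXT_DEFAULTS[rule]
    return None  # e.g. application/*: the header implies nothing

def labels_agree(content_type: str, xml_encoding: Optional[str],
                 rule: str = "rfc3023") -> bool:
    """True if the header does not contradict the XML declaration."""
    from_header = header_charset(content_type, rule)
    # With no encoding declaration (and no BOM), XML defaults to UTF-8.
    from_decl = (xml_encoding or "utf-8").lower()
    return from_header is None or from_header == from_decl
```

Under either reading, a bare `text/xml` header contradicts an XML declaration that says UTF-8.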

In 4 cases, the HTTP header and XML declaration agree
on utf-8, in 2 they agree on ISO-8859-1, and the rest are
all "application/*".  Of the ones that agree, I can't
say for sure if they're accurate, since the feed itself
happens to contain only 7-bit data at the moment.

I'm surprised at the large number of feeds that don't
even get the *media type* right; I have serious doubts
about the accuracy of the charset parameter.


--Joe English

  jenglish@f...
