
Postel's "Law": A question for liberal parsers

  • To: <xml-dev@l...>
  • Subject: Postel's "Law": A question for liberal parsers
  • From: "Bob Wyman" <bob@w...>
  • Date: Thu, 15 Jan 2004 14:41:13 -0500
  • Importance: Normal
  • Reply-to: <bob@w...>

    At pubsub.com, we read about 100K RSS feeds per day and build
"synthetic" feeds based on what we find. (i.e. you can, or soon will
be able to, ask for a custom feed to be generated that contains all
new references to "'Howard Dean' or 'Dr. Dean' or ..."). In the
process of reading all these feeds, we run across quite a bit of junk.
Some of it is non-well-formed XML, but a lot of it is simply failure
to comply with the alleged "specifications" for various versions of
rss. The problem for us is that our service consists of passing on the
items that we find. So, should we as an intermediary be passing on
badly formed chunks of rss (i.e. items) or should we be attempting to
clean them up? 
    If we pass on the bad stuff, we'll be accused by our clients of
creating badly formed RSS files. On the other hand, if we "clean up"
the stuff we find, we may find that the owners of the source feeds
object to our modifying what they published. Some may thank us for
fixing obvious problems. However, I'm nervous that one day one of our
"cleanup" routines will cause a semantic, not just syntactic, change
in the content... What should we do?
    "pubDate" in RSS gives a good example of the problem:
    In RSS 2.0, a pubDate element is supposed to look something like
this:

        <pubDate>Thu, 15 Jan 2004 12:59:06 -0500</pubDate>
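
    Whether an intermediary repairs or passes through, it first has
to decide whether a value conforms. A minimal sketch of such a check,
assuming the RFC 822 date-time grammar with the two-or-four-digit year
that RSS 2.0 permits. The name is_rfc822_date is hypothetical, and the
pattern checks only the shape of the value, not whether the weekday
agrees with the date:

```python
import re

# Building blocks of the RFC 822 date-time grammar (simplified).
DAYS = "Mon|Tue|Wed|Thu|Fri|Sat|Sun"
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
ZONES = r"[+-]\d{4}|UT|GMT|EST|EDT|CST|CDT|MST|MDT|PST|PDT|[A-IK-Z]"

RFC822_DATE = re.compile(
    r"^(?:(?:%s), )?\d{1,2} (?:%s) (?:\d{2}|\d{4}) "
    r"\d{2}:\d{2}(?::\d{2})? (?:%s)$" % (DAYS, MONTHS, ZONES)
)


def is_rfc822_date(value):
    """Return True if value has the shape of an RFC 822 date-time."""
    return RFC822_DATE.match(value.strip()) is not None
```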

    However, we often see these elements arriving with clearly broken
content. For instance, we'll often see things like:

         <pubDate>Thu, 15 January 2004 12:59:06 -0500</pubDate>

    Should we consider the presence of "January" rather than "Jan" to
be an error? Or, should we silently clean it up and convert it to
"Jan"?
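
    If silent cleanup is the chosen answer, the "January" case is the
least risky one, because a fully spelled-out month name maps to
exactly one abbreviation. A minimal sketch, assuming English month
names only. The function name abbreviate_month is hypothetical:

```python
import re

# Full English month names and their RFC 822 abbreviations, in calendar order.
FULL_NAMES = ["January", "February", "March", "April", "May", "June",
              "July", "August", "September", "October", "November", "December"]
ABBREVIATIONS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
                 "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
FULL_TO_ABBR = dict(zip(FULL_NAMES, ABBREVIATIONS))

# Match only complete month names, never partial ones such as "Janu".
FULL_MONTH = re.compile(r"\b(" + "|".join(FULL_NAMES) + r")\b")


def abbreviate_month(pub_date):
    """Replace a fully spelled-out month name with its three-letter form."""
    return FULL_MONTH.sub(lambda m: FULL_TO_ABBR[m.group(0)], pub_date)
```

    Values that are already conforming, and values with an unknown
month token, come back unchanged.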
    What should we do with the following?

        <pubDate>Thu, 15 Janu 2004 12:59:06 -0500</pubDate>

    Should we consider "Janu" to be an abbreviation of "January"? Or,
should we think it is "June"? Should our logic depend on time of year?
(i.e. if closer to June than January, do one thing, if not, do the
other?)
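
    One way to make the guess explicit is to list every month a token
could stand for and repair only when exactly one candidate remains. A
minimal sketch, assuming that "could stand for" means "is a prefix of
the full name". That rule is an assumption for illustration: it reads
"Janu" as January and never considers June. The function name
month_candidates is hypothetical:

```python
FULL_NAMES = ["January", "February", "March", "April", "May", "June",
              "July", "August", "September", "October", "November", "December"]


def month_candidates(token):
    """Return every full month name that starts with token (case-insensitive)."""
    token = token.strip().lower()
    if not token:
        return []
    return [name for name in FULL_NAMES if name.lower().startswith(token)]


def repairable(token):
    """A token is safe to repair only if it has exactly one candidate."""
    return len(month_candidates(token)) == 1
```

    Under this rule "Janu" has one candidate, while "Ju" has two
(June and July) and would be left alone.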
    What should we do with a date that appears in this format?

        <pubDate>2004-01-11T14:04:00 -5:00</pubDate>

    This is not an RFC822 date. However, it is fairly easy to figure
out that it is a date... Should we convert it to RFC822 format? Or,
pass it along as we found it?
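
    If conversion is the chosen answer, it can be sketched as below.
This assumes the loose shape shown above (an ISO-style date and time,
optional whitespace, then a signed offset whose hour may have one or
two digits) and uses format_datetime from Python's standard email.utils
module to produce the RFC 822 form. The function name
loose_iso_to_rfc822 is hypothetical, and anything that does not fit
the assumed shape is rejected rather than guessed at:

```python
import re
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime

# Date, "T", time, optional whitespace, one sign, H or HH, ":", MM.
LOOSE_ISO = re.compile(
    r"^(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})"
    r"\s*([+-])(\d{1,2}):(\d{2})$"
)


def loose_iso_to_rfc822(text):
    """Convert a loosely formatted ISO timestamp to RFC 822, or return None."""
    m = LOOSE_ISO.match(text.strip())
    if m is None:
        return None
    year, month, day, hour, minute, second = (int(g) for g in m.group(1, 2, 3, 4, 5, 6))
    sign = 1 if m.group(7) == "+" else -1
    offset = sign * timedelta(hours=int(m.group(8)), minutes=int(m.group(9)))
    try:
        stamp = datetime(year, month, day, hour, minute, second,
                         tzinfo=timezone(offset))
    except ValueError:
        # Impossible calendar date or out-of-range offset: do not guess.
        return None
    return format_datetime(stamp)
```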
    What about dates like those that appear in the feed at
http://www.theblackrepublican.net/rss.xml ? They don't use the
optional pubDate field but do provide Dublin Core dates. However, they
encode them as follows:

        <dc:date>2004-01-15T08:33:00+-5:00</dc:date> 

    Notice the "+-" (i.e. these folk are a bit conflicted about what
time zone they are in... They can't decide if they are ahead or
behind...) Should we pass this on as "+5:00" or "-5:00" or just leave
it to clients to figure out what is meant?
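
    Unlike a spelled-out month, the "+-" offset has no single obvious
repair, so one cautious option is to detect and flag it rather than
pick a sign. A minimal sketch, assuming the offset is the last thing
in the value. The function name inspect_offset and the status strings
are hypothetical:

```python
import re

# One or more sign characters, then H or HH, ":", MM at the end of the value.
OFFSET = re.compile(r"([+-]+)(\d{1,2}):(\d{2})$")


def inspect_offset(timestamp):
    """Classify the trailing UTC offset as ok, ambiguous-sign, or missing."""
    m = OFFSET.search(timestamp.strip())
    if m is None:
        return ("missing", None)
    signs = m.group(1)
    if len(signs) != 1:
        # e.g. "+-5:00": both signs present, so the intent is unknowable.
        return ("ambiguous-sign", None)
    normalized = "%s%02d:%s" % (signs, int(m.group(2)), m.group(3))
    return ("ok", normalized)
```

    With this check, "+-5:00" is reported as ambiguous and nothing is
rewritten, while a plain "-5:00" is normalized to "-05:00".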

    I would like to be "conservative" in what I generate, but the
problem is that as an intermediary, I'm being fed a lot of stuff that
was generated "liberally". So, I'm in a bind... One interpretation of
Postel's law would say that I should do my best to output proper RSS
V2.0 while being liberal about what I accept. However, another set of
rules (i.e. intermediaries should minimize how much they muck with
content passing through...) would force me to generate non-conforming
feeds. How do I solve this dilemma?

		bob wyman

