XML-DEV Mailing List Archive
Postel's "Law": A question for liberal parsers
At pubsub.com, we read about 100K RSS feeds per day and build "synthetic" feeds based on what we find (i.e. you can, or soon will be able to, ask for a custom feed to be generated that contains all new references to "'Howard Dean' or 'Dr. Dean' or ..."). In the process of reading all these feeds, we run across quite a bit of junk. Some of it is non-well-formed XML, but a lot of it is simply failure to comply with the alleged "specifications" for various versions of RSS.

The problem for us is that our service consists of passing on the items that we find. So, should we as an intermediary be passing on badly formed chunks of RSS (i.e. items), or should we be attempting to clean them up? If we pass on the bad stuff, we'll be accused by our clients of creating badly formed RSS files. On the other hand, if we "clean up" the stuff we find, we may find that the owners of the source feeds object to our modifying what they published. Some may thank us for fixing obvious problems; however, I'm nervous that one day one of our "cleanup" routines will cause a semantic, not just syntactical, change in the content... What should we do?

"pubDate" in RSS gives a good example of the problem. In RSS 2.0, a pubDate element is supposed to look something like this:

<pubDate>Thu, 15 Jan 2004 12:59:06 -0500</pubDate>

However, we often see these elements arriving with clearly broken content. For instance, we'll often see things like:

<pubDate>Thu, 15 January 2004 12:59:06 -0500</pubDate>

Should we consider the presence of "January" rather than "Jan" to be an error? Or should we silently clean it up and convert it to "Jan"? What should we do with the following?

<pubDate>Thu, 15 Janu 2004 12:59:06 -0500</pubDate>

Should we consider "Janu" to be an abbreviation of "January"? Or should we think it is "June"? Should our logic depend on time of year (i.e. if closer to June than January, do one thing; if not, do the other)?

What should we do with a date that appears in this format?
<pubDate>2004-01-11T14:04:00 -5:00</pubDate>

This is not an RFC822 date; however, it is fairly easy to figure out that it is a date... Should we convert it to RFC822 format? Or pass it along as we found it?

What about dates like those that appear in the feed at http://www.theblackrepublican.net/rss.xml ? They don't use the optional pubDate field but do provide Dublin Core dates. However, they encode them as follows:

<dc:date>2004-01-15T08:33:00+-5:00</dc:date>

Notice the "+-" (i.e. these folk are a bit conflicted about what time zone they are in... They can't decide if they are ahead or behind...). Should we pass this on as "+5:00" or "-5:00", or just leave it to clients to figure out what is meant?

I would like to be "conservative" in what I generate, but the problem is that as an intermediary, I'm being fed a lot of stuff that was generated "liberally". So, I'm in a bind... One interpretation of Postel's law would say that I should do my best to output proper RSS V2.0 while being liberal about what I accept. However, another set of rules (i.e. intermediaries should minimize how much they muck with content passing through...) would force me to generate non-conforming feeds. How do I solve this dilemma?

bob wyman
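To make the pubDate cases above concrete, here is a minimal Python sketch of one possible policy. It is not what pubsub.com runs and it does not settle the dilemma. The function name `check_pub_date`, the `Checked` record and the four status labels are all hypothetical, invented for illustration. The sketch assumes a four-digit year and a numeric time zone only, and it never discards the original text, so the intermediary can still choose to pass the item on untouched.

```python
import re
from typing import NamedTuple, Optional

MONTH_NAMES = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", "December"]

# Simplified RFC822-style shape (assumption: 4-digit year, numeric zone only):
# optional day-of-week, day, month token, year, hh:mm:ss, +hhmm or -hhmm.
PUBDATE_RE = re.compile(
    r"^((?:Mon|Tue|Wed|Thu|Fri|Sat|Sun), )?(\d{1,2}) ([A-Za-z]+) (\d{4}) "
    r"(\d{2}:\d{2}:\d{2}) ([+-]\d{4})$"
)


class Checked(NamedTuple):
    original: str           # always kept, so nothing is lost
    cleaned: Optional[str]  # proposed replacement, or None if we won't guess
    status: str             # "ok", "repaired", "guessed" or "unrecognized"


def check_pub_date(raw: str) -> Checked:
    """Classify a pubDate value without throwing the original away."""
    m = PUBDATE_RE.match(raw.strip())
    if m is None:
        return Checked(raw, None, "unrecognized")
    dow, day, month, year, time, zone = m.groups()
    # Month names that the token is a case-insensitive prefix of.
    matches = [n for n in MONTH_NAMES if n.lower().startswith(month.lower())]
    if len(month) < 3 or len(matches) != 1:
        return Checked(raw, None, "unrecognized")
    full = matches[0]
    abbr = full[:3]
    cleaned = f"{dow or ''}{day} {abbr} {year} {time} {zone}"
    if month == abbr:
        status = "ok"        # already in the expected form
    elif month.lower() == full.lower():
        status = "repaired"  # full month name: low-risk rewrite
    else:
        status = "guessed"   # e.g. "Janu": a prefix match, flag it
    return Checked(raw, cleaned, status)
```

Under this particular policy "January" is treated as a safe repair, "Janu" is reported as a guess rather than silently rewritten, and the ISO-style and "+-" dates come back as "unrecognized" with no proposed replacement. Whether a "guessed" value should ever be substituted into an outgoing feed is exactly the open question.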