XML-DEV Mailing List Archive
Re: rss regularis(z)ation
bryan wrote:

> Again the most serious RSS problem for me is escaped html, as such it
> indicates a necessary wrong in any technology, as you are always going
> to require a method for escaping characters reserved for your
> technology.

I found escaped HTML in RSS to be mostly an aesthetic problem (in that it deeply offends my aesthetic sensibilities :-), but not too hard to process. Feed the element content into a tag-soup parser, infer start- and end-tags to turn it into a tree, and strip out all the elements you don't want showing up in the aggregator output. Took me about two hours to code this up (to be fair, I did use an off-the-shelf lexer for the first step).

The biggest problems I've had with RSS have to do with inconsistent usage. For instance: some people put a summary, abstract, or lead paragraph in the <description> as was intended; others put the whole damn entry in there, complete with fifteen paragraphs, two bulleted lists, and six pictures of their cat. Is <dc:creator> the author's email address, full name, or a user ID? I've seen half a dozen different formats for dates (and many feeds don't even include them, which is a real PITA since I want everything sorted reverse-chronologically).

Then there are the encoding problems. The HTTP Content-Type header says "text/plain" with no ";charset=" parameter (implying us-ascii), the XML declaration says "utf-8", but it's actually in iso8859-1. Variations on this theme abound.

Those are minor annoyances that don't greatly affect functionality. <link> vs. <guid> is another matter. Some feeds put the URL of the item itself in the <guid>, and the URL of the thing the item is talking about in the <link>. Others only use <guid> and don't include a <link>. Most, however, put the URL of the item in the <link> and an opaque ID in the <guid>.

The distinction can, of course, be determined by the "isPermaLink" attribute on <guid>: if it's "true", or omitted, then the <guid> is the real link and the <link> is... well, something else. If it's "false", then the <link> is the real link and the <guid> can be ignored.

However, there are differences of opinion on how to capitalize the attribute. Some spell it "isPermaLink" (Bactrian), others spell it "isPermalink" (Dromedary). Not a problem for the DPH regexping his way through, but if you're using a real XML processor, adapting to case-insensitivity is a bit tedious. (For the record: according to Winer's spec the correct spelling has two humps.)

Now as far as a headline browser is concerned, the item's URL is arguably the most important bit of information about it. It shouldn't take so much effort to locate it.

--Joe English

  jenglish@f...
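For illustration, the <guid>/<link> rule described above might be sketched roughly as follows in Python. This is only a sketch under assumptions, not anyone's actual aggregator code: the function name item_url is made up, the items are assumed to be plain RSS 2.0 elements with no namespace, and the "does it look like a URL" check is an extra heuristic (not part of the spec) to cope with feeds that put an opaque ID in <guid> without saying isPermaLink="false".

```python
import xml.etree.ElementTree as ET


def item_url(item):
    """Best guess at an RSS <item>'s own URL, or None (hypothetical helper)."""
    guid = item.find("guid")
    link = item.find("link")

    if guid is not None and guid.text:
        # The default is "true" when the attribute is omitted.
        flag = "true"
        # Match the attribute name case-insensitively so that both
        # "isPermaLink" and "isPermalink" are accepted.
        for name, value in guid.attrib.items():
            if name.lower() == "ispermalink":
                flag = value.strip().lower()
        text = guid.text.strip()
        # Assumption: only trust the guid as a link if it looks like a URL,
        # since many feeds omit the attribute on opaque IDs.
        if flag == "true" and text.startswith(("http://", "https://")):
            return text

    # Otherwise fall back to <link>, if there is one.
    if link is not None and link.text:
        return link.text.strip()
    return None
```

Scanning guid.attrib by hand is the tedious part: ElementTree attribute lookups are case-sensitive, so there is no shortcut for accepting both spellings.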