
Re: rss regularis(z)ation



bryan wrote:

> Again the most serious RSS problem for me is escaped html, as such it
> indicates a necessary wrong in any technology, as you are always going
> to require a method for escaping characters reserved for your
> technology.

I found escaped HTML in RSS to be mostly an aesthetic problem
(in that it deeply offends my aesthetic sensibilities :-),
but not too hard to process.  Feed the element content into
a tag-soup parser, infer start- and end- tags to turn it into
a tree, and strip out all the elements you don't want showing up
in the aggregator output.  Took me about two hours to code this up
(to be fair, I did use an off-the-shelf lexer for the first step).
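
A minimal sketch of that pipeline, for illustration only. It is not the
code described above: it assumes Python's standard html.parser as the
off-the-shelf lexer, and the ALLOWED whitelist, the Sanitizer class, and
sanitize() are hypothetical names.  The input is the <description> text
as the XML parser hands it over, i.e. with the escaping already undone.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: adjust to whatever the aggregator should display.
ALLOWED = {"p", "a", "em", "strong", "ul", "ol", "li", "br"}
VOID = {"br"}                        # elements that never take an end tag
DROP_CONTENT = {"script", "style"}   # drop these together with their text


class Sanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []         # pieces of cleaned-up markup
        self.open_tags = []   # stack of allowed elements still open
        self.skip = 0         # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip += 1
        elif tag in ALLOWED:
            # Keep only href on links; every other attribute is dropped.
            href = dict(attrs).get("href") if tag == "a" else None
            if href:
                self.out.append('<a href="%s">' % escape(href, quote=True))
            else:
                self.out.append("<%s>" % tag)
            if tag not in VOID:
                self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip = max(0, self.skip - 1)
        elif tag in self.open_tags:
            # Infer missing end tags: close everything opened since <tag>.
            while True:
                top = self.open_tags.pop()
                self.out.append("</%s>" % top)
                if top == tag:
                    break
        # Stray or unwanted end tags are silently ignored.

    def handle_data(self, data):
        if not self.skip:
            self.out.append(escape(data, quote=False))

    def close(self):
        super().close()
        while self.open_tags:   # close whatever the feed left open
            self.out.append("</%s>" % self.open_tags.pop())


def sanitize(description_text):
    parser = Sanitizer()
    parser.feed(description_text)
    parser.close()
    return "".join(parser.out)
```

Unwanted elements lose their tags but keep their text, except for
<script> and <style>, whose content is dropped as well.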

The biggest problems I've had with RSS have to do with
inconsistent usage.  For instance: some people put a
summary, abstract, or lead paragraph in the <description>
as was intended, others put the whole damn entry in there,
complete with fifteen paragraphs, two bulleted lists, and
six pictures of their cat.  Is <dc:creator> the author's
email address, full name, or a user ID?  I've seen half a
dozen different formats for dates (and many feeds don't even
include them, which is a real PITA since I want everything
sorted reverse-chronologically.)
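
One way to cope with the date zoo, sketched under assumptions: only two
of the formats are handled (RFC 822 style, as in RSS 2.0 <pubDate>, and
ISO 8601, as in <dc:date>), and a date with no timezone is assumed to be
UTC so that everything stays comparable.  parse_date() is a hypothetical
helper name.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_date(text):
    """Return an aware datetime, or None if the format is not recognized."""
    text = text.strip()
    dt = None
    try:
        # RFC 822 style: "Fri, 12 Mar 2004 10:30:00 GMT"
        dt = parsedate_to_datetime(text)
    except (TypeError, ValueError):
        try:
            # ISO 8601 style: "2004-03-12T10:30:00Z"
            dt = datetime.fromisoformat(text.replace("Z", "+00:00"))
        except ValueError:
            return None
    if dt.tzinfo is None:
        # Assumption: a date without a timezone is treated as UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt
```

Items for which parse_date() returns None still need a fallback sort
key, for instance the time the aggregator first saw them.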

Then there are the encoding problems.  The HTTP Content-Type header
says "text/plain" with no ";charset=" parameter (implying us-ascii),
the XML declaration says "utf-8", but the feed is actually in ISO-8859-1.
Variations on this theme abound.
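
A defensive decoding step can be sketched like this, assuming the raw
bytes and whatever encodings were declared (HTTP header, XML
declaration) have already been collected.  decode_feed() is a
hypothetical helper: it tries the declared encodings strictly, then
UTF-8, and falls back to ISO-8859-1, which accepts any byte sequence.

```python
def decode_feed(raw, declared=()):
    """Return (text, encoding_used) for the raw bytes of a feed."""
    # Declared encodings first (skipping missing ones), then a UTF-8 guess.
    candidates = [enc for enc in declared if enc] + ["utf-8"]
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown encoding name: try the next one
    # Last resort: ISO-8859-1 maps every byte to a character.
    return raw.decode("iso-8859-1"), "iso-8859-1"
```

This only catches declarations that are provably wrong.  A feed that
declares ISO-8859-1 but is really UTF-8 decodes without error and comes
out as mojibake.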

Those are minor annoyances that don't greatly affect functionality.
<link> vs. <guid> is another matter.  Some feeds put the URL
of the item itself in the <guid>, and the URL of the thing
the item is talking about in the <link>.  Others only use <guid> and
don't include a <link>.  Most, however, put the URL of the item
in the <link> and an opaque ID in the <guid>.

The distinction can, of course, be determined by the
"isPermaLink" attribute on <guid>; if it's "true", or omitted,
then the <guid> is the real link and the <link> is... well,
something else.  If it's "false", then the <link> is the
real link and the <guid> can be ignored.
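
The rule above, sketched with Python's xml.etree.ElementTree.  The
helper name item_url() is hypothetical, and the sketch assumes the
attribute is spelled exactly "isPermaLink".

```python
import xml.etree.ElementTree as ET


def item_url(item):
    """Pick the URL of an RSS <item> using the isPermaLink rule."""
    guid = item.find("guid")
    if guid is not None and guid.text and guid.text.strip():
        # A missing attribute counts as "true".
        is_permalink = guid.get("isPermaLink", "true").strip().lower()
        if is_permalink == "true":
            return guid.text.strip()
    # isPermaLink="false", or no usable <guid>: fall back to <link>.
    link = item.findtext("link")
    return link.strip() if link and link.strip() else None


item = ET.fromstring(
    "<item>"
    "<link>http://example.org/about</link>"
    '<guid isPermaLink="false">tag:example.org,2004:1</guid>'
    "</item>"
)
url = item_url(item)  # the <link>, since the <guid> is marked as opaque
```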

However, there are differences of opinion on how to capitalize
the attribute.  Some spell it "isPermaLink" (Bactrian), others
spell it "isPermalink" (Dromedary).  Not a problem for
the DPH regexping his way through, but if you're using
a real XML processor, adapting to case-insensitivity
is a bit tedious.
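
With a real XML processor the workaround amounts to scanning the
attributes by hand.  A sketch with Python's xml.etree.ElementTree, where
get_attr_ci() is a hypothetical helper:

```python
import xml.etree.ElementTree as ET


def get_attr_ci(elem, name, default=None):
    """Look up an attribute while ignoring the case of its name."""
    wanted = name.lower()
    for key, value in elem.attrib.items():
        if key.lower() == wanted:
            return value
    return default


# Works for the one-hump spelling as well as the two-hump one.
guid = ET.fromstring('<guid isPermalink="false">tag:example.org,2004:1</guid>')
flag = get_attr_ci(guid, "isPermaLink", "true")
```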

(For the record: according to Winer's spec the correct spelling
has two humps.)

Now as far as a headline browser is concerned, the item's URL
is arguably the most important bit of information about it.
It shouldn't take so much effort to locate it.


--Joe English

  jenglish@f...
