Re: Web Resource Identity

  • From: Walter Underwood <wunder@i...>
  • To: "xlxp-dev@f..." <xlxp-dev@f...>, xml-dev <xml-dev@i...>, lavoie@o..., frystyk@w...
  • Date: Fri, 28 May 1999 09:37:51 -0700

At 10:01 AM 5/28/99 -0500, Paul Prescod wrote:
> [...] Consider the following URLs:
>
>http://www.mitre.org/index.html
>http://www.mitre.org/
>http://www.mitre.org
>
>Do they refer to the same resource? Let's try the answer both ways:

Dublin Core defines a Resource Identifier element, which might
be a good place to start, especially because libraries already
have to deal with various kinds of identity (different bindings,
"copy 2", same text from different publishers). The HTTP 1.1 
entity tag is only meaningful for a single URL (at least in Draft 5),
so it is mostly useful to caches.

Also, some additional complexities ...

Don't forget content negotiation. The content could exist
in many variants, with the entity delivered depending on 
the Accept*: headers in the request:

   format:    HTML, XML, MS Word, PDF
   language:  en-US, en-GB, fr, de, ja
   charset:   8859-1, UTF-8, EUC, Shift-JIS

and the URL is always the same.
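
To make the point concrete, here is a minimal sketch of how one URL
can map to several entities. The variant table, the file names, and
the function names are all hypothetical, and only the format and
language dimensions are shown. Real HTTP negotiation also covers
charset, wildcards, and language-range matching, which this ignores.

```python
def parse_accept(header):
    """Return the values of an Accept-style header, highest q first."""
    items = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        value, _, params = part.partition(";")
        q = 1.0  # a value with no q parameter defaults to 1.0
        for param in params.split(";"):
            name, _, val = param.strip().partition("=")
            if name == "q":
                try:
                    q = float(val)
                except ValueError:
                    q = 0.0
        items.append((value.strip().lower(), q))
    # sort() is stable, so header order is kept among equal q values
    items.sort(key=lambda item: -item[1])
    return [value for value, q in items if q > 0]


# Hypothetical variants of a single URL, keyed by (format, language).
VARIANTS = {
    ("text/html", "en-us"): "index.en-US.html",
    ("text/html", "fr"): "index.fr.html",
    ("application/pdf", "en-us"): "index.en-US.pdf",
}


def select_variant(accept, accept_language):
    """Pick the stored variant that best matches the request headers."""
    for media in parse_accept(accept):
        for lang in parse_accept(accept_language):
            name = VARIANTS.get((media, lang))
            if name:
                return name
    return None  # nothing acceptable is available
```

Two clients asking for the same URL get different entities:
select_variant("text/html", "fr, en-US;q=0.5") picks the French HTML
file, while select_variant("application/pdf", "en-US") picks the PDF.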

Then there are dynamic pages--what is "identity" for a weather 
station? The page is in some sense "the same page", but the 
content depends on the temperature.

Duplicated content is a real issue. Our search engine detects
and rejects duplicates. The URL to unique document ratio is
usually between 1.5 and 2. We do this detection across web
servers, since we really only need to index one copy of an
organization's acceptable use policy or the GNU copyleft.
If a site has a CNAME, the entire site will be duplicates.
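
One simple way to detect exact duplicates is to fingerprint each
page body and keep only the first URL seen for each fingerprint.
This is a sketch under that assumption, not a description of how
our engine does it. The function names are made up, and hashing
whitespace-normalized bytes only catches copies that are identical
apart from whitespace.

```python
import hashlib


def fingerprint(body: bytes) -> str:
    """Hash the page body after collapsing runs of whitespace."""
    normalized = b" ".join(body.split())
    return hashlib.sha256(normalized).hexdigest()


def dedupe(pages):
    """Split (url, body) pairs into unique documents and duplicates.

    Returns a dict mapping fingerprint -> first URL seen, and a list
    of (duplicate_url, original_url) pairs.
    """
    first_url = {}
    duplicates = []
    for url, body in pages:
        key = fingerprint(body)
        if key in first_url:
            duplicates.append((url, first_url[key]))
        else:
            first_url[key] = url
    return first_url, duplicates


# Hypothetical crawl: the same policy text served under two host names.
pages = [
    ("http://www.corp.com/aup.html", b"Acceptable use policy\n"),
    ("http://web.corp.com/aup.html", b"Acceptable  use policy"),
    ("http://www.corp.com/news.html", b"News"),
]
unique, dups = dedupe(pages)
ratio = len(pages) / len(unique)  # URL to unique document ratio
```

Because the fingerprint ignores the URL entirely, the comparison
works across web servers and across CNAME aliases of one server.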

Finally, some systems ignore case in file names, and relative
URLs are resolved according to the URL you used in the GET,
so we see:

   http://www.corp.com/dir/index.html
   http://www.corp.com/DIR/index.html
   http://www.corp.com/dir/INDEX.HTML
   http://www.corp.com/DIR/INDEX.HTML
   http://www.corp.com/DIR
   http://www.corp.com/DIR/
   http://www.corp.com/dir
   http://www.corp.com/dir/

with combinatorial explosions on longer URLs. And a nightmare
for robots and caching proxies.
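
The growth is easy to count. The sketch below enumerates the
spellings in the list above for a directory path of any depth. The
function name is made up, and it assumes only all-lowercase and
all-uppercase spellings of each segment, as in the list. Mixed case
would make the numbers far worse.

```python
from itertools import product


def url_variants(host, dirs, index="index.html"):
    """List spellings of one directory URL on a case-insensitive server.

    dirs must contain at least one path segment.
    """
    # Each segment is tried in all-lowercase and all-uppercase form.
    cased_dirs = product(*[(d.lower(), d.upper()) for d in dirs])
    urls = []
    for combo in cased_dirs:
        base = "http://" + host + "/" + "/".join(combo)
        urls.append(base)                        # no trailing slash
        urls.append(base + "/")                  # trailing slash
        urls.append(base + "/" + index.lower())  # explicit index file
        urls.append(base + "/" + index.upper())
    return urls
```

Each directory level doubles the count, so the result has
4 * 2**len(dirs) entries: url_variants("www.corp.com", ["dir"])
gives the 8 URLs listed above, and a path three levels deep gives 32.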

wunder
--
Walter R. Underwood
wunder@i...
wunder@b... (home)
http://software.infoseek.com/cce/ (my product)
http://www.best.com/~wunder/
1-408-543-6946

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@i...
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To (un)subscribe, mailto:majordomo@i... the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo@i... the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa@i...)
