Re: Web Resource Identity
At 10:01 AM 5/28/99 -0500, Paul Prescod wrote:
> [...] Consider the following URLs:
>
>http://www.mitre.org/index.html
>http://www.mitre.org/
>http://www.mitre.org
>
>Do they refer to the same resource? Let's try the answer both ways:

Dublin Core defines a Resource Identifier element, which might be a good place to start, especially because libraries already have to deal with various kinds of identity (different bindings, "copy 2", same text from different publishers).

The HTTP 1.1 entity tag is only meaningful for a single URL (at least in Draft 5), so it is mostly useful to caches.

Also, some additional complexities ...

Don't forget content negotiation. The content could exist in many variants, with the entity delivered depending on the Accept*: headers in the request:

  format: HTML, XML, MS Word, PDF
  language: en-US, en-GB, fr, de, jp
  charset: 8859-1, UTF-8, EUC, Shift-JIS

and the URL is always the same.

Then there are dynamic pages--what is "identity" for a weather station? The page is in some sense "the same page", but the content depends on the temperature.

Duplicated content is a real issue. Our search engine detects and rejects duplicates. The URL to unique document ratio is usually between 1.5 and 2. We do this detection across web servers, since we really only need to index one copy of an organization's acceptable use policy or the GNU copyleft. If a site has a CNAME, the entire site will be duplicates.

Finally, some systems ignore case in file names, and relative URLs are resolved according to the URL you used in the GET, so we see:

  http://www.corp.com/dir/index.html
  http://www.corp.com/DIR/index.html
  http://www.corp.com/dir/INDEX.HTML
  http://www.corp.com/DIR/INDEX.HTML
  http://www.corp.com/DIR
  http://www.corp.com/DIR/
  http://www.corp.com/dir
  http://www.corp.com/dir/

with combinatorial explosions on longer URLs. And a nightmare for robots and caching proxies.

wunder
--
Walter R. Underwood
wunder@i...
wunder@b...
(home)
http://software.infoseek.com/cce/ (my product)
http://www.best.com/~wunder/
1-408-543-6946
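
To make the duplicate problem in the message concrete, here is a minimal sketch of grouping URLs by the content they return. The message does not say how the search engine detects duplicates, so the exact-match SHA-256 hashing used here is an assumption, and the `pages` mapping, its body strings, and both function names are hypothetical stand-ins for already-fetched responses.

```python
import hashlib
from collections import defaultdict


def group_duplicates(pages):
    """Group URLs whose response bodies are byte-for-byte identical.

    `pages` maps URL -> body bytes (hypothetical, already-fetched content).
    Returns a dict mapping SHA-256 hex digest -> list of URLs.
    """
    groups = defaultdict(list)
    for url, body in pages.items():
        digest = hashlib.sha256(body).hexdigest()
        groups[digest].append(url)
    return dict(groups)


def url_to_document_ratio(pages):
    """Number of URLs divided by the number of distinct bodies."""
    return len(pages) / len(group_duplicates(pages))


# Hypothetical crawl results: four spellings of one page plus one other page.
pages = {
    "http://www.corp.com/dir/index.html": b"<html>policy</html>",
    "http://www.corp.com/DIR/index.html": b"<html>policy</html>",
    "http://www.corp.com/dir/": b"<html>policy</html>",
    "http://www.corp.com/DIR/": b"<html>policy</html>",
    "http://www.corp.com/other.html": b"<html>other</html>",
}

groups = group_duplicates(pages)
ratio = url_to_document_ratio(pages)
```

Comparing content rather than rewriting URLs sidesteps the question of whether a given server really ignores case, which cannot be known from the URL alone. It would not catch near-duplicates, negotiated variants, or dynamic pages such as the weather station example.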