
Re: xml:base and fragments

  • From: "Andrew S. Townley" <ast@atownley.org>
  • To: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>
  • Date: Tue, 9 May 2017 17:46:08 +0200

Hi C.M,

I left this stew a bit because I got caught up in other things, and I wanted to try and give it a proper response rather than just getting lost down a rat hole.  The latter may happen anyway, but at least, I’m making an effort at the former…. ;)

I think before I dive into the blow-by-blow questions below, I want to reset/establish some perspective here so we’re on the same page. I also want to say again that I’m interested in this to document/clarify my own understanding of how this should work because I do a lot of work where this type of resolution is pretty critical.

If we don’t agree on the scope of the discussion, it’s easy to get lost or get focused on individual responses rather than the content. Hopefully, we can agree on the context at least.

Here we go…

The primary issue here is about the resolution of relative URIs in various contexts. URIs in this case are defined by RFC 3986.

As the specification rightly points out, dealing with relative URIs involves two distinct operations, namely 1) establishing a “base URI” that is the point of reference for the relative URI component and, 2) resolving the base URI and the relative URI component into a complete URI that can be used as URIs are wont.

URIs are strings (character sequences) with special characteristics that are used to identify an abstract or physical resource.  In this role, they may be both content as well as locators for content.  They are content because they must be encoded and transferred between information sources and they are locators because they define a way to represent various locations within abstract or physical address spaces.

URIs may be included in any type of resource content, and, as a result, the location of the resource and the location of other resources referenced within that resource may be separated in time and space.

The specific portions of RFC 3986 that are most relevant to this discussion are Section 4.4, Same-Document Reference, and Section 5.1, Establishing a Base URI.

The rest of Section 5 defines the operations, with examples, for actually creating a complete URI from a base URI and a relative reference, so it is independent of exactly HOW the base URI was identified in the first place.
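
To make that second operation concrete, here is a minimal sketch using `urljoin` from Python's standard `urllib.parse` module, which follows the RFC 3986 resolution rules for cases like these. The base URI and the relative references are taken from the RFC's own examples in Section 5.4; the helper name `resolve` is purely illustrative.

```python
from urllib.parse import urljoin


def resolve(base_uri: str, reference: str) -> str:
    """Resolve a relative reference against an already-established base URI."""
    return urljoin(base_uri, reference)


# Base URI used for the reference-resolution examples in RFC 3986, Section 5.4.
BASE = "http://a/b/c/d;p?q"

for ref in ("g", "../g", "#s"):
    print(f"{ref!r:8} -> {resolve(BASE, ref)}")
```

Note that nothing in this step knows or cares where `BASE` came from; that is the whole point of separating the two operations.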

The crux of this whole question/discussion seems to be Section 5.1, which describes 4 ways to establish a base URI. They are defined in order of precedence, so as soon as you get a match for one, you have identified the single, effective base URI to be used according to the rest of Section 5.

The priority of establishing a base URI is:

1. Base URIs embedded in content
2. Base URI of an encapsulating entity (to be identified by recursively applying these rules within the encapsulating entity itself)
3. Establishing the base URI as the URI from which the resource being processed was retrieved
4. Establishing an application-specific, default base URI

Hopefully, we agree on the above, because that’s just the spec, and I’m sure we agree on points 4, 3 and possibly 2, depending on the structure and nature of the content.  In fact, there’s not really much difference between 1 & 2, because an encapsulating entity by its definition is content in which the relative references are contained.

Hopefully, we also agree that when establishing a base URI, you only get one for any resolution pass through the above rules.
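
The "first match wins, and you only get one" reading of Section 5.1 can be sketched in a few lines. This is only an illustration of the precedence order as I have described it; the function name, its parameters, and the example.org URI are all hypothetical, and how each candidate value is actually discovered is deliberately left out.

```python
from typing import Optional


def establish_base_uri(
    embedded: Optional[str] = None,       # 5.1.1: base URI embedded in content
    encapsulating: Optional[str] = None,  # 5.1.2: base URI of the encapsulating entity
    retrieval: Optional[str] = None,      # 5.1.3: URI used to retrieve the entity
    default: Optional[str] = None,        # 5.1.4: application-dependent default
) -> str:
    """Return the single effective base URI: the first candidate that is defined."""
    for candidate in (embedded, encapsulating, retrieval, default):
        if candidate is not None:
            return candidate
    raise ValueError("no base URI could be established")


# An embedded base URI wins over the retrieval URI.
print(establish_base_uri(
    embedded="http://www.dictionary.com/a.html",
    retrieval="http://example.org/doc.xml",
))
```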

If you are unable to establish the semantics of the content in which the URI is embedded, and/or you are unable to establish the semantics of the encapsulating entity in which the URI is embedded, but you can still parse URIs (or what you believe to be URIs and relative references), it is *possible* to justify skipping steps 1 & 2 because you have an opaque content encoding; however, the risk is that you will get it wrong.

In this case, the right thing to do as an application is to provide some kind of alert/warning to the user or agent acting on the absolute URIs you generate that you only partially understood the source resource in which the URI was embedded.
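
One way an application might do that is sketched below, assuming Python and its standard `warnings` module. The function name `resolve_opaque` and the wording of the message are hypothetical; a real application would route the notice to whatever user or agent consumes the resulting URIs.

```python
import warnings
from urllib.parse import urljoin


def resolve_opaque(reference: str, retrieval_uri: str) -> str:
    """Resolve against the retrieval URI when the content format is not
    understood, and tell the caller that steps 1 and 2 were skipped."""
    warnings.warn(
        "content format not understood; any embedded or encapsulating "
        "base URI was ignored when resolving " + repr(reference),
        UserWarning,
        stacklevel=2,
    )
    return urljoin(retrieval_uri, reference)
```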

The better solution is to actually understand the syntax and semantics of the content in which the relative URI is embedded and encode that understanding into the logic and processing of the application charged with resolution of those relative references into absolute URIs.

Since RFC 3986 only describes the syntax and semantics of identifying abstract or physical resources, it can’t authoritatively say anything about how an application can determine a content author’s specification of a base URI in resolution step #1, nor can it authoritatively say anything about how to identify an appropriate and reasonable base URI location in relation to an encapsulating entity (Note that I state “content author” here because they are the ones in control of the content of the resource, even if that “content author” is an automated system).

All RFC 3986 can authoritatively discuss is how to determine a base URI from a URI associated with the resource entity being processed.  The rest must be defined by specifications of other things that work in conjunction to assist the processing application in identifying that first, crucial resolution step intended by the content author: the base URI of relative references embedded within that content.

RFC 3986 has therefore done its job, and while it says it is possible for authors and specification writers to provide authoritative specifications of what the base URI value should be within a given content or encapsulating entity format, RFC 3986 doesn’t know or care how that’s done at all.

What it says is “there may be a way”:

[RFC 3986] 5.1.1.  Base URI Embedded in Content

   Within certain media types, a base URI for relative references can be
   embedded within the content itself so that it can be readily obtained
   by a parser.  This can be useful for descriptive documents, such as
   tables of contents, which may be transmitted to others through
   protocols other than their usual retrieval context (e.g., email or
   USENET news).

   It is beyond the scope of this specification to specify how, for each
   media type, a base URI can be embedded.  The appropriate syntax, when
   available, is described by the data format specification associated
   with each media type.

And the last sentence is what brings us to the title of the thread.  The “appropriate syntax” in question is defined both by:

1. The XML Base Recommendation that defines a standardized way to express this concept within XML, and
2. The specific XML vocabulary making normative references to XML Base as the way to identify the author or publisher’s intended base URI value for documents encoded with that XML vocabulary.

The XML vocabulary referencing XML Base defines which attributes and element content should be interpreted as URIs, and which of those (potentially all) are to be interpreted relative to the rules specified in the XML Base Recommendation.

It is for the above reasons that I stand by my original statement that the issue here does not, in fact, have anything to do with RFC 3986 for the simple reason that we’re talking about the syntax and semantics of *content*, not URIs.

According to Base URI Resolution Step 1 in RFC 3986, content-specific rules, if defined, take absolute precedence over any of the other potential approaches to establishing an appropriate value to be interpreted as a “base URI” in any other part of the RFC.

That means for applications that “understand” the syntax and semantics of content capable of defining a base URI, there is exactly one – and only one – possible value to be used as the base URI when attempting to resolve any relative URI into an absolute URI. According to the text of RFC 3986, there is no other possible way.

If you don’t understand the syntax and semantics of the content, you can’t actually identify which particular character-sequences are potentially relative URIs in the first place, so there’s no excuse for you to fall back to resolution step #3—unless the syntax and semantics of the content being processed DO NOT specify any way to allow the publisher or author to explicitly indicate a value for base URIs.

In the presence of a vocabulary with a normative reference to the XML Base Recommendation, you’re stuck with only one choice for establishing the value of a base URI: the mechanisms defined by the XML Base Recommendation.  You can’t change the rules and be compliant with RFC 3986 and the XML Base Recommendation because they are two parts of the same thing.
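
As a rough sketch of what that looks like in an application, the following walks an XML tree with Python's standard `xml.etree.ElementTree`, tracks the `xml:base` value in scope at each element, and resolves every `target` attribute against it. It encodes the reading I am arguing for here and is not a complete XML Base implementation. The `ref`/`target` vocabulary and the dictionary.com URI come from the example discussed in this thread; the `doc` wrapper, the `#intro` reference, and the example.org document URI are assumptions added for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# ElementTree exposes xml:base under the reserved XML namespace.
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"


def resolve_targets(elem, base_uri, attr="target"):
    """Yield (raw value, absolute URI) for every `attr` in the subtree,
    resolving each against the xml:base value in scope at that element."""
    declared = elem.get(XML_BASE)
    if declared is not None:
        # An xml:base value may itself be relative to the inherited base.
        base_uri = urljoin(base_uri, declared)
    if attr in elem.attrib:
        yield elem.get(attr), urljoin(base_uri, elem.get(attr))
    for child in elem:
        yield from resolve_targets(child, base_uri, attr)


DOC = """<doc>
  <ref target="#intro">Intro</ref>
  <div xml:base="http://www.dictionary.com/a.html">
    <p><ref target="#apple">Apple</ref></p>
  </div>
</doc>"""

# Hypothetical retrieval URI of the document itself (RFC 3986, Section 5.1.3).
DOCUMENT_URI = "http://example.org/doc.xml"

results = list(resolve_targets(ET.fromstring(DOC), DOCUMENT_URI))
for raw, absolute in results:
    print(f"{raw} -> {absolute}")
```

Each reference ends up with exactly one absolute URI, determined by the innermost base in scope.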

Hopefully, you’re still with me and, even if you don’t agree, you can understand where I’m coming from both with my previous comments and with the remainder of my responses to your in-line comments below.

> On May 8, 2017, at 12:37 AM, C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com> wrote:
> 
> 
>> On May 7, 2017, at 3:07 PM, Andrew S. Townley <ast@atownley.org> wrote:
>> 
>> ...
>> 
>>> On May 7, 2017, at 9:43 PM, C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com> wrote:
>>> 
>>> ...
>>> 
>>> Also — where on earth does this idea arise that the questions below relate to the interpretation of the
>>> xml:base spec?  The semantics which some people find unexpected is defined by RFC 3986, not
>>> by XML Base.  It is not the use of XML Base, but the use of URIs, that bring the semantics in question
>>> into one’s application.
>> 
>> I wouldn’t see it that way, but maybe it’s just me.
>> 
>> To my mind, the use of xml:base and the way that influences the interpretation of values within a given vocabulary is completely relevant because it’s the rules of XML Base that construct the URIs to be resolved.  XML Base comes first, then 3986 comes second.
> 
> Very few people have specified what they think is a problem here, but I expect most people who are unhappy with the interpretations offered for various examples earlier in this thread are unhappy with the conclusion that given a document URI D, a base URI B, and a fragment-only relative reference #R, #R has the absolute form B#R and also denotes the fragment R within the current document (so it denotes the same thing as D#R).  If this is true, then they are of course within their rights to be unhappy.  But the conclusion that #R identifies both B#R and D#R is not a function of one’s choice of base URI, or of the mechanism used to specify a base URI.  It is a function of the way the resolution rules in RFC 3986 define the interaction of document URIs, base URIs, and relative references.

Unless you and I have dramatically different interpretations of the way the resolution rules actually are intended to work, there is no “both”. There is only ever one base URI that may be used when resolving a given URI embedded in XML vocabularies compliant with XML Base, and that base URI is the most appropriate of the containing element, document or relevant external entity.

> 
>> 
>> More specifically, an application processing a document using xml:base must construct a URI on the basis of the current xml:base value, if any *before* that URI is to be resolved by any application processing the document.  
> 
> No.  “URI resolution” is the process of constructing an absolute URI from a relative URI by absolutizing it against the relevant base URI.
> 
> It is not to be confused with “dereferencing”, the process of performing some action (e.g. a GET or a PUT) on a URI.
> 
> And neither is to be confused with “retrieval”, which is the fetching of data from an external source.  (If a browser has already loaded a document, it does not need to retrieve the document again in order to dereference a fragment in the current document; it can and should simply scroll to the new location.)
> 
> The first two of these terms are defined and distinguished in RFC 3986 section 1.2.2, “Separating Identification from Interaction”.

Fair enough.  I wasn’t precise enough in my language.  Hopefully, this error has since been corrected in my response this time.  This was an attempt to try and re-phrase the precedence rules.

> 
>> Therefore, the result of applying XML Base is a fully-qualified URI, not a relative URI.
> 
> Yes, it is.  It is the fully qualified (= absolute) URI produced by applying the rules of RFC 3986 to the base URI and the relative reference.  We know it’s the result of the RFC 3986 rules, because those are the rules XML Base normatively refers to for creating an absolute URI from a relative reference, and the only ones specified in the XML Base spec.
> 
> Can you point to the words in XML Base which you believe explain how to produce this result and which do not (if I understand your claim correctly) appeal to any rules in 3986? 

Hopefully, I covered this in the introduction.  However, if not, it isn’t the rules of XML Base that explain it, it’s the resolution rules and precedence defined by RFC 3986 in Section 5.1.

> 
>> 
>> The issue here comes back to some of Simon’s comments just now about the needs of authors vs. application developers.
>> 
>> Again, maybe it’s just me, but xml:base exists to provide document authors a shorthand for URI references to elements that belong to the same URI structure. In this way, I see them much more like namespace prefixes than I do anything else, and that’s why I suggest the processing model above and also why I say that the way this question should be answered actually has nothing to do with 3986 at all.
> 
> You are not alone, I think.  Many people do regard the point of in-content base URIs, as provided by XML Base and by the HTML ‘base’ element, in just that way.  Others, as you probably know, regard the purpose of in-content base URIs as being to provide a meaningful base URI in contexts where for some reason the retrieval URI is not otherwise available.  Since the editors of the relevant specifications include people of both opinions, it is not surprising if the specs occasionally exhibit an uneasy truce between the two positions.

Regardless of their purpose or intent, the important points are:

1) they are, in fact, provided within the content, and 
2) therefore, they must take precedence over any other possible base URI value according to RFC 3986.

The wording and rules in Section 5 are quite clear on this, and your second rationale for the purpose of in-content base URIs is a result of the practical problem-solving of content processing applications; it has nothing to do with the concerns of the authors at all.

Let’s be clear:  if there’s a way to specify a base URI in a content vocabulary, it’s for the benefit of the content author, not the developer of the processing application. The processing application MUST take those author-specified base URIs into account as specified by both the vocabulary and URI construction/resolution rules in RFC 3986.

Also, please understand that I have worn, and still wear, both hats on an almost equal basis, and have done so for a considerable amount of time (20+ years). Therefore, I’m not particularly biased to either side of the house in this respect.

> 
> But it’s not clear to me why a technology like xml:base having a particular purpose leads you to believe that questions about the meaning of certain constructs like the two #a references in your earlier example should be answered without considering the normative text which assigns meaning to those constructs.  It happens that that normative text is found in RFC 3986.    

For the simple reason that the very RFC you keep quoting explicitly cedes defining of the meaning of the base URI to content formats *as a priority* according to Section 5.1:

5.1.  Establishing a Base URI

   The term "relative" implies that a "base URI" exists against which
   the relative reference is applied.  Aside from fragment-only
   references (Section 4.4), relative references are only usable when a
   base URI is known.  A base URI must be established by the parser
   prior to parsing URI references that might be relative.  A base URI
   must conform to the <absolute-URI> syntax rule (Section 4.3).  If the
   base URI is obtained from a URI reference, then that reference must
   be converted to absolute form and stripped of any fragment component
   prior to its use as a base URI.

   The base URI of a reference can be established in one of four ways,
   discussed below in order of precedence.  The order of precedence can
   be thought of in terms of layers, where the innermost defined base
   URI has the highest precedence.  This can be visualized graphically
   as follows:

         .----------------------------------------------------------.
         |  .----------------------------------------------------.  |
         |  |  .----------------------------------------------.  |  |
         |  |  |  .----------------------------------------.  |  |  |
         |  |  |  |  .----------------------------------.  |  |  |  |
         |  |  |  |  |       <relative-reference>       |  |  |  |  |
         |  |  |  |  `----------------------------------'  |  |  |  |
         |  |  |  | (5.1.1) Base URI embedded in content   |  |  |  |
         |  |  |  `----------------------------------------'  |  |  |
         |  |  | (5.1.2) Base URI of the encapsulating entity |  |  |
         |  |  |         (message, representation, or none)   |  |  |
         |  |  `----------------------------------------------'  |  |
         |  | (5.1.3) URI used to retrieve the entity            |  |
         |  `----------------------------------------------------'  |
         | (5.1.4) Default Base URI (application-dependent)         |
         `----------------------------------------------------------'
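
The requirement in the quoted text that a base URI obtained from a URI reference be "converted to absolute form and stripped of any fragment component" can be sketched as follows, assuming Python's standard `urllib.parse`. The helper name `as_base_uri` and the example.org URIs are hypothetical.

```python
from urllib.parse import urldefrag, urljoin


def as_base_uri(uri_reference: str, context_base: str) -> str:
    """Turn a URI reference into something usable as a base URI:
    make it absolute, then drop any fragment component."""
    absolute = urljoin(context_base, uri_reference)
    return urldefrag(absolute).url


print(as_base_uri("chapters/one.xml#sec2", "http://example.org/book/index.xml"))
```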

> 
>> 
>> If you are expecting 3986 to take care of this for you, then I don’t see the value of specifying and using xml:base at all, because to my reading, they apply different sets of rules.  
> 
> They define overlapping sets of rules:  since xml:base refers normatively to RFC 3986, the one set is a superset of the other.

Not really.  It provides a degree of modularity, not overlap.  The references to RFC 3986 from the XML Base Recommendation are there to strongly link its purpose to Step 1 of the resolution process and to provide additional context and examples so you don’t need to have both specs in front of you at the same time.

There’s no overlap because RFC 3986 explicitly prevents any conflict or confusion from happening through built-in extension points in resolution rules 1 and 2.

> 
> RFC 3986 defines a generic syntax for URIs and prescribes rules for reference resolution, including rules for establishing a base URI (which include the possibility of base URIs embedded in the content of the entity, for certain media types) and rules for resolving relative references against a base URI.  It also addresses a few other issues.
> 
> XML Base, by contrast, limits itself to providing one mechanism for embedding base URIs within content, suitable for XML documents (application/xml and other), just as various versions of the HTML spec define a different mechanism for embedding base URIs within content (for text/html and some other media types).  It refers to RFC 3986 for the explanation of what it means to have a base URI and how it affects the denotation of references.

Basically correct.  XML Base really serves to tell application developers who process content conformant with the XML Base specification how to identify the author’s intended base URI values.  Once a single base URI value has been established, relative references are resolved against it according to the rules specified by RFC 3986.

> 
> Without xml:base, an XML application would need to specify its own mechanism for embedding base URIs in content.  There’s nothing to make that impossible, but when xml:base provides an adequate mechanism, there’s no pressing need to invent another.

Exactly.  However, you also need to understand what this means in the context of RFC 3986, and this is the crux of where things seem to be stuck or confused.

> 
>> One is about creating URI values and the other is about resolving those.  In the context of this discussion, they are subtle but important differences.
> 
> The question with which the discussion began was, essentially, “What is identified by the target attribute in the following document?”, where the ‘target’ attribute is understood as containing a URI (here a relative reference) and the base URI is given by xml:base.
> 
>  <div xml:base="http://www.dictionary.com/a.html">
>    <p>
>      <ref target="#apple">Apple</ref>
>    </p>
>    ….
>  </div>
> 
> It is a question about the meaning of a given relative reference in a document.  
> 
> It is not a question about how best to support authoring, or how best to simplify implementations of URI-aware software, or how to make spec-casuists happy.  It is not a question about what network activity should or should not follow a request to dereference the URI “#apple”, and if it were, it would reduce to the question about what the reference identifies, because a conforming browser will dereference the URI by fetching (if need be) the object identified, and then displaying the relevant fragment (so the question about dereferencing behavior reduces to the question about meaning).

You’re right.  It isn’t about which set of stakeholder needs are being considered.  What it is about is understanding how layered specifications work together to solve problems in an extensible way.

The architecture of the RFC 3986 specification recognized that content formats evolve, so it allows the designer of the content format to create a way in which the content itself (as created by the author or publisher) can be the definitive, authoritative source of the base URI to be used when resolving relative references.

Both specifications do their part, and there’s really no question about how it should work if you carefully read and understand both specifications.

> 
> RFC 3986 and xml:base do indeed have importantly different roles here.  The xml:base spec tells us that the base URI to use in resolving “#apple” is http://www.dictionary.com/a.html, and RFC 3986 tells us (a) that the absolute form of the reference is http://www.dictionary.com/a.html#apple and (b) that the target of the reference is by definition contained within the current document and dereferencing it “should not” launch a new retrieval action. 
> 

They’re not different.  They are complementary.

Your point (b) is just false.  The specification says nothing about a “current document”.  Section 3.5, Fragment, which defines what a fragment identifier is and how it is used, is stated in terms of primary and secondary resources, not current or non-current documents.

The relevant part is the first two paragraphs:

3.5.  Fragment

   The fragment identifier component of a URI allows indirect
   identification of a secondary resource by reference to a primary
   resource and additional identifying information.  The identified
   secondary resource may be some portion or subset of the primary
   resource, some view on representations of the primary resource, or
   some other resource defined or described by those representations.  A
   fragment identifier component is indicated by the presence of a
   number sign ("#") character and terminated by the end of the URI.

      fragment    = *( pchar / "/" / "?" )

   The semantics of a fragment identifier are defined by the set of
   representations that might result from a retrieval action on the
   primary resource.  The fragment's format and resolution is therefore
   dependent on the media type [RFC2046] of a potentially retrieved
   representation, even though such a retrieval is only performed if the
   URI is dereferenced.  If no such representation exists, then the
   semantics of the fragment are considered unknown and are effectively
   unconstrained.  Fragment identifier semantics are independent of the
   URI scheme and thus cannot be redefined by scheme specifications.

In your example, retrieval is at best tangential to the real issue of “what the hell are we talking about in reference to ‘#apple’?”  The answer is that the currently specified/resolved/intended/whatever base URI represents the primary resource, and the fragment identifier represents the secondary resource that “may be some portion or subset of the primary resource”, etc.

You could only actually be certain that “within the current document” was the case if you ignored resolution rules 1 and 2 of RFC 3986, which, according to the specification itself, you just can’t do.

Retrieval actions are relative to the primary resource, not the current document.  If you specify that the primary resource is not the current document, and you don’t already happen to have the content of the primary resource hanging around in memory, then it most certainly MUST result in another retrieval if you intend to dereference that reference.  There’s just no way around it, unless the absolute URI you resolved from the relative reference and the specified base URI *happens to be* the same as the URI used to retrieve the resource you are processing.

Then, and ONLY then, is your statement true because the URI of the current resource and the primary resource identified by the correctly generated absolute URI are the same, character-for-character.
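
That test can be written down directly. The sketch below, again assuming Python's `urllib.parse`, compares the primary resource of the resolved reference with the URI of the resource being processed, character for character. It encodes the position I am arguing for in this message rather than anything RFC 3986 states in these words, and it ignores the "already in memory" case. The function name `needs_retrieval` and the example.org URI are hypothetical.

```python
from urllib.parse import urldefrag, urljoin


def needs_retrieval(base_uri: str, reference: str, current_uri: str) -> bool:
    """True when the primary resource of the resolved reference differs,
    character for character, from the URI of the resource being processed."""
    primary, _fragment = urldefrag(urljoin(base_uri, reference))
    return primary != urldefrag(current_uri).url


CURRENT = "http://example.org/driver.xml"  # hypothetical URI of the current resource

# Base URI specified in content points somewhere else: another retrieval.
print(needs_retrieval("http://www.dictionary.com/a.html", "#apple", CURRENT))
# Base URI happens to equal the current resource's URI: no retrieval needed.
print(needs_retrieval(CURRENT, "#apple", CURRENT))
```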

> 
> To return to the topic with which this note started:
> 
> A number of people unhappy with being told both (a) and (b) have suggested on this list that they think xml:base has done something wrong; some have suggested that HTML ‘base’ does better.  Since the only role xml:base plays in this scenario is identifying the base URI and referring normatively to RFC 3986, and since a corresponding example is trivially constructible with HTML, I have thus far found their analysis unconvincing.  

It is understandable that they are unhappy being told that there is more than one way to interpret an explicit specification of a base URI.  There isn’t.  The RFC states this fact.

HTML base can’t by definition be better, because it too is specified within content as a way to satisfy the first resolution step of RFC 3986. In purpose and function in relation to resolving base URIs, they are the same. In practice, they define different semantics for different content vocabularies, many of which are often closely related.  That’s still irrelevant to the discussion.

Back to Michael’s original statement on the browser vendors’ implementations, you can’t always assume they’re correct, so just because an example works in HTML doesn’t mean that it’s following the specifications.  There are far too many broken web pages in the world, and browsers have made the marketing decision that they should “do their best” in the face of inconsistent content.

If you want to follow the specs, then you need to test against the normative content of the specifications themselves, not any particular implementation of those specifications because they may intentionally (design) or unintentionally (bugs) behave differently.

> For myself, I’m not sure I think there is a problem here.  The meaning of the construct is perfectly well defined; if you don’t want your references to have the double meaning of referring both to something in the current document and something in what you believe to be a different document, then it’s not hard to avoid constructing such references.  Other things being equal, having a less surprising meaning attached to the example would probably be preferable, but it is very very difficult to make rules for URI resolution which guarantee that “#foo” will always refer to the current document and which allow the in-content specification of arbitrary base URIs and which have no odd consequences or other blemishes.  
> 
> And in some cases, the ‘double’ interpretation is perfectly sane and sensible, not a problem at all.  In a driver document which uses XInclude to embed various subordinate documents, the line in the source that looks like
> 
>    <xi:include href="http://www.dictionary.com/a.html"/>
> 
> might result in a div like that shown above in the output.  The reference “#apple” is, when it occurs within http://www.dictionary.com/a.html, definitely a reference to the fragment “apple” within that document.  Since that document has now been embedded in a larger document, it is equally definitely also a reference to the fragment “apple” in the current document, even if the driver file for the XInclude processing is at some very different URI.

In the above case, the resource containing the include directive would qualify as an encapsulating entity in Section 5.1 of RFC 3986, so there’s really no issue.  The relative URIs defined with HTML’s BASE element should be resolved independently of any references within the encapsulating entity.

Any scoping of the HTML BASE element in a.html should no longer apply outside the explicit scope of that resource.  That is, if I had '<a name="apple">…</a>…<a href="#apple">..</a>' in a.html and '<e xml:id="apple"></e>…<f uri="#apple"/>' in the containing document, each would be independently identifiable with corresponding absolute URIs that should not conflict.

> 
> ********************************************
> C. M. Sperberg-McQueen
> Black Mesa Technologies LLC
> cmsmcq@blackmesatech.com
> http://www.blackmesatech.com
> ********************************************
> 
> 

--
Andrew S. Townley <ast@atownley.org>
http://atownley.org


