Re: xml:base and fragments

  • From: "Andrew S. Townley" <ast@atownley.org>
  • To: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>
  • Date: Wed, 10 May 2017 20:29:23 +0200

> On May 10, 2017, at 2:28 PM, C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com> wrote:
> 
> 
>> On May 10, 2017, at 5:49 AM, Andrew S. Townley <ast@atownley.org> wrote:
>> 
>> ...
>>> On May 10, 2017, at 5:37 AM, C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com> wrote:
>>> 
>>> 
>>> Does that mean you believe they are not present within the document 
>>> currently loaded?   
>>> 
>>> Or that the document currently loaded is not identified by the URI from
>>> which it was loaded?
>> 
>> Neither.
>> 
>> In the first instance, I have no idea if the identified fragment ‘#node1’ is actually present in the document or not.  The spec says it should be and that I can’t go looking for it elsewhere because it’s a fragment.
> 
> I think there is a slight slip in your paraphrase.  The spec does not say 
> it “should” be in the current document.  The spec says that the reference 
> is “defined as” pointing to a fragment in the current document.

That “the target of that reference is defined to be within the same entity (representation, document, or message) as the reference” does not guarantee that the target of that reference actually *exists* within the entity in question.

No slip, accidental or otherwise. :)

>> 
>> The only things I do know from the XML I wrote in both examples and the interpretation of both RFC 3986 (all sections) and the XML Base Recommendation is that:
>> 
>> 1) there is an abstract or physical resource identified by the sequence of characters "http://www.microsoft.com/bar.xml”
>> 
>> 2) the sequence of characters "http://www.microsoft.com/bar.xml” should be the base URI used when evaluating all relative URI references within the scope of the element e having an xml:base attribute value equal to this sequence of characters
> 
> I have no idea what operations you have in mind as the meaning of 
> the term “evaluating”.  So on this you get a free pass; you could mean
> absolutely anything, including assigning a Wine Spectator-style 
> score to them.

The intent was that “evaluating” means a combination of parsing, identification and “resolution” per 5.1.

> If you mean that the base URI you mention is the base URI used
> when resolving the relative reference, then yes, I don’t recall
> anyone in the discussion suggesting anything different. 

Actually, accurately ascertaining precisely what the appropriate base URI and resulting URI should be for any given situation is precisely the question that caused me to join this discussion in the first place.

> 
>> 
>> 3) there is an abstract or physical resource identified by the sequence of characters "http://www.microsoft.com/bar.xml#node1”
>> 
>> 4) using the fragment rules defined by Section 3.5, the abstract or physical resource identified by the character sequence "http://www.microsoft.com/bar.xml#node1” is a secondary resource of the abstract or physical primary resource identified by the character sequence "http://www.microsoft.com/bar.xml”
>> 
>> 5) both the abstract or physical resource identified by the character sequence "http://www.microsoft.com/bar.xml” and the abstract or physical resource identified by the character sequence "http://www.microsoft.com/bar.xml#node1” pass the Same-Document Reference tests specified by Section 4.4 and referencing Section 5.1 using the base URI "http://www.microsoft.com/bar.xml” determined according to the precedence rules defined in Section 5.1
>> 
>> 6) As a result of #5, subsequent dereferences to either URI should not result in a new retrieval action for the abstract or physical resource having a base URI identified by the character sequence "http://www.microsoft.com/bar.xml”
> 
> Yes, so far.
> 
>> 
>> Note that #6 may not be the behavior expected by the content author,
> 
> Yes; clearly there is a non-empty set of people who do not want both
> inference 2 and inference 6 in your list, and also two non-empty sets of
> people, some of whom think it obvious that inference 2 must take
> precedence and some that inference 6 must take precedence, and who
> are unprepared to entertain the notion that both are prescribed by the
> spec and both thus apply.  
> 
>> but it is, to the best of my understanding, what is implied by the RFC.  
> 
> And mine.  (And, I think, others’ as well.)
> 
> It may be worth pointing out that if the current document has the document
> URI http://example.com/doc.xml, then 6 seems to some readers to entail
> the proposition that the reference in question (whose absolute form is 
> http://www.microsoft.com/bar.xml#node1) identifies a fragment named
> “node1” in the current document, and thus necessarily identifies
> the fragment http://example.com/doc.xml#node1.  

And now we circle back to the original “both” premise that sparked my disagreement.  I can assure you that I am not one of the readers in the camp you mention in the above paragraph.

If a base URI was not specified, then yes.  With the specification of a base URI, no, because the base URI and the URI used to retrieve the entity MUST be taken to identify different abstract or physical entities until given sufficient information by an authority recognized by the owner of the user agent processing the resource that both URIs are, in fact, intended by someone other than the creator or operator of the user agent to identify the same resource (HT John Cowan for prompting a more precise and robust answer re: identity).

>> This also means that for the purposes of processing the document and resolving references to "http://www.microsoft.com/bar.xml” within the scope of element e, there is no way to trigger actual retrieval of this resource through link targets with this character sequence.  
> 
> I don’t know what you mean; if you mean that inferences 2 and 6 together
> seem to suggest that software can choose either to dereference the 
> reference to #node1 by retrieving http://www.microsoft.com/bar.xml#node1
> or by looking for node1 within http://example.com/doc.xml, and that the
> author has in this case no reliable way of forcing one choice rather than 
> another, then I agree.  (The author does have a reliable way, of course:
> writing what the author means.)

No. I mean that as a result of 2 & 6, the software “should not” initiate another retrieval action since according to Section 4.4 the URI within the element qualifies as a “same document” reference.

The implication in light of XML Base seems to be that dereferencing the empty relative reference (i.e., the base URI) must return the element itself, rather than whatever octets and meta-data would result from dereferencing the URI in the absence of an explicit xml:base attribute value, because that attribute value locks you into Section 4.4’s “same document” reference semantics.

A content author might assume in the absence of the discussion and depth of analysis we’ve been undertaking regarding Section 4.4 that:

<e xml:base="http://example.com/xyzzy">
  <bar href="" />
</e>

would represent the same behavior as:

<e>
  <bar href="http://example.com/xyzzy"/>
</e>

Which, due to Section 4.4, it clearly does not.  Section 4.4 along with Section 5.1 lock you into the scope of the element content in the first instance, but allow actual retrieval actions of the href attribute value in the second.

Without going into more detail, I can think of several cases where machine-generated content may result in the first form unintentionally when the second form was actually the desired outcome.  This is the simplest case that illustrates the point.
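A minimal sketch of the difference between the two forms, using Python's standard urllib.parse. The helper name is_same_document and the retrieval URI http://example.org/doc.xml are assumptions made for illustration only; the test itself is the Section 4.4 comparison of the resolved target against the base URI with any fragment set aside.

```python
from urllib.parse import urljoin, urldefrag

def is_same_document(base, ref):
    """Section 4.4 test: target equals base once fragments are set aside."""
    target = urljoin(base, ref)
    return urldefrag(target).url == urldefrag(base).url

# First form: <e xml:base="http://example.com/xyzzy"><bar href=""/></e>
BASE_FROM_XML_BASE = "http://example.com/xyzzy"
first = (urljoin(BASE_FROM_XML_BASE, ""),
         is_same_document(BASE_FROM_XML_BASE, ""))

# Second form: <e><bar href="http://example.com/xyzzy"/></e>
# Hypothetical retrieval URI standing in as the base URI (no xml:base present).
RETRIEVAL_URI = "http://example.org/doc.xml"
second = (urljoin(RETRIEVAL_URI, "http://example.com/xyzzy"),
          is_same_document(RETRIEVAL_URI, "http://example.com/xyzzy"))

print(first)
print(second)
```

Both forms resolve to the same character sequence, but only the first qualifies as a same-document reference, which is the distinction at issue here.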

> 
>> The only way to actually trigger retrieval of this resource would be to provide a reference to the "http://www.microsoft.com/bar.xml” URI outside the scope of element e.
> 
> Uh, I think there is a slip here.  Being outside element e surely doesn’t
> guarantee that the base URI is not http://www.microsoft.com/bar.xml;
> if we are in an HTML document, the ‘base’ element may specify that as
> the base URI; if we are in an XML document, there may be another
> xml:base attribute.  But with that trivial proviso, yes, I think you are correct
> here.

And this is the scenario I was describing where it does.

In the general case, no, it does not guarantee that the reference outside the scope of element e wouldn’t have the same base URI.  However, in my examples, which must be taken as intended to represent the entirety of the resource content for the example in question, it most certainly would.

We aren’t in an HTML document in any of my examples.  With HTML, the 5.1 extension point implementation according to the HTML specification only allows one effective BASE element to be interpreted—the first one.  Any others are ignored, and the one that applies applies to the entire scope of the document.

In HTML, therefore, there’s no way to “escape” the same-document reference scenario defined in Section 4.4 with URI fragment identifiers.

Since XML Base allows you finer grained control over the scope of applicability, you have a way to author your way out of the Section 4.4 box by applying or omitting relevant xml:base attributes at various locations in the resource, as I illustrated in my original examples.
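As a rough illustration of that scoping, the following Python sketch walks a small document and resolves each href against the base URI in effect at that point. The element names (doc, link), the helper resolve_links and the sample document are all assumptions made for the example; they are not the original examples from earlier in the thread, and the retrieval URI is simply taken to be http://www.saxonica.com/foo.xml.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin, urldefrag

# ElementTree exposes xml:base under the predeclared XML namespace.
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"

# Hypothetical document: one link inside the scope of e, two outside it.
DOC = """<doc>
  <e xml:base="http://www.microsoft.com/bar.xml">
    <link href="#node1"/>
  </e>
  <link href="#node1"/>
  <link href="http://www.microsoft.com/bar.xml"/>
</doc>"""

def resolve_links(elem, base, out):
    """Collect (base, target, same_document) for every href, in document order."""
    # An xml:base attribute changes the base URI for this element and its children.
    base = urljoin(base, elem.get(XML_BASE, ""))
    href = elem.get("href")
    if href is not None:
        target = urljoin(base, href)
        same_doc = urldefrag(target).url == urldefrag(base).url
        out.append((base, target, same_doc))
    for child in elem:
        resolve_links(child, base, out)
    return out

results = resolve_links(ET.fromstring(DOC), "http://www.saxonica.com/foo.xml", [])
for row in results:
    print(row)
```

Under these assumptions, only the third link, which sits outside the scope of element e, fails the same-document test and would therefore be a candidate for a new retrieval action.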

> 
>> 
>> What is present in the “document currently loaded”, e.g. the octet stream (we don’t have the metadata),
> 
> I’m not sure why we don’t have “the” metadata.

Because we didn’t actually retrieve the documents in the above examples.  We only described it, and we didn’t describe the associated metadata.  So, no, we don’t have it.

If we had dereferenced the URI in question using a user agent by triggering a retrieval action, then we would have the associated metadata.

In this example, we only have the representations of the resource content I created and the URI provided by Michael as the original example.

> 
>> is a set of identifiers which, by defining them within the content of that octet stream, gives them – the identifiers, and, by indirection the resources they identify – form and meaning in the universe because we have identified them, and we’re referencing them within the octet stream itself.
>> 
>> What is NOT present in the “document currently loaded”/octet stream is the corresponding octet stream(s) and meta-data that may result from any such retrieval action performed by dereferencing those identifiers independent of the same-document tests and I believe referencing one of Michael’s original concerns.
>> 
>> Due to Section 4.4 and the criteria for “same-document”, the referenced #node1 target is expected to appear somewhere as a child of element e, but, again, due to external factors and potential omissions by the content author, it may not, in fact, be present.
> 
>> 
>> Stated more simply, I may talk about a nice, juicy, tart and perfectly crisp apple within the context of this email, but that doesn’t mean either of us get to eat it.  We only have an identifier for it.  We don’t have the result of retrieving that identifier, and, due to practical issues relating to embedding fruit as attachments in email, said apple does not appear in the text of the email itself.  Further, if we had a base URI for this email and defined the reference to the apple in terms of it, we’d be denied access to the apple forever because we “should not” make additional attempts to find it by resolving the reference.
>> 
>> In the second instance, there is no question that if any of the example documents I created were the result of dereferencing the character sequence "http://www.saxonica.com/foo.xml" for a retrieval action, then those octet sequences and metadata would be a physical resource identified by the character sequence "http://www.saxonica.com/foo.xml.”
>> 
>> The question still remains, apparently, what is the correct character sequence that should be the value of the base URI within the scope of the element e.
> 
> ?!  

And I said this because you repeatedly referred to the URI used to retrieve the entity resource as the base URI in spite of content-level, explicit specifications to the contrary.

If not, then this question is closed, but also given your “many camps” description above, perhaps it is still not addressed satisfactorily.

At this stage, my inclination is to leave it to the pundits to make their own judgments, but that’s how the thread started in the first place.  I don’t think either one of us is in the position of authority to definitively and canonically annotate or further elucidate the text of the RFC and XML Base.  We can only provide commentary and analysis to convince ourselves of a position and potentially, as a side effect, others who might find this discussion at some future state.

> 
>> 
>> However, I don’t think this question is ambiguous or hard to answer if you fully consider all of the text within the RFC and the XML Base Recommendation and connect the dots according to the rules defined therein.
>> 
>> The only possible value for the base URI used within the scope of the element e MUST be "http://www.microsoft.com/bar.xml” because the specs say so. Not me, not you, not Roy, not God, not Allah, but just those two documents, working together in the way that they were designed to allow future extensibility and still define a coherent architecture in the cases where that extension point was not required or used within the given resource being processed.
> 
> This peroration suggests that you believe you are arguing against 
> people who deny that the relevant base URI is the one you mention
> or who deny that the absolute form of the reference is the one you
> mention.  I think if you reread carefully the writings of your interlocutors
> you will find that this is not true.  

Actually, the point of my peroration was that it wasn’t about what anyone but the text of the spec says.  I’m not arguing against anyone.  I’m trying to communicate my understanding of the way the layered specifications work in practice and posit that the specification, on its own, is lucid in explaining how to identify the appropriate base URI to use and what assumptions you can make and actions that you can take in using that base URI to resolve relative URI references appearing in arbitrary content, past, present and future.

> As for Roy Fielding, I asked a question about your interpretation of
> RFC 3986, trying to understand what position you are trying to take.
> If you choose to misunderstand my question, I don’t suppose I can
> force you to answer.

Your question was based on an assertion of mine resulting from an inappropriate assumption about loading the document.  I already agreed that my assumption was incorrect, so I don’t have a way to answer the question you asked, which I’m presuming was this one:

>> So are you saying that Roy Fielding was wrong when he responded 
>> in [1] to Paul Grosso’s inquiry in [2] by saying that in cases analogous
>> to the one MK describes, the reference is a same-document reference
>> and can and should be retrieved from the document that has 
>> been already loaded (in MK’s case the document at www.saxonica.com)?
>> 
>> [1] https://lists.w3.org/Archives/Public/uri/2004Jan/0009.html
>> [2] https://lists.w3.org/Archives/Public/uri/2004Jan/0007.html

> 
>> 
>> I fail to see how RTF has a different interpretation in the exact case I described than I do.  It’s a pretty clear, one-word answer.
>> 
>> The rest of his answer is commentary on why this is true. What he says doesn’t change the fact that if a base URI is defined within the content of a resource, then relative references within a given resource within the scope of that base URI definition must be interpreted as relative to the defined base URI.
> 
> No.  The rest of his answer includes an explanation of why the same-document
> rules allow the fragment in question to be retrieved from the current URI,
> as described above in your inference number 6.  That was what I understood
> you to be denying.

At this stage, I have no earthly idea what you mean by “current URI” in the above paragraph.  Is this the base URI, a resolved relative URI or the URI from which the resource was retrieved?

The only thing I was “denying” was that there was an additional required dereference retrieval action against the base URI discovered in the content so that Section 4.4’s “same document” semantics applied.

As stated 4? times now, that assumption was incorrect and the processing user agent must consider the base URI as relating to the current resource, regardless of whether said base URI has been dereferenced or not.

You don’t seem to like my sweeping generalizations, but after a careful re-read of both your references again, I still stand by my “the rest is commentary on why he said ‘Yes.’” response, because that’s ultimately what it is.

Since you again brought up the question of whether or not I agree with Roy on his response in [1] above, I will provide some thoughts which should be taken independently of anything else we’ve been talking about.

I do find it troubling that in trying to add precision Roy chose to respond in a way that leads to your original “both” premise, because at the point where the retrieval action occurs, per the definition of the RFC, the Fragment cannot be interpreted as part of the protocol or scheme.  The fragment must be interpreted by the user agent after the resource has been retrieved.

Given that the retrieval process takes place as a sequence of steps, the user (or developer) may have the impression that http://www.example.com/stat/doc.html#foo and http://www.example.com/stat/blarg#foo are the same, and, except from a technical perspective relating to HTTP and the RFC, they are the same as far as most users are concerned.

However, since we’re talking in the land of RFC’s and specifications, these distinctions are material, so I will attempt to illustrate my perspective.

1. A URI dereference via retrieval action is requested for “http://www.example.com/stat/doc.html#foo” assuming this is not within the scope of 4.4 Same Document reference semantics.

2. The URI is parsed into its scheme representation according to Section 4.3:

>    Scheme specifications will not define
>    fragment identifier syntax or usage, regardless of its applicability
>    to resources identifiable via that scheme, as fragment identification
>    is orthogonal to scheme definition.

This means that the actual resource dereference retrieval action by the user agent uses the URI "http://www.example.com/stat/doc.html”

3. The octet stream and metadata are processed by the user agent to assemble a representation based on understanding the structure of the resource’s media type.

4. The resource is interpreted according to Section 5.1 to identify an appropriate Base URI for relative URI resolution according to Section 5.  In this case, the interpretation results in assigning the value of "http://www.example.com/stat/blarg” as the base URI for the loaded resource.

5. The original request URI is parsed for any fragment identifier that needs to be resolved as a relative secondary resource identifier within the primary resource loaded by the user agent having the base URI of "http://www.example.com/stat/blarg”.

6. Resolution of the ‘#foo’ fragment identifier takes place according to the rules of the RFC and the browser does not initiate a new retrieval action, doing whatever is appropriate to display the secondary resource to the user as defined by the media type specification

According to this sequence the original URI with fragment identifier http://www.example.com/stat/doc.html#foo is not dereferenceable until the primary resource has been loaded.  The very process of dereferencing that primary resource defines a different base URI for resolution of the fragment portion of the URI than the URI from which the resource was loaded, so, technically, and I do mean, technically, the ‘#foo’ secondary resource only exists as a secondary resource of the primary resource identified by the URI specified in the content as the base URI for fragment resolution according to Section 4.4.

There is no way, except accidentally, to interpret that the http://www.example.com/stat/doc.html#foo secondary resource, when dereferenced, actually exists as part of the primary resource http://www.example.com/stat/doc.html because the act of dereferencing the http://www.example.com/stat/doc.html URI hides the existence of this URI from the fragment resolution mechanism defined within the RFC itself.

So, from the *user* perspective, the http://www.example.com/stat/doc.html#foo secondary resource does, in fact, exist because the user sees it associated with this URI which they may see in their browser.  However, technically, and from the perspective of the wording of the RFC itself, it does not – it cannot – exist, because it is never possible to resolve the secondary URI fragment in relation to the primary URI from which the resource was originally loaded.
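The sequence above can be sketched with Python's standard urllib.parse. This mirrors the reading given in steps 1 through 6 only; it is not a claim about how any particular browser is implemented, no network retrieval is performed, and the content-declared base URI is simply assumed to have been found in the loaded resource. The variable names are illustrative.

```python
from urllib.parse import urldefrag, urljoin

# Step 1: the URI the user asked for.
request_uri = "http://www.example.com/stat/doc.html#foo"

# Step 2: the fragment is separated before the retrieval action.
retrieval_uri, fragment = urldefrag(request_uri)

# Steps 3-4: assume the loaded content declares this base URI (Section 5.1).
content_base = "http://www.example.com/stat/blarg"

# Steps 5-6: the fragment is resolved against the content-declared base,
# without any further retrieval.
secondary = urljoin(content_base, "#" + fragment)

print(retrieval_uri)
print(secondary)
```

On this reading the secondary resource that ends up being resolved is identified relative to the content-declared base URI, not relative to the URI that was actually retrieved.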

>> 
>> The only material difference between the case cited from 2004 and the examples I gave was the syntax used for specifying the desired base URI within the content.  In the case of the 2004 question, it was HTML’s BASE tag, and in the case of our discussion, it’s the xml:base attribute.
> 
> No.  In the 2004 question, Paul Grosso provided both HTML base and xml:base examples.

Ok. Yes.  I read that, but then was more focused on Roy’s response, which only included the HTML content reference and the XML URIs.  When I went back during the writing of the reply, I forgot.
 
>> 
>> At this point, I also feel that I’ve illustrated more than once how the two specifications work together to clearly identify the value of a base URI that should be applied given knowledge of the content vocabulary and the value of any relative URI.
> 
> Yes, I think you have.  And Michael Kay.  And a good many other people.
> 
> Since you began by suggesting that RFC 3986 was irrelevant to the case,
> I had then the impression that you thought what it said was of no concern
> in the example in question.
> 
> I’m happy to learn that I misunderstood your position.  

Fantastic.

To restate:

RFC 3986 defines the concept of a URI, a base URI, a relative URI and a mechanism for resolving relative URI references against a base URI.  The RFC also defines a future-focused extension point that allows content formats a first opportunity to specify what the value of a base URI should be when resolving relative references to any degree of granularity possible to specify in the definition of the content format itself.

XML Base and HTML, as content formats, both provide a mechanism for identifying the base URI to be used for relative URI resolution according to RFC 3986.  However, the mechanisms they choose to use to define the value of the base URI to be used by RFC 3986 compliant software are totally and completely orthogonal to and independent of RFC 3986 except for two things:

1) said content specification must make a normative reference that it is able to provide base URIs according to the requirements specified in RFC 3986, and
2) said content specification must define the rules and scope in which the character sequences it provides according to this interface are to be used by RFC 3986.
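For reference, the Section 5.1 precedence that this extension point plugs into can be sketched as below. The function name establish_base and its parameter names are assumptions made for illustration; the ordering (embedded in content, then encapsulating entity, then retrieval URI, then application default) follows Sections 5.1.1 through 5.1.4 of the RFC.

```python
from urllib.parse import urljoin

def establish_base(embedded=None, encapsulating=None, retrieval=None, default=None):
    """Pick a base URI following the RFC 3986 Section 5.1 precedence order."""
    # Sections 5.1.2 to 5.1.4: fall back outward when nothing is embedded.
    outer = encapsulating or retrieval or default
    if embedded is None:
        return outer
    # Section 5.1.1: a base URI embedded in the content takes precedence.
    # If it is itself relative, it is resolved against the outer base.
    return urljoin(outer, embedded) if outer else embedded

# xml:base (or HTML's base element) supplies the 'embedded' value.
print(establish_base(embedded="http://www.microsoft.com/bar.xml",
                     retrieval="http://www.saxonica.com/foo.xml"))
print(establish_base(retrieval="http://www.saxonica.com/foo.xml"))
```

How the embedded value is found, and over what scope it applies, is exactly the part left to the content specification.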

I’d also like to state that the majority of the discussion seems to have been not on the identification of the appropriate base URI that should be used but the consequences and mechanics of using that base URI in conformance with RFC 3986.

The part of the discussion that I joined was originally focused on how the value of a base URI was to be established using xml:base specifically rather than how that value was used according to RFC 3986.  However, I’m happy, because I refreshed my understanding and clarified some edge cases that I hadn’t previously been forced to consider.  So, all good.

Cheers,

ast
--
Andrew S. Townley <ast@atownley.org>
http://atownley.org


