
Re: question about generate-id()

Subject: Re: question about generate-id()
From: ac <ac@xxxxxxxxxxxxx>
Date: Wed, 04 Aug 2010 22:10:47 -0400
Hi,

Except if you have other documents also referencing those ids ... in which case you may prefer to append timestamps (e.g. transformation timestamps), for example, to new generated ids and not change previously defined (e.g. authored or generated) ids.
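
A minimal sketch of that idea in XSLT 1.0, assuming the elements of interest are called `section` (as in the examples further down the thread) and that the caller passes a per-run stamp in a stylesheet parameter. The parameter name `run-stamp` and its default value are hypothetical; XSLT 1.0 has no function for the current time, so the value has to come from outside the stylesheet. Existing ids are copied untouched, and only elements without one receive a new id built from generate-id() plus the stamp:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical parameter: a stamp supplied by the caller for each run. -->
  <xsl:param name="run-stamp" select="'20100804T221047'"/>

  <!-- Identity template: copy everything, including existing id attributes. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Only sections that have no id yet get a new, stamped one. -->
  <xsl:template match="section[not(@id)]">
    <xsl:copy>
      <xsl:attribute name="id">
        <xsl:value-of select="concat(generate-id(.), '-', $run-stamp)"/>
      </xsl:attribute>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

This only keeps ids from different runs apart if the stamp really differs on every run, and it does not guard against an authored id that happens to have the same form.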

Cheers,
ac



On 2010-08-04 18:18, G. Ken Holman wrote:
At 2010-08-04 17:39 -0400, steve.majewski@xxxxxxxxx wrote:
When our EAD/XML files are edited, we run them through a stylesheet that checks that certain sections all have @id attributes and, if not, adds them using generate-id().

That sounds risky to me. I tell my students one should be using generate-id() for *every* element with an @id and adjusting any @idref attributes to use the translated values. There is an infinitesimal but possible chance that an authored id attribute will match a generated id attribute.


The uniqueness of identifiers is guaranteed only when generate-id() is used for every identifier. This makes sense because generate-id() has no way of knowing which of your attributes are identifiers and which are not.

You've magnified the risk by running a document that already contains generated identifiers through a process that again generates identifiers using the same implementation-defined algorithm. And you haven't protected the identifiers already in the input from colliding with the identifiers generated the second time.

I've recently discovered that some of those files now have duplicate ids.

I think we've had a misconception about the uniqueness of generated ids. A closer reading of M. Kay's book, as well as a search of this list's recent archives, says that it's "guaranteed to be unique for every node that participates in a given transform".

Additionally, in one of those other threads, Florent Georges wrote:

 Yes.  And it is guaranteed to generate always the same ID when called on the same node.

I suspect that what was not explicitly stated but implied by that clause is that it means that it is unique for nodes *generated* in a given transform, and not including those ids that are passed thru and copied from the input to the output doc.

False. Every time a tree is created, be it from the source tree, from a document() or doc() function, from a temporary tree variable, that tree will be made up of nodes. Every node across all trees in the one transformation will have a unique identifier. Said differently, no two nodes across all trees in the one transformation will have the same identifier.


But that is all. Nothing is said about what the user uses for identifiers in the authored content.

We have generated node ids from previous transforms. Usually, these do seem to be unique -- I suspect because of that additional condition above about "same node". I think the cases where we do have duplicates were when a new element was inserted above another of the same kind, with a previously generated id. This new node -- although having entirely different content -- is considered "the same node" in the sense that it has the same xpath, for example: /ead/archdesc/dsc/c01[1]/c02[1] (the previous node being "pushed down" to //c02[2]).

The uniqueness of generated identifiers is *not* guaranteed from one transformation to the next. When you pass a document through a second transformation, the engine's determination of uniqueness starts from scratch, without any knowledge of any id values in your input document.


If you follow the scheme I teach my students, then you get back to being unique across all nodes ... the values simply change every time a transformation is performed.

Am I (finally!) understanding this correctly ?

I'm not sure as I didn't really understand your explanation because I could not correlate your uses of "usually" and "the cases where" and "these seem to".


Does the above sound like a reasonable and likely explanation of what's happening ?

I think so if what you are finding is that:


<doc>
<section id="x">
</section>
<section>
<xref idref="x"/>
</section>
</doc>

... gets written out as:

<doc>
<section id="x">
</section>
<section id="gen-e3">
<xref idref="x"/>
</section>
</doc>

... which when you then add a new section:

<doc>
<section id="x">
</section>
<section>
</section>
<section id="gen-e3">
<xref idref="x"/>
</section>
</doc>

... gets transformed to become:

<doc>
<section id="x">
</section>
<section id="gen-e3">
</section>
<section id="gen-e3">
<xref idref="x"/>
</section>
</doc>

... because the new section is again the third element in the document ... and you have a duplicate. Note that my example values could not really come from generate-id(), because a generated id cannot contain a "-", but I'm using that form to illustrate my point. Counting elements would also be a poor algorithm, since text nodes are nodes too and need unique identifiers of their own. But this is just an example.
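
One way to catch this situation before it spreads is a small report stylesheet run over the edited file. The following is only a sketch in XSLT 1.0, assuming that identifiers are always held in attributes named `id`; the key name `by-id` is my own choice. It lists each id value that is used by more than one element:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Index every element by the value of its id attribute. -->
  <xsl:key name="by-id" match="*[@id]" use="@id"/>

  <xsl:template match="/">
    <!-- Visit only the first element of each group that shares an id value,
         and only groups with more than one member. -->
    <xsl:for-each select="//*[@id]
                          [count(key('by-id', @id)) &gt; 1]
                          [generate-id(.) = generate-id(key('by-id', @id)[1])]">
      <xsl:value-of select="concat('duplicate id: ', @id, ' (',
                                   count(key('by-id', @id)), ' elements)',
                                   '&#10;')"/>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

Run against the last document above, it should report gen-e3 as being used by 2 elements. Here generate-id() is used only to compare node identity inside a single transformation, which is the one thing it is guaranteed to be good for.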

Now, if you follow my advice to students, then:

<doc>
<section id="x">
</section>
<section>
<xref idref="x"/>
</section>
</doc>

... gets written out as:

<doc>
<section id="gen-e2">
</section>
<section id="gen-e3">
<xref idref="gen-e2"/>
</section>
</doc>

... which when you then add a new section:

<doc>
<section id="gen-e2">
</section>
<section>
</section>
<section id="gen-e3">
<xref idref="gen-e2"/>
</section>
</doc>

... gets transformed to become:

<doc>
<section id="gen-e2">
</section>
<section id="gen-e3">
</section>
<section id="gen-e4">
<xref idref="gen-e2"/>
</section>
</doc>

... and if the first section had moved, the idref= would also have changed to the new id= value for that first section. Every node with an ID gets written out not with the authored ID but with the generated ID, and every IDREF gets written out with the generated ID of the node it points to.
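
The scheme described above might be sketched in XSLT 1.0 roughly as follows. It assumes, as in the examples, that identifiers live in `id` attributes, that references live in `idref` attributes, and that `section` is the element that must always carry an id; the key name `by-id` is made up for the illustration. It is an outline under those assumptions, not a drop-in stylesheet for EAD:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Index every element by its authored id value (no DTD required). -->
  <xsl:key name="by-id" match="*[@id]" use="@id"/>

  <!-- Identity template: copy anything not handled below. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Every section, and every other element with an authored id,
       is written out with a generated id in place of the authored one. -->
  <xsl:template match="section | *[@id]">
    <xsl:copy>
      <xsl:apply-templates select="@*[name() != 'id']"/>
      <xsl:attribute name="id">
        <xsl:value-of select="generate-id(.)"/>
      </xsl:attribute>
      <xsl:apply-templates select="node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Every reference is rewritten to the generated id of its target. -->
  <xsl:template match="@idref">
    <xsl:attribute name="idref">
      <xsl:value-of select="generate-id(key('by-id', .))"/>
    </xsl:attribute>
  </xsl:template>

</xsl:stylesheet>
```

key() looks only in the document that contains the context node, so when several input documents are processed in one transformation each reference is resolved within its own document. A reference with no matching id comes out as an empty idref, which is easy to search for afterwards.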

This comes up also in my XSL-FO instruction, because when you are aggregating multiple XML documents into a single XSL-FO output, and you are dealing with user-authored id values, you cannot use them as is because the value space for each document is independent. It would be too easy for two documents to have the same ID, so you cannot put that ID into the XSL-FO because that would create a conflict.

So, by following my rule of thumb, *every* ID gets replaced with that node's generated identifier and every corresponding IDREF gets replaced with the referenced node's generated identifier. Everything is then safely identified across all documents being aggregated, and there are no ambiguous references.

I hope this helps.

. . . . . . . . . . . . Ken

--
XSLT/XQuery training:   after http://XMLPrague.cz 2011-03-28/04-01
Vote for your XML training:   http://www.CraneSoftwrights.com/s/i/
Crane Softwrights Ltd.          http://www.CraneSoftwrights.com/s/
G. Ken Holman                 mailto:gkholman@xxxxxxxxxxxxxxxxxxxx
Male Cancer Awareness Nov'07  http://www.CraneSoftwrights.com/s/bc
Legal business disclaimers:  http://www.CraneSoftwrights.com/legal
