Re: question about generate-id()


Subject: Re: question about generate-id()
From: ac <ac@xxxxxxxxxxxxx>
Date: Wed, 04 Aug 2010 22:10:47 -0400

Hi,

Except if other documents also reference those ids. In that case you may prefer to append a timestamp (e.g. the transformation timestamp) to newly generated ids, and leave previously defined ids (authored or generated) unchanged.
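
A minimal XSLT 1.0 sketch of that idea, under stated assumptions: the parameter name "run-stamp" is hypothetical and would be supplied by whatever invokes the transformation, and the element name "section" is borrowed from the example quoted further down. Existing id values are copied through untouched; only new ones get a stamped value.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical parameter: the caller passes a stamp made only of
       name characters, e.g. 20100804T221047 (no colons or spaces). -->
  <xsl:param name="run-stamp"/>

  <!-- Identity rule: copy everything else, including existing ids. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Sections without an id get generate-id() plus the run stamp. -->
  <xsl:template match="section[not(@id)]">
    <xsl:copy>
      <xsl:attribute name="id">
        <xsl:value-of select="concat(generate-id(), '-', $run-stamp)"/>
      </xsl:attribute>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

This makes a clash between runs unlikely as long as each run uses a different stamp. It does not rule out a clash with an authored id that happens to have the same form.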

Cheers,
ac

On 2010-08-04 18:18, G. Ken Holman wrote:

At 2010-08-04 17:39 -0400, steve.majewski@xxxxxxxxx wrote:
> When our EAD/XML files are edited, we run them thru a stylesheet that checks
> that certain sections all have @id attributes, and if not, adds them
> using generate-id().

That sounds risky to me. I tell my students that one should use generate-id() for *every* element with an @id, and adjust every @idref attribute to use the translated value. There is a very small but real chance that an authored id attribute will match a generated id attribute.

The uniqueness of identifiers is guaranteed only when generate-id() is used for every identifier. This makes sense because generate-id() has no way of knowing which of your attributes are identifiers and which are not.

You've compounded the risk by running a document with generated identifiers through a process that again generates identifiers using the same implementation-defined algorithm. And you haven't protected the identifiers already in the input from colliding with the identifiers generated the second time.

> I've recently discovered that some of those files now have duplicate ids.
> I think we've had a misconception about the uniqueness of generated ids.
> A closer reading of M. Kay's book, as well as searching this list's recent
> archives, says that it's "guaranteed to be unique for every node that
> participates in a given transform".
> Additionally, in one of those other threads, Florent Georges wrote:
>   Yes.  And it is guaranteed to generate always the same ID when
>   called on the same node.
> I suspect that what was not explicitly stated but implied by that clause is
> that it means that it is unique for nodes *generated* in a given transform,
> and not including those ids that are passed thru and copied from the input
> to the output doc.

False. Every time a tree is created, be it from the source tree, from a document() or doc() function, from a temporary tree variable, that tree will be made up of nodes. Every node across all trees in the one transformation will have a unique identifier. Said differently, no two nodes across all trees in the one transformation will have the same identifier.

But that is all. Nothing is said about what the user uses for identifiers in the authored content.
> We have generated node ids from previous transforms. Usually, these do seem
> to be unique -- I suspect because of that additional condition above
> about "same node".
> I think the cases where we do have duplicates were when a new element was
> inserted above another of the same kind, with a previously generated id.
> This new node -- although having entirely different content -- is considered
> "the same node" in the sense that it has the same xpath, for example:
> /ead/archdesc/dsc/c01[1]/c02[1]
> ( the previous node, being "pushed down" to  //c02[2] )

The uniqueness of generated identifiers is *not* guaranteed from one transformation to the next. When you pass a document through a second transformation, the engine's determination of uniqueness starts from scratch, without any knowledge of any id values in your input document.

If you follow the scheme I tell my students, then you get back to being unique across all nodes ... the values simply change every time a transformation is performed.

> Am I (finally!) understanding this correctly?

I'm not sure, because I could not correlate your uses of "usually", "the cases where" and "these seem to" well enough to follow your explanation.

> Does the above sound like a reasonable and likely explanation of
> what's happening?

I think so if what you are finding is that:
<doc>
<section id="x">
</section>
<section>
<xref idref="x"/>
</section>
</doc>
... gets written out as:
<doc>
<section id="x">
</section>
<section id="gen-e3">
<xref idref="x"/>
</section>
</doc>
... which when you then add a new section:
<doc>
<section id="x">
</section>
<section>
</section>
<section id="gen-e3">
<xref idref="x"/>
</section>
</doc>
... gets transformed to become:
<doc>
<section id="x">
</section>
<section id="gen-e3">
</section>
<section id="gen-e3">
<xref idref="x"/>
</section>
</doc>
... because the new section is again the third element in the document ... and you have a duplicate. Note that my example values are not ones a processor could actually produce, because a generated id consists only of alphanumeric characters and so cannot contain a "-"; I'm using them only to illustrate my point. The numbering is also a simplification, since text nodes are nodes with unique identifiers too. But this is just an example.
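
For illustration, the "fill in the gaps" pattern that behaves this way might look like the XSLT 1.0 sketch below. The original poster's actual stylesheet is not shown in this thread, so this is an assumption about its general shape, using the "section" element from the example above.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity rule: copy everything through unchanged, including
       id values produced by an earlier run. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Only sections lacking an id get a generated one. Nothing checks
       the new value against ids already present in the input, so it
       may repeat a value copied through from a previous run. -->
  <xsl:template match="section[not(@id)]">
    <xsl:copy>
      <xsl:attribute name="id">
        <xsl:value-of select="generate-id()"/>
      </xsl:attribute>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```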

Now, if you follow my advice to students, then:
<doc>
<section id="x">
</section>
<section>
<xref idref="x"/>
</section>
</doc>
... gets written out as:
<doc>
<section id="gen-e2">
</section>
<section id="gen-e3">
<xref idref="gen-e2"/>
</section>
</doc>
... which when you then add a new section:
<doc>
<section id="gen-e2">
</section>
<section>
</section>
<section id="gen-e3">
<xref idref="gen-e2"/>
</section>
</doc>
... gets transformed to become:
<doc>
<section id="gen-e2">
</section>
<section id="gen-e3">
</section>
<section id="gen-e4">
<xref idref="gen-e2"/>
</section>
</doc>
... and if the first section had moved, then the idref= would also have changed to the new id= value for that first section. Every node with an ID gets written out not with the authored ID but with the generated ID ... and every IDREF gets written out with the generated ID of the node it points to.
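
One way to sketch this scheme in XSLT 1.0 is shown below. It is not a stylesheet taken from this thread: the key name "by-id" is made up for the example, and the attribute names "id" and "idref" and the element name "section" follow the sample documents above.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Index every element by its authored id value. -->
  <xsl:key name="by-id" match="*[@id]" use="@id"/>

  <!-- Identity rule for everything not handled below. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Replace each authored id with its owner's generated id. -->
  <xsl:template match="@id">
    <xsl:attribute name="id">
      <xsl:value-of select="generate-id(..)"/>
    </xsl:attribute>
  </xsl:template>

  <!-- Sections with no id at all also receive a generated one. -->
  <xsl:template match="section[not(@id)]">
    <xsl:copy>
      <xsl:attribute name="id">
        <xsl:value-of select="generate-id()"/>
      </xsl:attribute>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Point each idref at the generated id of the element it names.
       A dangling idref (no matching id) comes out as an empty string. -->
  <xsl:template match="@idref">
    <xsl:attribute name="idref">
      <xsl:value-of select="generate-id(key('by-id', .))"/>
    </xsl:attribute>
  </xsl:template>

</xsl:stylesheet>
```

Because every id in the output comes from generate-id() in the same run, the values are unique within that output, and the key lookup keeps each reference attached to the right element even though the values change from run to run.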

This comes up also in my XSL-FO instruction, because when you are aggregating multiple XML documents into a single XSL-FO output, and you are dealing with user-authored id values, you cannot use them as is because the value space for each document is independent. It would be too easy for two documents to have the same ID, so you cannot put that ID into the XSL-FO because that would create a conflict.

So, by following my rule of thumb, where *every* ID gets replaced with that node's generated identifier and every corresponding IDREF gets replaced with the referenced node's generated identifier, everything is safely identified across all documents being aggregated and there are no ambiguous references.

I hope this helps.

. . . . . . . . . . . . Ken
--
XSLT/XQuery training:   after http://XMLPrague.cz 2011-03-28/04-01
Vote for your XML training:   http://www.CraneSoftwrights.com/s/i/
Crane Softwrights Ltd.          http://www.CraneSoftwrights.com/s/
G. Ken Holman                 mailto:gkholman@xxxxxxxxxxxxxxxxxxxx
Male Cancer Awareness Nov'07  http://www.CraneSoftwrights.com/s/bc
Legal business disclaimers:  http://www.CraneSoftwrights.com/legal

Current Thread
question about generate-id()
  steve.majewski@xxxxxxxxx - 4 Aug 2010 21:39:53 -0000
    G. Ken Holman - 4 Aug 2010 22:31:41 -0000
      ac - 5 Aug 2010 02:10:59 -0000 <=
