[XSL-LIST Mailing List Archive Home] [By Thread] [By Date] [Recent Entries] [Reply To This Message] Re: is there a way to hash an element?
I agree with you, computing a hash and adding it as an attribute to the top level element, then grouping on the hash looks like a good strategy. I may have missed something in this thread, but I don't recall seeing a specification of your matching rules that is sufficiently precise to enable one to write a hash algorithm. We need to see a definition: "two elements A and B are considered to be the same if and only if they satisfy the following conditions: ....". Michael Kay Saxonica > On 13 Jun 2016, at 02:17, Graydon graydon@xxxxxxxxx <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote: > > On Sat, Jun 11, 2016 at 05:21:09PM -0000, Dimitre Novatchev dnovatchev@xxxxxxxxx scripsit: > Hi Dimitre -- >> Actually, I believe that calling deep-equal() can be more efficient >> than comparing hashes. >> >> The reason is simple: deep-equal() most probably returns false at the >> first possible moment -- for example, noticing that an element has >> different attributes than its counterpart. >> >> On the other side, with hashing, the hashes for the two whole >> subtrees have to be calculated and only after that they can be >> compared. >> >> To summarize, with the exception of the case when the two subtrees are >> equal, deep-equal may perform faster than generating and comparing >> hashes on the subtrees. > > I've got one input document with ~5000 trees that are mappable to XSD > schema definitions; about half are complexTypes. Many are structurally > the same but have different names. (All ~5000 have unique names.) > > The idea is to group them by structural sameness; deep-equal, even very > efficiently implemented deep-equal, gives me n^2 as I have to go through > the whole tree for each element and ask "are you like me?" pairwise. > Some of the equivalent structures will have a lot of matches -- hundreds > -- where I can't expect deep-equal to fail quickly and thus efficiently. > > Going through and decorating every element with its hash value > (@hash="something") and then using for-each-group on the lot on the > basis of the hash gives me 2n. Even if it's a very naive hash > implementation, I'd expect 2n to beat n^2 performance. > > Am I missing something? > > (I'll certainly keep deep-equal in mind if the hash approach has > unacceptable performance.) > > -- Graydon
|
PURCHASE STYLUS STUDIO ONLINE TODAY!Purchasing Stylus Studio from our online shop is Easy, Secure and Value Priced! Download The World's Best XML IDE!Accelerate XML development with our award-winning XML IDE - Download a free trial today! Subscribe in XML format
|