[XSL-LIST Mailing List Archive Home] [By Thread] [By Date] [Recent Entries] [Reply To This Message]

Re: Tree Comparing Algorithm

Subject: Re: Tree Comparing Algorithm
From: "Michael Kay mike@xxxxxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Tue, 4 Feb 2020 00:49:31 -0000
Re:  Tree Comparing Algorithm
I haven't studied it in close detail, but I strongly suspect that the initial
processing of the input files is streamed, but at some stage in the processing
pipeline everything ends up in memory.

Martin's solution uses arrays, and array processing in Saxon is generally not
pipelined in the way that sequence processing (normally) is. For example,
operations such as filtering and mapping on sequences are generally pipelined
(whether or not the input is streamed), while the equivalent operations on
arrays will materialise the array in memory.

For example, if you do (child::*[@x][1]/node-name() = $Q), then whether or not
the child nodes are held in memory or streamed, Saxon will not build the
intermediate sequence child::*[@x] in memory; it will effectively do something
like

for each child::*
   if exists(@x)
      if (first)
          if (node-name() = $Q)
              return true
      else return false;

There's no equivalent of this for array processing right now. A construct like
[child::*/node-name()]?1 = X will materialize the array in memory, even if
child::* is streamed; and this doesn't count as a violation of streamability,
because we're not holding source nodes in memory, we're holding intermediate
computed results in memory.

There's no intrinsic reason for not pipelining operations on arrays, other
than the lesson I learnt many years ago as an undergraduate computer science
student: when you're doing optimisation, focus your efforts on the constructs
that are encountered most frequently. Today everyone is using sequences, and
not many people are using arrays.

Michael Kay
Saxonica

> On 3 Feb 2020, at 20:39, Martin Honnen martin.honnen@xxxxxx
<xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:
>
> On 03.02.2020 21:10, Vasu Chakkera vasucv@xxxxxxxxx wrote:
>> Thanks both. Martin's solution sort of worked, but it only gave me 21
>> children, but I had around 21000 nodes in the xml. I am not sure to what
>> depth the comparison is happening.
>
> It was solely an attempt to try to find some way to recursively process
> two documents with streaming at the same time, not an attempt to
> implement your particular algorithm.
>
> I have tested my code now on a large files, it seems to process lots of
> nodes judging by the output to the console and the length of processing,
> but it doesn't seem to use streaming when I look at the memory
> consumption (600MB  of input needed more than 2GB of memory), even if
> Saxon nowhere shows any -t message that input trees were built.
>
> Michael's comment on the way streaming is implemented in Saxon suggests
> that the whole attempt is futile, even if the code somehow manages to
> get by the streamability analysis.

Current Thread

PURCHASE STYLUS STUDIO ONLINE TODAY!

Purchasing Stylus Studio from our online shop is Easy, Secure and Value Priced!

Buy Stylus Studio Now

Download The World's Best XML IDE!

Accelerate XML development with our award-winning XML IDE - Download a free trial today!

Don't miss another message! Subscribe to this list today.
Email
First Name
Last Name
Company
Subscribe in XML format
RSS 2.0
Atom 0.3
Site Map | Privacy Policy | Terms of Use | Trademarks
Free Stylus Studio XML Training:
W3C Member
Stylus Studio® and DataDirect XQuery ™are products from DataDirect Technologies, is a registered trademark of Progress Software Corporation, in the U.S. and other countries. © 2004-2013 All Rights Reserved.