[XSL-LIST Mailing List Archive Home] [By Thread] [By Date] [Recent Entries] [Reply To This Message]

RE: Efficient recursive grouping in XSLT 1.0?

Subject: RE: Efficient recursive grouping in XSLT 1.0?
From: "Michael Kay" <mhk@xxxxxxxxx>
Date: Fri, 19 Mar 2004 00:34:09 -0000
xslt 1.0 group
Interesting problem.

The thing that seems to make a dramatic difference to Saxon's speed on this
transformation is to introduce a variable for the expression $parents/*,
thus:

  <xsl:template name="subtree">
    <xsl:param name="parents"/>
    <xsl:variable name="children" select="$parents/*"/>
    <xsl:for-each select="$children">
      <xsl:variable name="current-group"
          select="$children[name() = name(current())]"/>
      <xsl:if test="count($current-group[1] | .) = 1">

The surprising thing in this stylesheet is that the number of recursive
calls is not actually very high - it's the same as the number of elements
output to the result tree, which doesn't increase greatly as the input file
grows. But the size of $parents is linear with the source file size, which
means that calculating $parents/* within the for-each loop is really bad
news. I imagine that your rearrangement of the code somehow enabled MSXML to
spot that it could move the subexpression out of the for-each loop; why it
should do so on the rearranged code and not on the original is a mystery.
Saxon is currently doing this kind of optimization at the XPath (and XQuery)
level but not at the XSLT level.

Michael Kay

# -----Original Message-----
# From: owner-xsl-list@xxxxxxxxxxxxxxxxxxxxxx 
# [mailto:owner-xsl-list@xxxxxxxxxxxxxxxxxxxxxx] On Behalf Of Ian Young
# Sent: 18 March 2004 22:30
# To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
# Subject:  Efficient recursive grouping in XSLT 1.0?
# 
# I have some fairly sizeable XML files (about 5MB - 15MB) that 
# I want to extract the element name structure from. By that, I 
# mean that I'll feed an arbitrary XML document in one end, and 
# get out an XML document that contains one element for every 
# distinct list of name() values in the ancestor-or-self axis, 
# retains the position in document order of the first 
# occurrence. I'd also like a count for each element output.
# For the moment, I'm only interested in the elements: 
# attributes and text content are not important.
# 
# For example, if I have this input document:
# <root>
#   <a>
#     <b/><c/><d/><b/>
#   </a>
#   <b>
#     <a>
#       <a/>
#     </a>
#   </b>
#   <a>
#     <a/>
#     <b>
#       <a/>
#       <d/>
#       <a/>
#       <a/>
#     </b>
#   </a>
# </root>
# 
# I would want output that looks like this:
# <root ct="1">
#   <a ct="2">
#     <b ct="3">
#       <a ct="3"/>
#       <d ct="1"/>
#     </b>
#     <c ct="1"/>
#     <d ct="1"/>
#     <a ct="1"/>
#   </a>
#   <b ct="1">
#     <a ct="1">
#       <a ct="1"/>
#     </a>
#   </b>
# </root>
# 
# In XSLT 2.0 this can be readily achieved by using 
# for-each-group recursively. With Saxon 7.9 this stylesheet 
# will generate a result for one of my 7MB files (400k 
# elements) in 2 or 3 seconds (mostly parsing the input
# document) to produce the output of about 1k elements.
# 
# <xsl:stylesheet version="2.0"
#     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
#   <xsl:output method="xml" version="1.0" encoding="UTF-8" 
# indent="yes"/>
# 
#   <xsl:template match="/">
#     <xsl:call-template name="subtree">
#       <xsl:with-param name="parents" select="/"/>
#     </xsl:call-template>
#   </xsl:template>
#   
#   <xsl:template name="subtree">
#     <xsl:param name="parents"/>
#     <xsl:for-each-group select="$parents/*" group-by="name()">
#       <xsl:copy>
#         <xsl:attribute name="ct">
#           <xsl:value-of select="count(current-group())"/>
#         </xsl:attribute>
#         <xsl:call-template name="subtree">
#           <xsl:with-param name="parents" select="current-group()"/>
#         </xsl:call-template>
#       </xsl:copy>
#     </xsl:for-each-group>
#   </xsl:template>
# </xsl:stylesheet>
# 
# This is great, except that the system I want this to run in 
# is using libxml and libxslt (consequently, XSLT 1.0).
# Now, transforming this into an XSLT 1.0 stylesheet can be 
# done by just replacing the for-each-group with a for-each and 
# checking if the context node is the first in $parents/* with 
# that name:
# 
# <xsl:stylesheet version="1.0"
#     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
#   <xsl:output method="xml" version="1.0" encoding="UTF-8" 
# indent="yes"/>
# 
#   <xsl:template match="/">
#     <xsl:call-template name="subtree">
#       <xsl:with-param name="parents" select="/"/>
#     </xsl:call-template>
#   </xsl:template>
#   
#   <xsl:template name="subtree">
#     <xsl:param name="parents"/>
#     <xsl:for-each select="$parents/*">
#       <xsl:variable name="current-group"
#           select="$parents/*[name() = name(current())]"/>
#       <xsl:if test="count($current-group[1] | .) = 1">
#         <xsl:copy>
#           <xsl:attribute name="ct">
#             <xsl:value-of select="count($current-group)"/>
#           </xsl:attribute>
#           <xsl:call-template name="subtree">
#             <xsl:with-param name="parents" select="$current-group"/>
#           </xsl:call-template>
#         </xsl:copy>
#       </xsl:if>
#     </xsl:for-each>
#   </xsl:template>
# </xsl:stylesheet>
# 
# But the performance of this on large files is hopeless. I 
# tried a smaller 1.3MB, 75k elements input file that generates 
# a 5kB, 125 element output. msxml took 2 minutes (memory usage 
# displaying odd sawtooth pattern
# -- GC?), Saxon 7.9 over 5 minutes and libxslt 1.0.4 I gave up 
# on after 10 minutes.
# 
# [Aside: I wondered whether it might be better to expanded 
# $current-group in the test and only created the variable 
# within the if element. The rationale being that the variable 
# might be created strictly whereas the combined xpath 
# expression might only be evaluated up to the first element.
# 
#   ...
#   <xsl:if test="count(($parents/*[name() = 
# name(current())])[1] | .) = 1">
#     <xsl:variable name="current-group"
#         select="$parents/*[name() = name(current())]"/>
#     ...
# 
# For msxsl makes an enormous difference: it's now faster than 
# Saxon 7 running the XSLT 2.0 stylesheet -- that's for the 7MB 
# file!  How does _that_ work?! It doesn't make any great 
# difference to Saxon (or libxslt?).]
# 
# Am I right that there's no way of using Muenchian grouping 
# for this because there's no possible XPath 1.0 expression for 
# the list of ancestors' node names?
# 
# Are there any other approaches with XSLT 1 that might get a 
# decent performance with libxslt?
# 
# One thing I thought of is to do the transform in two steps: 
# the first generates output isomorphic to the input elements, 
# but annotates each with its ancestor name list (as a string); 
# the second traverses that file using Muenchian grouping. This 
# works -- albeit with rather variable performance
# -- in msxml. For the 1.3MB file step 2 takes 71 seconds in 
# Saxon and 75 in libxslt (and 1 second in msxml). This confuses me!
# 
# TODO: Saxon gets java.lang.OutOfMemoryError on the second 
# stage reading the output of stage 1 for larger files. How do 
# I increase Java's memory limit?
# 
# <!-- step 1 -->
# <xsl:stylesheet version="1.0"
#     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
#   <xsl:output method="xml" version="1.0" encoding="UTF-8" 
# indent="yes"/>
# 
#   <xsl:template match="*">
#     <xsl:param name="parent-path" select="''"/>
#     <xsl:variable name="p" select="concat($parent-path,'/',name())"/>
#     <xsl:copy>
#       <xsl:attribute name="p">
#         <xsl:value-of select="$p"/>
#       </xsl:attribute>
#       <xsl:apply-templates>
#         <xsl:with-param name="parent-path" select="$p"/>
#       </xsl:apply-templates>
#     </xsl:copy>
#   </xsl:template>
# 
#   <xsl:template match="text()"/>
# </xsl:stylesheet>
# 
# <!-- step 2 -->
# <xsl:stylesheet version="1.0"
#     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
#   <xsl:output method="xml" version="1.0" encoding="UTF-8" 
# indent="yes"/>
# 
#   <xsl:key name="p" match="*" use="@p"/>
# 
#   <xsl:template match="*">
#     <xsl:if test="generate-id() = generate-id(key('p', @p)[1])">
#       <xsl:copy>
#         <xsl:attribute name="ct">
#           <xsl:value-of select="count(key('p', @p))"/>
#         </xsl:attribute>
#         <xsl:for-each select="key('p', @p)">
#           <xsl:apply-templates/>
#         </xsl:for-each>
#       </xsl:copy>
#     </xsl:if>
#   </xsl:template>
# 
#   <xsl:template match="text()"/>
# </xsl:stylesheet>
# 
# Any comments, pointers, code suggestions gratefully received!
# 
# Cheers,
# 
# Ian.
# 
#  XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
# 
# 


 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list


Current Thread

PURCHASE STYLUS STUDIO ONLINE TODAY!

Purchasing Stylus Studio from our online shop is Easy, Secure and Value Priced!

Buy Stylus Studio Now

Download The World's Best XML IDE!

Accelerate XML development with our award-winning XML IDE - Download a free trial today!

Don't miss another message! Subscribe to this list today.
Email
First Name
Last Name
Company
Subscribe in XML format
RSS 2.0
Atom 0.3
Site Map | Privacy Policy | Terms of Use | Trademarks
Free Stylus Studio XML Training:
W3C Member
Stylus Studio® and DataDirect XQuery ™are products from DataDirect Technologies, is a registered trademark of Progress Software Corporation, in the U.S. and other countries. © 2004-2013 All Rights Reserved.