[XSL-LIST Mailing List Archive Home] [By Thread] [By Date] [Recent Entries] [Reply To This Message] Sorting substitution instructions by max. length of m
(Though not absolutely trivial, the problem described below seems to be
common enough that its XSLT 2 solution should have been discussed
somewhere already, but I didn't find anything like this anywhere. Do you
know of such a resource?)
Given a sequence of substitution instructions in $substitution_instructions, each represented as a node <substitution> <old>regex</old> <new>replacement</new> </substitution> specifying the regex pattern to search for and the match's replacement, the task is to sort the substitution instructions by the length of their longest match in the input text held in $input. The aim, of course, is to be able to perform multiple substitutions in the same input while replacing longer matches first, in order to avoid the classical matching prefix problem. (The actual replacing procedure would be a topic of its own; just assume that the replacement inserted into the input text will be safe from being changed itself again. So even if the input text cannot be considered constant across the substitutions, their proper order of application should be independent of the replacements made, as far as I can see, because if the maximum match length for some regex might decrease, it will never increase when some of its matches disappear through the replacing of overlapping matches. Correct me if I'm being naive here.) I have come up with a solution based on these steps: 1) Construct a new sequence of substitution instructions that include the maximum match length for each of them, removing all substitution instructions with no match. 2) Sort the substitution instructions first by maximum match length, then alphabetically (i.e. as strings according to the implementation-defined collation) by their regexes. (This second sort key component just serves to produce predictable behavior in case of equal lengths.) An implementation of this approach is demonstrated by the XSLT 2 stylesheet below (to be passed to Saxon with "-it main"). My question is if you have any suggestions for making the code simpler or more elegant, or if you would recommend taking another way. Yves
<!-- using dashes in function names, underscores in variable names --> <xsl:function name="my:annex-max-match-length-to-instruction" as="node()*"> <xsl:param name="inputtext" as="xs:string"/> <xsl:param name="substitution_instruction" as="node()"/> <xsl:variable name="regex" select="string($substitution_instruction/old)"/> <xsl:variable name="max_match_length" select="max(my:lengths-of-remaining-matches($inputtext, $regex))"/> <!-- Add max. match length to subst. instruction, drop those without matches --> <xsl:if test="$max_match_length ne 0"> <xsl:element name="substitution"> <xsl:element name="max_match_length"> <xsl:sequence select="$max_match_length"/> </xsl:element> <xsl:copy-of select="$substitution_instruction/old"/> <xsl:copy-of select="$substitution_instruction/new"/> </xsl:element> </xsl:if> </xsl:function> <xsl:function name="my:lengths-of-remaining-matches" as="xs:integer*"> <xsl:param name="inputtext" as="xs:string*"/> <xsl:param name="regex" as="xs:string"/> <xsl:variable name="length_of_next_match" select="if (not (matches($inputtext, $regex))) then 0 else string-length(replace($inputtext, concat('^.*?(', $regex, ').*'), '$1'))"/> <!-- Return 0 if no match (zero-length regexes can't occur) --> <xsl:sequence select="$length_of_next_match"/> <!-- Examine further matches by recursive call --> <xsl:if test="$length_of_next_match ne 0"> <xsl:sequence select="my:lengths-of-remaining-matches( replace($inputtext, concat('^.*?', $regex), ''), $regex)"/> </xsl:if> </xsl:function> <xsl:template name="main"> <!-- sample data --> <xsl:variable name="input" select="'abcddddxxxxxxxyyyabcxxxxxabcdxabc'"/> <!-- sample data --> <xsl:variable name="substitution_instructions"> <xsl:element name="substitution"> <xsl:element name="old">[a-e]+</xsl:element> <xsl:element name="new">***</xsl:element> </xsl:element> <xsl:element name="substitution"> <xsl:element name="old">d</xsl:element> <xsl:element name="new">#</xsl:element> </xsl:element> <xsl:element name="substitution"> <xsl:element name="old">123</xsl:element> <xsl:element name="new">...</xsl:element> </xsl:element> <xsl:element name="substitution"> <xsl:element name="old">c+</xsl:element> <xsl:element name="new">#</xsl:element> </xsl:element> <xsl:element name="substitution"> <xsl:element name="old">x+y*</xsl:element> <xsl:element name="new">+++</xsl:element> </xsl:element> </xsl:variable> <xsl:variable name="substitution_instructions_sorted"> <xsl:perform-sort> <xsl:sort select="max_match_length" data-type="number" order="descending"/> <xsl:sort select="old" order="ascending"/> <xsl:for-each select="$substitution_instructions/substitution"> <xsl:sequence select="my:annex-max-match-length-to-instruction($input, .)"/> </xsl:for-each> </xsl:perform-sort> </xsl:variable> <xsl:message> <xsl:value-of select="concat('input: ', $input, ' ')"/> <xsl:value-of select="'======================================== '"/> <xsl:for-each select="$substitution_instructions_sorted/substitution"> <xsl:value-of select="concat('regex: ', old, ' ')"/> <xsl:value-of select="concat('max. match length: ', max_match_length, ' ')"/> <xsl:value-of select="'---------------------------------------- '"/> </xsl:for-each> </xsl:message> </xsl:template> </xsl:stylesheet>
|
PURCHASE STYLUS STUDIO ONLINE TODAY!Purchasing Stylus Studio from our online shop is Easy, Secure and Value Priced! Download The World's Best XML IDE!Accelerate XML development with our award-winning XML IDE - Download a free trial today! Subscribe in XML format
|