Re: Regex-Enabled XSLT is Possible -- Preliminary Resu
Hi Michael,
thanks for your interest. Yes, I am moving the whole approach over to XSLT/XPath and I'm almost done. The great thing is that you do in fact allow variables in xsl:template/@match patterns now. With that and the java.util.regex features that are standard in JDK 1.4, I could get rid of my regex wrapper around ORO-matcher entirely.

I still do not see a good way to fit the XSLT analyze-string and the XPath regex functions into my scheme of things, and I'll think about this some more. So far I still believe I want to use a little stateful matcher, which java.util.regex.Matcher fortunately gives me.

I pair up match patterns and templates one for one. No two templates use the same matcher object -- that way I believe I'm safe w/r/t side-effects and parallelism. Here is a little example that parses email headers:

<xsl:variable name="header-pattern"
    select="'^([fF]rom|[tT]o|[cC]c|[sS]ubject): (.*)\n'"/>

<xsl:variable name="header-matcher"
    xmlns:p="java:java.util.regex.Pattern"
    select="p:matcher(p:compile($header-pattern),'')"/>

<xsl:template xmlns:m="java:java.util.regex.Matcher"
    match="text()[m:looking-at(m:reset($header-matcher,.))]">
  <xsl:element name="{lower-case(m:group($header-matcher, 1))}"
      namespace="">
    <xsl:value-of select="m:group($header-matcher, 2)"/>
  </xsl:element>
  <xsl:variable name="rest">
    <xsl:value-of select="substring(.,m:end($header-matcher)+1)"/>
  </xsl:variable>
  <xsl:apply-templates select="$rest/text()"/>
</xsl:template>

<xsl:template match="text()">
  <rest>
    <xsl:value-of select="."/>
  </rest>
</xsl:template>

It only uses one pattern-template pair. I have a few diffs to SAXON that I'll send to you under separate cover to make this possible. Basically they add Java's new CharSequence to the set of Java types considered for conversion. It's easy, but it raises some opportunities for performance improvement with string and text handling in general. That ties in with your issue with text nodes:

> Interesting approach. Generally, creating nodes is expensive. It also
> requires a lot of specification work to sort out the detail, e.g. what
> is the parent of the node, what is its base URI, do you get a new text
> node each time or can the system reuse them? I think a mechanism based
> on strings (like xsl:analyze-string) is more flexible than one based
> on text nodes.

I share your concern. I am not comfortable with the amount of string garbage that my method probably produces right now. But that could be helped with some screwing under the hood :-) Here are some ideas:

- the first reason why I construct text nodes is that I can't
  xsl:apply-templates on a string or other atomic data type. Why?
  To me it would make sense to treat apply-templates on an atom as
  implicitly applying to a singleton sequence of that atom.

- the second reason why I construct text nodes is to return an
  unparsed rest from a template (return from a "parse-down")
  or to feed back into the recursion ("parse-along").

Text nodes would not have to be expensive at all, however. Here is where CharSequences come in. Instead of String, one should perhaps use CharSequence throughout. That way you would never copy the string data itself; all you'd do is pass along those little offset-length pairs. So, apart from object creation, this type of string handling would be quite cheap.

So, I'd say that if Saxon underpinned the XPath string and text data types with CharSequence-style offset-length pairs rather than copying java.lang.String data, there would be no big penalty in text node creation, and hence no changes would be necessary to the rules of what can and cannot be given to apply-templates. Of course this assumes that you don't make changes to the string data, such as with some regex replace thing. Well, if you do, then you need a copy-on-write hook to copy out the data block. For parsing, you don't need to modify text at all (I construct new text), so I don't care too much how that's solved.
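
To make the offset-length idea concrete, here is a minimal Java sketch of the same header parsing done over CharSequence views instead of copied strings. It is an illustration only, not the Saxon patch mentioned above: the class name HeaderSlices and the "name=value" output format are made up, and it uses current Java syntax rather than JDK 1.4. It relies on java.nio.CharBuffer, whose subSequence() is documented to share the buffer's content, and on Matcher.reset(CharSequence), lookingAt(), group() and end(), the same calls the stylesheet makes.

```java
import java.nio.CharBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of Saxon or of the stylesheet. */
final class HeaderSlices {
    // Same regular expression as $header-pattern in the stylesheet.
    private static final Pattern HEADER =
        Pattern.compile("^([fF]rom|[tT]o|[cC]c|[sS]ubject): (.*)\n");

    /** Returns "name=value" entries for leading headers, then one "rest=..." entry. */
    static List<String> parse(String message) {
        // A read-only view of the message; subSequence() below yields further
        // views (position/limit pairs) onto the same characters.
        CharSequence rest = CharBuffer.wrap(message);
        Matcher m = HEADER.matcher("");          // one stateful matcher, reused
        List<String> out = new ArrayList<>();
        while (m.reset(rest).lookingAt()) {      // "parse-along": match at the front
            out.add(m.group(1).toLowerCase(Locale.ROOT) + "=" + m.group(2));
            rest = rest.subSequence(m.end(), rest.length()); // advance the view
        }
        out.add("rest=" + rest);                 // unparsed remainder
        return out;
    }

    public static void main(String[] args) {
        for (String entry : parse("From: a@example.org\nSubject: hi\n\nBody text\n")) {
            System.out.println(entry);
        }
    }
}
```

Only the group values and the final remainder are turned into new strings here; each step of the loop just narrows the view, which is the cheap case described above. Anything that rewrites the text would still need its own copy.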

>> b) a mechanism to fail a template and try the next
>
> This is a "could" in the XSLT 2.0 requirements list and we've just
> started reviewing whether to do anything about this, so any use cases
> will be welcome - send them please to public-qt-comments@xxxxxx

... well, now that I'm redoing the whole thing again, it looks like it could work without that. It is good to discuss these things with people.
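
One possible reading of "could work without that": when the regex test sits in the match pattern itself, a rule that does not apply is never selected, and the next candidate gets its turn. The Java sketch below is only an analogue of that idea under this assumption. The class RuleChain, the rule names, and the "blank" pattern are hypothetical; the catch-all plays the role of the plain text() template.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch; rule names and the class itself are made up. */
final class RuleChain {
    // One matcher per rule, kept in insertion order, so no two rules share state.
    private final Map<String, Matcher> rules = new LinkedHashMap<>();

    RuleChain rule(String name, String regex) {
        rules.put(name, Pattern.compile(regex).matcher(""));
        return this;
    }

    /** Name of the first rule matching at the start of the text, else "rest". */
    String classify(CharSequence text) {
        for (Map.Entry<String, Matcher> e : rules.entrySet()) {
            if (e.getValue().reset(text).lookingAt()) {
                return e.getKey();
            }
        }
        return "rest"; // catch-all, like the plain text() template
    }

    public static void main(String[] args) {
        RuleChain chain = new RuleChain()
            .rule("header", "^([fF]rom|[tT]o|[cC]c|[sS]ubject): (.*)\n")
            .rule("blank", "^\n");
        System.out.println(chain.classify("To: someone@example.org\n"));
        System.out.println(chain.classify("\nBody"));
        System.out.println(chain.classify("Body"));
    }
}
```

This covers only the case where the decision can be made before a rule starts producing output; failing from inside a template body after partial work is a different matter and is not modelled here.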

So, thanks to your feedback, Michael, I will be able to boil my thing down and distill the real remaining issues. Finally you say:

> I would also add that general-purpose parsing (like, writing a COBOL
> compiler in XSLT) was not really the application we had in mind. The
> real test is whether the facilities are adequate to analyze the
> structure found in the text of typical data files. I've used them for
> "screen-scraping" data downloaded in HTML and found them quite
> workable, though it needed several passes.

I agree to an extent. Parsing a truly formal language I would do differently. I'm sure XSLT will suit that purpose just fine, but with a formal language I can use pure pushdown automata with a formally specified grammar and a compiler into state-transition-action tables and all that. I have done screen scraping with YACC in the past and found that, while I could make it work, it is nothing that you can hand to people who are more on the IT maintenance level. They need something that makes what they do simpler.

We have dozens of text report types from all sorts of places. Some of these reports change overnight, some are never really structurally controlled. This requires an approach where a "grammar" can be specified fairly simply by people who cannot speak BNF and who could not write even a simple parser themselves. (I'm always surprised how little IT people know about parsers -- which, I might point out, is not a reason to be rude though.)

thanks much,
-Gunther

XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list