
Re: Re: XML/XHTML fragment to text

Subject: Re: Re: XML/XHTML fragment to text
From: Abel Braaksma <abel.online@xxxxxxxxx>
Date: Thu, 16 Aug 2007 06:39:14 +0200
Hi Alain,

You find yourself in a typical legacy entanglement. This is the kind of trouble that legacy systems cause, and it costs companies a fortune in time and materials.

See my comments below.

Cheers,
-- Abel

Alain wrote:

Personally I would prefer Saxon: XSLT2.0 make things so much easier.

indeed.



But at work, the only thing that has been authorized for now is Xalan-C. It is running in batches (jobs) on AIX machines. The reason why they are not considering another transformation engine, at the moment, is performance. Even for a small transformation, if you run Saxon or Xalan-J, you will have to set up and run a JVM in your Unix batch. Launching the JVM has a cost in memory and time. And even if you don't count the JVM cost, Saxon is Java code, so it has to pay the Java overhead compared to code written in C++... Although Saxon may perform faster on some specific templates where it has better optimisations, on an "average" template it will still be slower because it's Java versus C++.

You are mixing things up a bit. If you want your apps to run at dazzling speed, you should code in C++, or ASM for that matter. But that's not what you are doing. You are using XSLT, and that is an interpreted language. In terms of speed, Saxon-J runs much faster than Xalan-C. It might be that Xalan-J runs a bit slower than Xalan-C, but only marginally so (and if the difference is more than marginal, then the port has been done badly).


Yes, starting the JVM has a cost. If you have many small batches, then that's a problem. If they are large, then it is negligible. But it is easy to work around: keep the JVM resident in memory and you are done.

But all of this is moot, of course, if "authorization" by the AIX team is an issue. If you can use any XSLT 2.0 processor, it is likely that your speed increases by an order of magnitude (I'm not talking percentages, I am talking factors). The reason I dare say that is that you seem to use many recursive templates that are called over and over. If you want me to help you port it (once you've convinced the team that using a JVM on AIX for XSLT will increase the batches' speed by an order of magnitude), you can contact me off-list.

The goal is to be able to handle a customer base of 5 million, so we have
to count every second in our batch process.

Just for comparison: I've done a job for KPN (the largest phone company in Holland) that sends 8 million invoices each month in 14 batches. Each batch processes between 2 and 4 GB of data. Using XSLT 1.0 this was a nightmare, with a batch taking up to 14 hours. Using XSLT 2.0 this has become a breeze, and a batch runs in about one to two hours (there's more to it than only this, of course: another process creates the AFP files for the printer, and PDF is output for WORM tape, all at the same time).


If you have to code for speed, there's no other option than to switch to XSLT 2.0 and the JVM.


So they are definitely running a JVM inside the main batch,

So, what are you waiting for? Let it run Saxon as well ;)



substring(concat($myString,$padding),1,$N) to pad it correctly

In XSLT 2.0 you can do:


$myString, for $i in 1 to $FieldLen - string-length($myString) return ' '

(the comma is intentional) or anything similar. But you are right, the concat-trick is just as easy.
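
For reference, the concat-trick could be wrapped in an XSLT 1.0 named template along these lines. This is only a sketch: the template name pad-right, its parameter names and the $padding pool are made up for illustration, and it counts characters, not bytes.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Hypothetical pool of spaces; it must be at least as long
       as the widest field you need to pad. -->
  <xsl:variable name="padding"
      select="'                                        '"/>

  <!-- Right-pads $string with spaces up to $width characters.
       Note: a value longer than $width is silently truncated. -->
  <xsl:template name="pad-right">
    <xsl:param name="string"/>
    <xsl:param name="width"/>
    <xsl:value-of select="substring(concat($string, $padding), 1, $width)"/>
  </xsl:template>

  <!-- Usage: pads 'abc' to 10 characters and appends a bar
       so that the trailing spaces are visible. -->
  <xsl:template match="/">
    <xsl:call-template name="pad-right">
      <xsl:with-param name="string" select="'abc'"/>
      <xsl:with-param name="width" select="10"/>
    </xsl:call-template>
    <xsl:text>|</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

The truncation is a side effect of substring(); whether that is wanted depends on how your fixed-length format treats over-long values.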

I think I saw a padding function in EXSLT, but it doesn't seem to have been made standard in 2.0

Indeed, it is not.


Or we could probably write (or buy) "generic" patterns to transform to fix-length.

I have them on the shelf, I use them regularly. If you are interested.... ;)



The last bit of headache is the "UTF-8" problem ! Because fixed-length is fixed-length in *bytes*.

Aha, of course. The eternal legacy problem: back in the 70s they didn't think internationally yet...



For that, with XSLT 1.0, I agree with you, I had to build insane recursive templates to calculate the length in bytes of a UTF-8 string.

This is practically impossible because you don't know exactly how the serializer will serialize, i.e., when it will write &lt; and when <. Furthermore, a single character can be represented in more than one way in Unicode (precomposed or decomposed), and each form has a different length in UTF-8. In XSLT 2.0 you can control this with the normalization-form attribute of xsl:output; in XSLT 1.0 you cannot, and I haven't found a note on how to treat it.


If you have XSLT 1.0 and you want to know the exact size in bytes, use UTF-32: every character then takes exactly four bytes, so you can be (almost) certain of the correct length (apart from escapes such as &lt; and &quot;). The drawback is the almost 4-fold increase in size (you can use UTF-16 instead if all you need are characters from the Basic Multilingual Plane, plane 0).
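
The arithmetic behind the UTF-32 suggestion could be sketched as follows. The template name is hypothetical, and the result only holds under some assumptions: the processor counts Unicode characters in string-length() (as XPath 1.0 specifies), the serializer writes no byte order mark, and no character in the string is escaped on output.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Hypothetical helper: size of $string in bytes when written as UTF-32.
       Assumes four bytes per character, no BOM and no escaping. -->
  <xsl:template name="utf32-byte-length">
    <xsl:param name="string"/>
    <xsl:value-of select="4 * string-length($string)"/>
  </xsl:template>

  <!-- Usage: reports the UTF-32 size of the document element's string value. -->
  <xsl:template match="/">
    <xsl:call-template name="utf32-byte-length">
      <xsl:with-param name="string" select="string(*)"/>
    </xsl:call-template>
  </xsl:template>

</xsl:stylesheet>
```

For UTF-16 restricted to plane 0, the same template with a factor of 2 instead of 4 would apply.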

[...]
or is there a function I didn't notice that can return a string length in
bytes and not in chars ?

Yes and no. But there's a simple trick. And this will solve your problems 100%, I believe, as long as you can persuade your bosses to move to Saxon, because that's the only processor I found that can do it correctly. Forget serializing + reading back as unparsed-text, use this instead:


<xsl:output name="output-def" encoding="UTF-8" normalization-form="NFD" omit-xml-declaration="yes" />

<snip ... />

<xsl:variable name="serialized" select="saxon:serialize($my-result-tree, 'output-def')" />
<xsl:variable name="hexBin" select="saxon:string-to-hexBinary($serialized, 'UTF-8')" />
<xsl:variable name="length" select="string-length(xs:string($hexBin)) div 2" />


I tested it, and it works so well that it even returns different lengths when you choose different normalization forms (i.e., composed versus decomposed forms can give quite different results). It also correctly counts &lt; as 4 bytes when it is part of a text node or an attribute. It *does not* correctly honour cdata-section-elements on the xsl:output definition, but that's only a minor inconvenience (and an insignificant little bug in Saxon). It does correctly honour omit-xml-declaration yes/no.
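
To see the effect of the normalization form in isolation, a small sketch along these lines could be used. It assumes a Saxon version that provides saxon:string-to-hexBinary() with the (string, encoding) signature used above; the template name main is made up, and normalize-unicode() is the standard XPath 2.0 function. By the Unicode rules, é (U+00E9) takes 2 bytes in UTF-8 when composed (NFC) and 3 bytes when decomposed (NFD: e followed by U+0301).

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:saxon="http://saxon.sf.net/"
    exclude-result-prefixes="xs saxon">

  <xsl:output method="text"/>

  <!-- Hypothetical entry point; start the transformation at this named template. -->
  <xsl:template name="main">
    <xsl:variable name="e-acute" select="'&#xE9;'"/>

    <!-- Byte length of the composed form: hex digits divided by two. -->
    <xsl:text>NFC: </xsl:text>
    <xsl:value-of select="string-length(xs:string(
        saxon:string-to-hexBinary(normalize-unicode($e-acute, 'NFC'), 'UTF-8'))) idiv 2"/>

    <!-- Byte length of the decomposed form of the same character. -->
    <xsl:text> NFD: </xsl:text>
    <xsl:value-of select="string-length(xs:string(
        saxon:string-to-hexBinary(normalize-unicode($e-acute, 'NFD'), 'UTF-8'))) idiv 2"/>
  </xsl:template>

</xsl:stylesheet>
```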

You must take care that the selected encodings match. If they don't, the result of the string-to-hexBinary function will be misleading (logically so).

All in all, this is by far the easiest way to calculate the length of a node in bytes. You can then put the resulting string into your fixed-length system as you want:

<xsl:function name="f:padding" as="xs:string">
    <xsl:param name="string" as="xs:string" />
    <xsl:param name="width" as="xs:integer" />
    <xsl:value-of select="$string, for $i in 1 to $width - string-length($string) return ' ' " separator="" />
</xsl:function>


<snip ... />

<xsl:sequence select="f:padding($columnData1, 20)" />
<xsl:sequence select="f:padding($columnData2, 4)" />
<xsl:sequence select="f:padding($serialized,4096)" />
<xsl:sequence select="f:padding($columnData3, 400)" />
<xsl:sequence select="f:padding($columnData4, 2)" />
<xsl:sequence select="f:padding($columnData5, 12)" />
..... etc
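
Note that f:padding counts characters. If the field widths are in bytes, a byte-aware variant could look roughly like the sketch below. The names f:byte-length and f:padding-bytes, the namespace URI bound to f and the template name main are all made up for illustration, and the sketch assumes the same saxon:string-to-hexBinary() extension and UTF-8 output as above. It does not truncate values that are already too long.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:saxon="http://saxon.sf.net/"
    xmlns:f="urn:example:functions"
    exclude-result-prefixes="xs saxon f">

  <xsl:output method="text" encoding="UTF-8"/>

  <!-- Length of $string in bytes when encoded as UTF-8:
       two hex digits per byte, so halve the hexBinary string length. -->
  <xsl:function name="f:byte-length" as="xs:integer">
    <xsl:param name="string" as="xs:string"/>
    <xsl:sequence select="string-length(xs:string(
        saxon:string-to-hexBinary($string, 'UTF-8'))) idiv 2"/>
  </xsl:function>

  <!-- Right-pads $string with spaces (one byte each in UTF-8)
       until it occupies $width bytes; longer values pass through unchanged. -->
  <xsl:function name="f:padding-bytes" as="xs:string">
    <xsl:param name="string" as="xs:string"/>
    <xsl:param name="width" as="xs:integer"/>
    <xsl:sequence select="string-join(
        ($string, for $i in 1 to $width - f:byte-length($string) return ' '), '')"/>
  </xsl:function>

  <!-- Usage: 'café' is 4 characters but 5 bytes, so 3 spaces are added
       to fill an 8-byte field. -->
  <xsl:template name="main">
    <xsl:value-of select="f:padding-bytes('caf&#xE9;', 8)"/>
    <xsl:text>|</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```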

Convinced that things *can* be easier in XSLT 2.0?
And I have only shown you a few XSLT 2.0-specific things. Your major gain from switching to Saxon is that you can use the saxon:serialize() function. Without it, it will be quite hard to guarantee that your recursive templates are correct (I think it would not be hard to prove them incorrect, unless you really rewrite your processor's serialization algorithm in XSLT 1.0).

You came to the same conclusion, your advice being to separate the variable part (e.g. HTML) into a temporary file, even if your templates are smarter, and to put every piece together again.

See above: with the right tools for the job, you will no longer need these hard-to-maintain solutions.


But as I'm on holidays now, I'll have to check the project
status when I'm back in September !

Enjoy your holidays!


Cheers,
-- Abel Braaksma
