
Re: Re: XML/XHTML fragment to text

Subject: Re: Re: XML/XHTML fragment to text
From: Abel Braaksma <abel.online@xxxxxxxxx>
Date: Thu, 16 Aug 2007 16:34:17 +0200
A couple of corrections to my previous statements about calculating the size of the output byte stream; see below:


Abel Braaksma wrote:

It also correctly gives &lt; as 4 characters when it is part of a text node or an attribute.

I meant: 4 bytes.


It *does not* correctly interpret cdata-section-elements on the xsl:output definition, but that's only a minor inconvenience (and an insignificant little bug in Saxon)

My mistake. I was trapped (again!) in the html-start-element-defaults-to-html-output-method atrocity of XSLT (not corrected in XSLT 2.0, probably for backward compatibility). This led to the cdata-section-elements attribute being ignored. Changing the method to "xml" or "xhtml" fixed the problem (as does changing the start element into something not <html>).


it does correctly interpret the omit-xml-declaration yes/no.

And it correctly interprets all other kinds of stuff of the xsl:output. I.e., I tested the following attributes that can alter the serialization:


* cdata-section-elements
* doctype-public
* doctype-system
* encoding
* exclude-result-prefixes
* include-content-type
* indent
* media-type
* method
* normalization-form
* omit-xml-declaration
* standalone
* use-character-maps
* disable-output-escaping


I did not test the following though:


* escape-uri-attributes
* extension-element-prefixes (does not influence the outcome when used on xsl:output)
* undeclare-prefixes (only for XML 1.1 anyway)
* use-when (does not influence output, but switches the instruction on/off)
* version
* xml:space
* xsl:version (might influence the way things are escaped)


One attribute on xsl:output always causes problems, as far as I could tell:

* byte-order-mark

When you use it together with UTF-8 it will offset the amount by one. This is because the byte order mark (U+FEFF), when interpreted as a string, is translated into its equivalent representation in UTF-8, the byte sequence 0xEFBBBF, which now represents the codepoint 65279 (U+FEFF, Zero Width No-Break Space, deprecated but allowed). This interpretation is in lieu of the Unicode recommendation. It is useless to put a BOM at the beginning of a UTF-8 stream, so it is best to avoid it.
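To make the BOM arithmetic concrete, here is a small Python sketch (Python standing in for the Saxon functions, so none of this is Saxon API). It assumes the byte count is taken the same way as in the stylesheet, i.e. hex digits divided by two; the `<p/>` sample and the helper name are hypothetical. It shows that a leading U+FEFF adds one character to the string but three bytes to its UTF-8 form.

```python
# U+FEFF as a character, and what it becomes when encoded as UTF-8.
BOM = "\ufeff"

codepoint = ord(BOM)                 # decimal value of U+FEFF
utf8_bytes = BOM.encode("utf-8")     # the UTF-8 form of that one character
hex_form = utf8_bytes.hex().upper()  # hex string, two digits per byte


def utf8_byte_count(text: str) -> int:
    """Count bytes the way the hexBinary trick does: hex digits / 2."""
    return len(text.encode("utf-8").hex()) // 2


without_bom = "<p/>"                 # hypothetical sample fragment
with_bom = BOM + without_bom

extra_chars = len(with_bom) - len(without_bom)
extra_bytes = utf8_byte_count(with_bom) - utf8_byte_count(without_bom)

print(codepoint, hex_form, extra_chars, extra_bytes)
```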


You must be careful that the selected encodings match. If they don't, the string-to-hexBinary function will prove misleading (logically so).

This was incorrect. The serialized string itself will be radically different when, for instance, the output encoding is US-ASCII, but anything encoded in US-ASCII will always have the same representation in string-to-hexBinary if you use any of the non-IBM encodings, including UTF-8. In UTF-16 the size will double, of course.
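A quick way to convince yourself of this, sketched in Python rather than XSLT. The sample text is hypothetical, and I am assuming "IBM encodings" means the EBCDIC code pages, represented here by Python's cp037 codec.

```python
text = "<p>resume</p>"  # hypothetical sample that is pure US-ASCII

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("iso-8859-1")
utf16_bytes = text.encode("utf-16-be")  # BE variant, so the codec adds no BOM
ebcdic_bytes = text.encode("cp037")     # an IBM EBCDIC code page

# Identical bytes in the ASCII-compatible encodings.
same_in_ascii_supersets = ascii_bytes == utf8_bytes == latin1_bytes
# Two bytes per character in UTF-16 for this text.
doubles_in_utf16 = len(utf16_bytes) == 2 * len(ascii_bytes)
# Same length but different byte values in EBCDIC.
differs_in_ebcdic = ebcdic_bytes != ascii_bytes

print(same_in_ascii_supersets, doubles_in_utf16, differs_in_ebcdic)
```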


Consider the following (extreme) example:
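<!-- The saxon and xs prefixes are assumed to be declared on xsl:stylesheet. -->
<xsl:output name="output-def" method="xml" encoding="US-ASCII" cdata-section-elements="p" />

<xsl:template name="main">
  <!-- Temporary tree holding the fragment to measure. -->
  <xsl:variable name="result-tree"><p>resumé's</p></xsl:variable>
  <!-- Serialize it using the named output definition above. -->
  <xsl:variable name="serialized" select="saxon:serialize($result-tree, 'output-def')" />
  <!-- Hex-encode the UTF-8 bytes of the serialized string: two hex digits per byte. -->
  <xsl:variable name="hexBin" select="saxon:string-to-hexBinary($serialized, 'UTF-8')" />
  <xsl:variable name="length" select="string-length(xs:string($hexBin)) div 2" />

  ....
</xsl:template>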

Normally, the output in $serialized would look like the following, preceded by a byte order mark (U+FEFF):

<p><![CDATA[resumé's]]></p>

But, because US-ASCII cannot represent the é character, the serializer must move it out of the CDATA section, with the following result:

<p><![CDATA[resum]]>&#233;<![CDATA['s]]></p>
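The splitting behaviour can be imitated with a few lines of Python. This is only a rough sketch of the idea, not Saxon's serializer: the function name is made up, and it ignores other things a real serializer has to handle (a "]]>" inside the text, for instance).

```python
def serialize_cdata(text: str, encoding: str) -> str:
    """Wrap text in CDATA, moving unencodable characters out as &#N; references."""
    parts = []
    run = []  # current run of characters the target encoding can represent

    def flush() -> None:
        # Close the current CDATA section, if it holds anything.
        if run:
            parts.append("<![CDATA[" + "".join(run) + "]]>")
            run.clear()

    for ch in text:
        try:
            ch.encode(encoding)
            run.append(ch)
        except UnicodeEncodeError:
            # A character reference is not recognized inside CDATA,
            # so it has to be written between two sections.
            flush()
            parts.append("&#%d;" % ord(ch))
    flush()
    return "".join(parts)


ascii_out = "<p>" + serialize_cdata("resumé's", "ascii") + "</p>"
utf8_out = "<p>" + serialize_cdata("resumé's", "utf-8") + "</p>"
print(ascii_out)
print(utf8_out)
```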

Obviously, the lengths are quite different. The string size of the first is 28 and that of the second is 44. The UTF-8 byte sequence of the first is 28 (because of the interpretation of é) and 44 in the second (because US-ASCII is 100% compatible with the 1-byte sequences of UTF-8).
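If you want to check such counts outside the stylesheet, a Python sketch along these lines mirrors the hexBinary approach (encode, hex-encode, divide the number of hex digits by two). The helper name is hypothetical. Whether a leading U+FEFF is part of the first string changes both of its counts, so that case is measured separately.

```python
def char_and_byte_length(serialized: str, encoding: str = "utf-8"):
    """Return (number of characters, number of bytes in the given encoding)."""
    hex_digits = serialized.encode(encoding).hex()  # two hex digits per byte
    return len(serialized), len(hex_digits) // 2


plain = "<p><![CDATA[resumé's]]></p>"
split = "<p><![CDATA[resum]]>&#233;<![CDATA['s]]></p>"

plain_counts = char_and_byte_length(plain)
bom_counts = char_and_byte_length("\ufeff" + plain)  # same string with a leading BOM
split_counts = char_and_byte_length(split)

for label, counts in (("plain", plain_counts),
                      ("with BOM", bom_counts),
                      ("split", split_counts)):
    print(label, counts)
```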

Needless to say, apart from this extreme case, it will be very hard to find out all the other ways the serializer may use to output a conformant byte stream. I must admit that I've found this approach very refreshing, and with this Saxon-specific extension it comes pretty close to finding the exact byte length of the document (or segment) *after* serialization (including white space, indentation, escaping etc).

Thanks for the exercise ;)

Cheers,
-- Abel Braaksma
