Re: Re: XML/XHTML fragment to text

Subject: Re: Re: XML/XHTML fragment to text
From: Abel Braaksma <abel.online@xxxxxxxxx>
Date: Thu, 16 Aug 2007 16:34:17 +0200

A couple of corrections on my previous statements on calculating the size of the output byte stream, see below:

Abel Braaksma wrote:

> It also correctly gives < as 4 characters when it is part of a text node or an attribute.

I meant: 4 bytes.
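As a quick cross-check of the 4-byte figure, here is a minimal sketch in Python (not XSLT, and not what Saxon does internally). It assumes the serializer writes the character as the entity reference `&lt;`, which the standard library's `escape` helper also produces:

```python
from xml.sax.saxutils import escape

# Escape a lone "<" the way an XML serializer would inside a text node.
escaped = escape("<")

# The entity reference consists of ASCII characters only,
# so its UTF-8 byte count equals its character count.
byte_length = len(escaped.encode("utf-8"))
print(escaped, byte_length)
```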

> It *does not* correctly interpret cdata-section-elements on the xsl:output definition, but that's only a minor inconvenience (and an insignificant little bug in Saxon)

My mistake. I was trapped (again!) in the html-start-element-defaults-to-html-output-method atrocity of XSLT (not corrected in XSLT 2.0, probably for backward compatibility). This led to the cdata-section-elements attribute being ignored. Changing the method to "xml" or "xhtml" fixed the problem (as does changing the start element into something not <html>).

> it does correctly interpret the omit-xml-declaration yes/no.

And it correctly interprets all other kinds of stuff of the xsl:output. I.e., I tested the following attributes that can alter the serialization:

* cdata-section-elements
* doctype-public
* doctype-system
* encoding
* exclude-result-prefixes
* include-content-type
* indent
* media-type
* method
* normalization-form
* omit-xml-declaration
* standalone
* use-character-maps
* disable-output-escaping

I did not test the following though:

* escape-uri-attributes
* extension-element-prefixes (does not influence the outcome when used on xsl:output)
* undeclare-prefixes (only for XML 1.1 anyway)
* use-when (does not influence output, but switches the instruction on/off)
* version
* xml:space
* xsl:version (might influence the way things are escaped)

One attribute on xsl:output always causes problems, as far as I could tell:

* byte-order-mark

When you use it together with UTF-8 it will offset the amount by one. This is because the byte order mark (xFEFF), when interpreted as a string, will be translated into the equivalent string representation in UTF-8, which is the byte sequence xEFBBBF, now representing the codepoint 65279 (U+FEFF) (Zero Width No Break Space, deprecated but allowed). This interpretation is in lieu of the Unicode recommendation. It is useless to put a BOM at the beginning of a UTF-8 stream, so it is best to avoid it.
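The relationship between the code point and its UTF-8 bytes can be checked with a small Python sketch. It only illustrates the encoding arithmetic and says nothing about how Saxon handles the byte-order-mark attribute:

```python
BOM = "\ufeff"  # U+FEFF, the byte order mark / zero width no-break space

utf8_bytes = BOM.encode("utf-8")

print(ord(BOM))                    # the code point as a decimal number
print(utf8_bytes.hex().upper())    # the UTF-8 byte sequence in hex
print(len(BOM), len(utf8_bytes))   # character count versus byte count
```

So a BOM that ends up inside the string adds one character to the string length but three bytes to a UTF-8 byte count.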

> You must be careful that the selected encodings match. If they don't, the string-to-hexBinary function will prove misleading (logically so).

This was incorrect. The string will be radically different when, for instance, it is encoded in US-ASCII, and anything encoded in US-ASCII will always have the same representation in string-to-hexBinary if you use any of the non-IBM encodings, including UTF-8. In UTF-16 it will double, of course.
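A rough Python sketch of that claim, using a hypothetical pure-ASCII sample string. It assumes "IBM encodings" refers to EBCDIC code pages (cp037 is used here as one example), and it uses big-endian UTF-16 without a BOM so that the doubling is exact:

```python
text = "resume's"  # hypothetical sample containing only US-ASCII characters

# Hex representation of the same string under several encodings.
ascii_hex = text.encode("ascii").hex().upper()
utf8_hex = text.encode("utf-8").hex().upper()
utf16_hex = text.encode("utf-16-be").hex().upper()  # big-endian, no BOM
ebcdic_hex = text.encode("cp037").hex().upper()     # an EBCDIC code page

print(ascii_hex == utf8_hex)                  # identical bytes
print(len(utf16_hex) == 2 * len(utf8_hex))    # twice as long
print(ebcdic_hex == ascii_hex)                # different byte values
```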

Consider the following (extreme) example:

    <xsl:output name="output-def" method="xml" encoding="US-ASCII" cdata-section-elements="p" />

    <xsl:template name="main">
        <!-- the fragment to measure; the p element triggers cdata-section-elements -->
        <xsl:variable name="result-tree"><p>resumé's</p></xsl:variable>
        <!-- serialize with the named output definition -->
        <xsl:variable name="serialized" select="saxon:serialize($result-tree, 'output-def')" />
        <!-- two hex digits per byte, so the byte length is half the string length -->
        <xsl:variable name="hexBin" select="saxon:string-to-hexBinary($serialized, 'UTF-8')" />
        <xsl:variable name="length" select="string-length(xs:string($hexBin)) div 2" />

        ....
    </xsl:template>
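The last two variables rely on the fact that hexBinary uses two hex digits per byte. A minimal Python sketch of the same calculation follows; the function name `serialized_byte_length` is hypothetical, and the string passed in stands for whatever the serializer produced:

```python
def serialized_byte_length(serialized: str, encoding: str = "utf-8") -> int:
    """Return the byte length of an already serialized string."""
    # Mirror the string-to-hexBinary step: encode, then render as hex digits.
    hex_string = serialized.encode(encoding).hex().upper()
    # Two hex digits represent one byte.
    return len(hex_string) // 2


print(serialized_byte_length("<p>&lt;</p>"))
print(serialized_byte_length("\u00e9"))  # a single e-acute character
```

In Python the detour through hex is unnecessary, since `len()` of the encoded bytes gives the same number; it is kept here only to follow the steps of the stylesheet.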

Normally, the output in $serialized would look like the following:

o;?<p><![CDATA[resumé's]]></p>

But, because of the low encoding chosen, the serializer must remove the é character from the CDATA section, with the following as a result:

<p><![CDATA[resum]]>&#233;<![CDATA['s]]></p>

Obviously, the lengths are quite different. The string size of the first is 28 and the second is 44. The UTF-8 byte sequence of the first is 28 (because of the interpretation of é) and 44 in the second (because US-ASCII is 100% compatible with the 1-byte sequences of UTF-8).
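To see where the 44 comes from, here is a rough Python sketch of the CDATA splitting. It is an illustration under stated assumptions, not Saxon's implementation: the text is assumed to sit inside a p element, the unencodable character is assumed to be written as a decimal character reference, and the helper name `cdata_section` is hypothetical.

```python
def cdata_section(text: str, encoding: str) -> str:
    """Wrap text in CDATA sections, breaking out characters the encoding
    cannot represent as numeric character references.
    Simplification: a literal "]]>" inside the text is not handled."""
    parts = []
    run = []

    def flush() -> None:
        # Close the current run of encodable characters as one CDATA section.
        if run:
            parts.append("<![CDATA[" + "".join(run) + "]]>")
            run.clear()

    for ch in text:
        try:
            ch.encode(encoding)
            run.append(ch)
        except UnicodeEncodeError:
            flush()
            parts.append(f"&#{ord(ch)};")
    flush()
    return "".join(parts)


wrapped = "<p>" + cdata_section("resum\u00e9's", "ascii") + "</p>"
print(wrapped)
print(len(wrapped), len(wrapped.encode("ascii")), len(wrapped.encode("utf-8")))
```

Because the result contains only US-ASCII characters, its character count, its US-ASCII byte count and its UTF-8 byte count are all the same.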

Needless to say, apart from this extreme case, it will be very hard to find out all the other ways in which the serializer may alter the text to output a conformant byte stream. I must admit that I've found this approach very refreshing, and with this Saxon-specific extension it comes pretty close to finding the exact byte length of the document (or segment) *after* serialization (including white space, indentation, escaping etc).

Thanks for the exercise ;)

Cheers,
-- Abel Braaksma
