Re: Re: XML/XHTML fragment to text
A couple of corrections on my previous statements on calculating the
size of the output byte stream, see below:
Abel Braaksma wrote:
I meant: 4 bytes. It *does not* correctly interpret cdata-section-elements on the xsl:output definition, but that's only a minor inconvenience (and an insignificant little bug in Saxon).

My mistake. I was trapped (again!) in the html-start-element-defaults-to-html-output-method atrocity of XSLT (not corrected in XSLT 2.0, probably for backward compatibility). This led to the cdata-section-elements attribute being ignored. Changing the method to "xml" or "xhtml" fixed the problem (as does changing the start element into something not <html>).

It does correctly interpret the omit-xml-declaration yes/no. And it correctly interprets all other kinds of stuff of the xsl:output. I.e., I tested the following attributes that can alter the serialization:

* cdata-section-elements
* doctype-public
* doctype-system
* encoding
* exclude-result-prefixes
* include-content-type
* indent
* media-type
* method
* normalization-form
* omit-xml-declaration
* standalone
* use-character-maps
* disable-output-escaping

I did not test the following though:

* escape-uri-attributes
* extension-element-prefixes (does not influence the outcome when used on xsl:output)
* undeclare-prefixes (only for XML 1.1 anyway)
* use-when (does not influence output, but switches the instruction on/off)
* version
* xml:space
* xsl:version (might influence the way things are escaped)

One attribute on xsl:output causes problems always, as far as I could tell, which is the following:

* byte-order-mark

When you use it together with UTF-8 it will offset the amount by one. This is because the byte order mark (xFEFF), when interpreted as a string, will be translated into the equivalent string representation in UTF-8, which is the byte sequence xEFBBBF, now representing the codepoint 65279 (U+FEFF) (Zero Width No Break Space, deprecated but allowed). This interpretation is in lieu of the Unicode recommendation. It is useless to put a BOM at the beginning of a UTF-8 stream, so it is best to avoid it.
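To make the byte-order-mark arithmetic concrete, here is a minimal Python sketch. Python is used purely for illustration and is not part of the Saxon setup discussed in this thread; the `sample` string and the helper name `char_and_byte_length` are made up for the example. It shows that U+FEFF counts as one character in a string but as three bytes once the string is encoded as UTF-8.

```python
# Illustration only: how a byte order mark counts in a string vs. in UTF-8 bytes.
BOM = "\ufeff"  # U+FEFF, ZERO WIDTH NO-BREAK SPACE

def char_and_byte_length(text, encoding="utf-8"):
    """Return (number of characters, number of bytes in the given encoding)."""
    return len(text), len(text.encode(encoding))

sample = "<p>abc</p>"  # hypothetical serialized fragment, ASCII only

print(ord(BOM))                            # 65279
print(BOM.encode("utf-8").hex().upper())   # EFBBBF
print(char_and_byte_length(sample))        # (10, 10)
print(char_and_byte_length(BOM + sample))  # (11, 13)
```

So a length taken on the string grows by one when the BOM is present, while the UTF-8 byte length grows by three.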
This was incorrect. The string will be radically different when, for instance, it is encoded in US-ASCII, and anything encoded in US-ASCII will always have the same representation in string-to-hexBinary if you use any of the non-IBM encodings, including UTF-8. In UTF-16 it will double, of course. Consider the following (extreme) example:

    <xsl:output name="output-def" method="xml" encoding="US-ASCII"
        cdata-section-elements="p" />

    <xsl:template name="main">
        <xsl:variable name="result-tree"><p>resumé's</p></xsl:variable>
        <xsl:variable name="serialized"
            select="saxon:serialize($result-tree, 'output-def')" />
        <xsl:variable name="hexBin"
            select="saxon:string-to-hexBinary($serialized, 'UTF-8')" />
        <xsl:variable name="length"
            select="string-length(xs:string($hexBin)) div 2" />
        ....
    </xsl:template>

Normally, the output in $serialized would look like the following (preceded by a byte order mark, U+FEFF):

    <p><![CDATA[resumé's]]></p>

But, because of the low encoding chosen, the serializer must move the é character out of the CDATA section and write it as a character reference, with the following as a result:

    <p><![CDATA[resum]]>&#xE9;<![CDATA['s]]></p>

Obviously, the lengths are quite different. The string size of the first is 28 and the second is 44. The UTF-8 byte sequence of the first is 28 (because of the interpretation of é) and 44 in the second (because US-ASCII is 100% compatible with the 1-byte sequences of UTF-8).

Needless to say, apart from this extreme case, it will be very hard to predict all the other ways in which the serializer may produce a conformant byte stream.

I must admit that I've found this approach very refreshing, and with this Saxon-specific extension it comes pretty close to finding the exact byte length of the document (or segment) *after* serialization (including white space, indentation, escaping etc.).

Thanks for the exercise ;)

Cheers,
-- Abel Braaksma
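The length comparison in the US-ASCII example above can be checked outside Saxon with a short Python sketch. It rests on two assumptions that are not confirmed by the post: that the serializer writes é as the six-character reference `&#xE9;`, and that the only difference between the two outputs is the CDATA splitting. Because the counts also depend on whether a leading U+FEFF is included, the sketch reports the first form both with and without it. The helper name `sizes` is made up for the example.

```python
# Illustration only: character count vs. byte count for the two serializations.
# Assumption: the character reference is written as &#xE9; (six characters).
utf8_form = "<p><![CDATA[resum\u00e9's]]></p>"
ascii_form = "<p><![CDATA[resum]]>&#xE9;<![CDATA['s]]></p>"

def sizes(text, encodings=("us-ascii", "utf-8", "utf-16-le")):
    """Return the character count plus the byte length per encoding.

    An encoding that cannot represent the text maps to None.
    """
    result = {"chars": len(text)}
    for enc in encodings:
        try:
            result[enc] = len(text.encode(enc))
        except UnicodeEncodeError:
            result[enc] = None  # e.g. é is not representable in US-ASCII
    return result

print(sizes(utf8_form))
# {'chars': 27, 'us-ascii': None, 'utf-8': 28, 'utf-16-le': 54}
print(sizes("\ufeff" + utf8_form))
# {'chars': 28, 'us-ascii': None, 'utf-8': 31, 'utf-16-le': 56}
print(sizes(ascii_form))
# {'chars': 44, 'us-ascii': 44, 'utf-8': 44, 'utf-16-le': 88}
```

The US-ASCII form has identical character and byte counts in US-ASCII and UTF-8 and doubles in UTF-16, while the form containing a literal é differs by one byte in UTF-8, plus three more if a BOM is present.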