Re: Initial whitespace in PI from XSLT, main body

Subject: Re: Initial whitespace in PI from XSLT, main body
From: "Michael Kay mike@xxxxxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Sat, 7 May 2022 22:07:59 -0000
Yes, Saxon strips any leading whitespace included in the content when you
create a processing instruction using XSLT or XQuery.

XQuery 3.1 mandates this in §3.9.3.5. XSLT 3.0 also does so, in §5.7.2.

The problem is that there is no way of serializing a PI in such a way that
leading whitespace in the content round-trips when the serialised output is
re-parsed. But the serialization spec mandates that you should serialize the
XML in such a way that round-tripping works.
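
The round-trip loss is easy to reproduce outside XSLT. The following is only a sketch, using Python's standard-library xml.dom.minidom rather than Saxon, so it illustrates the serialize-then-parse problem and not Saxon's own behaviour; the target name "syd" and the sample data are arbitrary.

```python
from xml.dom import minidom

# Build a tiny document and add a PI whose data starts with two spaces.
doc = minidom.parseString("<doc/>")
pi = doc.createProcessingInstruction("syd", "  padded")
doc.documentElement.appendChild(pi)

# minidom writes the target, one separator space, then the data as given.
serialized = doc.documentElement.toxml()

# Re-parse: the parser treats all whitespace after the target as separator.
reparsed = minidom.parseString(serialized)
round_tripped = reparsed.documentElement.firstChild.data

print(repr(pi.data))        # data held by the in-memory node
print(repr(serialized))     # what gets written out
print(repr(round_tripped))  # data seen after re-parsing
```

The in-memory node keeps its two leading spaces, but once written and re-read they are indistinguishable from the separator, which is why stripping them at construction time keeps the data model and the serialized form consistent.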

It's unfortunate that the Data Model in §6.5.1 doesn't state a constraint
that the content of a PI must not contain leading whitespace.

XSLT 1.0 didn't say that xsl:processing-instruction should strip leading
whitespace; and XSLT 2.0 didn't explicitly list this as an incompatible
change. But then, in XSLT 1.0, there is no way of reading a processing
instruction created by the transformation other than serialization followed by
parsing, and this process loses any leading whitespace.

Michael Kay
Saxonica


> On 7 May 2022, at 22:14, Bauman, Syd s.bauman@xxxxxxxxxxxxxxxx
<xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:
>
> [Could not post whole thing due to size limitation on list. Complete text
version and separate appendices are currently available at
<https://bauman.zapto.org/~syd/temp/DSG/Initial_whitespace_in_PI_from_XSLT/>.
Since that is not a permanent store (hence the “temp/” in the path), I
will post the appendices [A], [B], and [C], hopefully as a reply to this,
shortly.]
>
>
> I have discovered a discrepancy between Saxon[1] on the one hand and
xsltproc[2] & my intuition on the other when it comes to writing a processing
instruction whose string value starts with whitespace. E.g.
>   <?syd   This is a test. This is only a test. ?>
>
> Reading
> When reading this PI, I fully expect the string value to start with the
letter “T” and end with the string “t. ”. This makes sense because the
XML spec,[3] in production 16, defines a PI as
>   '<?' PITarget (S (Char* - (Char* '?>' Char*)))? '?>'
> where, of course, 'S' is one or more occurrences of any of the four
whitespace characters. While the value string is not really defined in the
prose, it is clear from the production that the S is only required if there is
a string. This implies that the purpose of the S is to separate the PITarget
from the string.[4] I am used to greedy matching, so it makes sense to me that
a parser would think of any and all whitespace immediately following the
PITarget as a delimiter, and thus not return it as part of the value string.
>
> I grant that, as far as my small brain can tell, it would not be against the
production for a parser to use non-greedy matching, decide only the first
whitespace character matches the S, and that all following whitespace
characters should match "Char*". But that is not what I expect, because it
seems to violate the spirit of the production: if that were the desired
result, why wouldn’t the spec use "(#x20 | #x9 | #xD | #xA)" between the
PITarget and the rest, rather than "(#x20 | #x9 | #xD | #xA)+"?[5]
Furthermore, if this were the parsing algorithm, it would be possible to end
up with a string value of a PI that contained nothing but whitespace
characters. While not utterly insane, it does seem to be the kind of
complication that is likely to be more trouble than it is worth. Besides, as I
said, I am used to greedy matching and expect writers of XML parsers to be
like me. :-)
>
> And, perhaps more importantly, the string value of a processing instruction
node in the XDM is defined as “The data part of the source PI, not including
the whitespace that separates it from the PITarget.”[6]
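
The two readings of production 16 discussed above can be compared with a pair of regular expressions. This is a rough sketch, not an XML parser: the PITarget pattern is deliberately simplified, and the names GREEDY, NON_GREEDY, and pi_data are made up for illustration.

```python
import re

# Simplified PITarget: any run of characters other than whitespace and "?".
# GREEDY lets S swallow every whitespace character after the target;
# NON_GREEDY lets S take only the first one.
GREEDY = re.compile(r"<\?([^\s?]+)(?:\s+(.*?))?\?>", re.DOTALL)
NON_GREEDY = re.compile(r"<\?([^\s?]+)(?:\s+?(.*?))?\?>", re.DOTALL)

def pi_data(pattern, text):
    """Return the data part of a PI under the given reading, or None."""
    m = pattern.fullmatch(text)
    return m.group(2) if m else None

sample = "<?syd   This is a test. This is only a test. ?>"
blank = "<?syd    ?>"  # target followed by four spaces and nothing else

print(repr(pi_data(GREEDY, sample)))      # data starts at "T"
print(repr(pi_data(NON_GREEDY, sample)))  # data keeps two leading spaces
print(repr(pi_data(GREEDY, blank)))       # empty string
print(repr(pi_data(NON_GREEDY, blank)))   # three spaces: whitespace-only data
```

Both patterns accept exactly the same inputs; they differ only in where the separator ends and the data begins, which is the ambiguity at issue.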
>
> Writing
> But what if I try to write a PI whose string value starts with one or more
whitespace characters?
>
> First, we know the processor is required to write out one or more whitespace
characters between the PITarget and the value string. I presume (without
knowing for sure) that the processor is welcome to use whatever set of
whitespace characters it wants to separate the PITarget from the rest when it
serializes a PI. (I have never seen nor heard of a processor that uses
anything other than a single space (U+0020) character, myself.) I further
suspect that most processors would choose to not use any whitespace characters
when serializing a PI that does not have a value string.
>
> But if I am explicitly giving the processor a string to use as the value of
the PI that starts with space, I sort of expect that string, including the
leading space, to appear in the output after whatever space the processor
normally uses to separate a PITarget from a value string. And that is the
behavior I get from xsltproc.[B]
<https://bauman.zapto.org/~syd/temp/DSG/Initial_whitespace_in_PI_from_XSLT/Appendix_B_xsltproc_output.xml>
But it is not the behavior I get from Saxon.[C]
<https://bauman.zapto.org/~syd/temp/DSG/Initial_whitespace_in_PI_from_XSLT/Appendix_C_Saxon_output.xml>
>
> So is Saxon in error, or is xsltproc in error, or is the spec ambiguous and
either behavior is OK, or something else?
>
> P.S. I have tried a few various combinations of the -strip: commandline
parameter to Saxon, and changing the program[A]
<https://bauman.zapto.org/~syd/temp/DSG/Initial_whitespace_in_PI_from_XSLT/Appendix_A_XSLT_and_input.xslt>
from an XSLT 1.0 pgm to an XSLT 3.0 pgm, same results.
>
>
> Notes
> [1] SaxonJ-HE 11.2 run in GNU bash on an Ubuntu 20.04.4 system.
> [2] Using libxml 20910, libxslt 10134 and libexslt 820 on same system.
> [3] <https://www.w3.org/TR/REC-xml/>
> [4] This becomes clearer if you reduce all that “any sequence of
characters except NOT "?>"” stuff to something simple:
>          '<?'  PITarget  (S (StringSansQuestionPointy) )?  '?>'
> [5] I have to admit, though, the fact that the spec lists the illegal
PITargets as “" XML ", " xml "”, putting spaces around the illegal Names,
gives me pause. If there were only a space after, it would really boggle my
thought process. But since there is space both before and after I suspect it
is not intended, and this is just an error or editorial style I disagree
with.
> [6] Kay, Michael, _XSLT 2.0 and XPath 2.0_, 4th ed. Wiley Publishing, Inc.,
Indianapolis, IN. p. 51.
>
>
