Re: Bug in 'xsl:sort'. ( XT vs SAXON. )
Paul,

>> If you go a little further on in the XSLT Recommendation, it says:
>>
>> "NOTE: It is possible for two conforming XSLT processors not to sort
>> exactly the same. Some XSLT processors may not support some languages.
>> Furthermore, there may be variations possible in the sorting of any
>> particular language that are not specified by the attributes on xsl:sort,
>> for example, whether Hiragana or Katakana is sorted first in Japanese.
>
>This is not the case here, right? ( Actually I don't understand
>why something other than UTF * should be supported
>by W3C standards, but that's another story ).

Well, I thought it might be the case here: that this might be a variation in the sorting of English (the particular language) not specified by the attributes on xsl:sort. For example, one might rationally use the rule 'ignore hyphens' when sorting, thinking that hyphens do not add semantic information to a term, or 'ignore hyphens only in the middle of words', or 'ignore hyphens when they are not followed by a number', and so on. I don't think any of these rules is unreasonable, and in certain situations they will lead to different results.

>> Future versions of XSLT may provide additional attributes to provide
>> control over these variations. Implementations may also use
>> implementation-specific namespaced attributes on xsl:sort for this.
>
>This is also not the case, right ?

In that we are not using a future version of XSLT, and neither SAXON nor XT has documented implementation-specific namespaced attributes to determine sort order, yes.

>> NOTE: It is recommended that implementers consult [UNICODE TR10] for
>> information on internationalized sorting."
>>
>> The values should be sorted "lexicographically in the culturally correct
>> manner for the language specified by lang" but I guess the question arises
>> in English (as it does in other languages) about whether '-' is
>> lexicographically before '0' or not.
>
>Right.
>But I'm not sure the question is about 'English'. I think the
>question really is 'in UTF8' ?

I disagree. The xsl:sort documentation says: "'text' specifies that the sort keys should be sorted lexicographically in the culturally correct manner for the language specified by lang". I'm assuming that the default language in Sebastian's files is English. Thus the sort should be done in English.

I am no expert on character encoding, but as far as I understand it, the UTF8 values for ASCII characters all come before the UTF8 values for accented characters. If you sorted on UTF8 character value, 'z' would come before 'á', whereas you would expect 'a' and all its associated accents to be grouped together.

If you look at the UNICODE basekey file [http://www.unicode.org/unicode/reports/tr10/basekeys.txt], you can see that there are groups of characters with all different kinds of UTF8 values. For example, all those zeros that I extracted and sent in my last mail come before another set of ones from various languages.

Sorting on UTF8 value is basically a dangerous way to sort characters if you're dealing with anything bar bare English, and even with just English, as we have seen, punctuation and spacing still provide problem areas.

>Why? There is no special encodings or special sorting attributes.
>Both engines receive the same 'lang' environment ( Or they don't??? ) ,
>why they employ different sort orders?

Probably because Mike Kay and James Clark think that different rules apply to sorting in English, although it's possible that one of the processors is sorting based on something other than a lang-dependent order.

>I still think something is strange here. They both are sorting UTF8 (?)
>without any special cases mentioned in the W3C paper and the
>question is : "in UTF8(?) what comes first '-' or '0' ?" - Right?
>Is it legal they are giving the different answers to the same question?

No, the question is: "in English, what comes first: '-' or '0'?".
It is legal for them to give different answers; it's even compliant of them. It's just not particularly helpful :)

>> Eventually the differences between them should be
>> diminished through the specification of additional attributes.
>
>Pardon, what attributes do you mean ???

From the XSLT Recommendation: "Future versions of XSLT may provide additional attributes to provide control over these variations. Implementations may also use implementation-specific namespaced attributes on xsl:sort for this."

For example, Mike could add an extension attribute to xsl:sort called saxon:ignore-hyphens. When the value is 'yes', hyphens are simply ignored (and '-1' will sort after '0'); when the value is 'no', hyphens are taken into account (and '-1' will sort before '0').

Or in the next version of XSLT, there might be an xsl2:alternate-weighting attribute defined on xsl:sort with the values 'blanked', 'non-ignorable' and 'shifted', each giving different weightings to collation elements like hyphens and spaces as described in [http://www.unicode.org/unicode/reports/tr10/index.html#Alternate Weighting].

>I now think maybe this is the bug in XT ?

It's certainly possible that XT doesn't employ lang-specific lexicographic sort orders, but I think it's unlikely.

Ideally, XSLT Processors would document the rules they use to sort text; the differences between them would form the input into the set of attributes for xsl:sort in the next version of XSLT; and all the XSLT Processors would then implement the variant sorts. Then you, as the stylesheet author, would be able to specify which type of sort you wanted, and be able to get it consistently across XSLT Processors. But I don't think that this is a matter of 'right' and 'wrong' at the moment.
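
To make the two points above concrete, here is a rough sketch in Python (not XSLT, and not what either SAXON or XT actually does). It contrasts sorting by raw character value with two simplified, hypothetical sort keys: one that groups accented letters with their base letter, and one that applies an 'ignore hyphens' rule. The function names are invented for illustration, and the keys are a crude stand-in for a real collation algorithm such as the one described in TR10.

```python
import unicodedata


def codepoint_key(s):
    # Sort by raw character value, with no language-specific rules.
    return [ord(c) for c in s]


def base_letter_key(s):
    # Hypothetical simplification of a collation key: decompose the string,
    # drop combining marks so accented letters group with their base letter,
    # and use the original string only as a tie-breaker.
    decomposed = unicodedata.normalize("NFD", s)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return (base.lower(), s)


def ignore_hyphens_key(s):
    # Hypothetical 'ignore hyphens' rule: hyphens carry no primary weight,
    # and the original string only breaks ties.
    return (s.replace("-", ""), s)


words = ["zebra", "\u00e1baco", "apple"]  # "\u00e1" is a-acute
by_codepoint = sorted(words, key=codepoint_key)
by_base_letter = sorted(words, key=base_letter_key)

terms = ["0", "-1"]
hyphens_significant = sorted(terms, key=codepoint_key)
hyphens_ignored = sorted(terms, key=ignore_hyphens_key)

print(by_codepoint)         # 'z' sorts before a-acute
print(by_base_letter)       # a-acute groups with 'a'
print(hyphens_significant)  # '-1' before '0'
print(hyphens_ignored)      # '0' before '-1'
```

Both orderings of '-1' and '0' come from rules that are reasonable on their own terms, which is why two conforming processors can disagree.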

Cheers,

Jeni

Dr Jeni Tennison
Epistemics Ltd * Strelley Hall * Nottingham * NG8 6PE
tel: 0115 906 1301 * fax: 0115 906 1304 * email: jeni.tennison@xxxxxxxxxxxxxxxx

XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list