Subject: Re: National Language Collating Sequences and Index Generation
From: Joerg Pietschmann <joerg.pietschmann@xxxxxx>
Date: Fri, 08 Feb 2002 10:39:03 +0100
"W. Eliot Kimber" <eliot@xxxxxxxxxx> wrote:
> I have to generate back-of-the-book indexes for many national languages,
> including Arabic, Hebrew, Thai, Simplified Chinese, Traditional Chinese,
> Korean, and Japanese. I've successfully adapted the Docbook index
> generation code to produce the basic index, but now I'm faced with the
> challenge of both doing correct sorting for these languages and
> generating the appropriate index groups.

That's an interesting topic, and a real problem that is widely
acknowledged but in general not quite solved.
In XSLT 1.0, xsl:sort often ends up sorting strings lexically by
Unicode code point number, IIRC; the lang attribute is supposed to
select language-specific ordering, but support for it depends on the
processor. Localized sorting by a single character should also be
relatively easy to implement if you can get hold of the collating
sequence:

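  <xsl:stylesheet ...
     xmlns:coll="my.collating.sequence">
  <!-- lookup table: one entry per character, numbered in sort order -->
  <coll:sequence>
    <char char="A" number="1"/>
    <char char="B" number="2"/>
   ...
  </coll:sequence>
  <!-- select the char entries themselves so the @char predicate applies -->
  <xsl:variable name="collseq" select="document('')/*/coll:sequence/char"/>
  ...
    <xsl:for-each select="$items">
      <!-- data-type="number" so that 10 sorts after 2 -->
      <xsl:sort data-type="number"
                select="$collseq[@char=substring(current()/name,1,1)]/@number"/>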

You can try to add
      <xsl:sort data-type="number"
                select="$collseq[@char=substring(current()/name,2,1)]/@number"/>
and so on for more complete lexical sorting.
It may help that the sort keys can be fractional numbers, which lets
you slot characters in between existing entries:
    <char char="A" number="1"/>
    <char char="&#xC4;" number="1.1"/> <!-- A with diaeresis -->
    <char char="a" number="1.5"/>
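
For reference, the fragments above could be assembled into a complete
XSLT 1.0 stylesheet roughly as follows. This is only a sketch: the
five-entry table, the items/item/name input shape and the text output
are assumptions made for illustration, and document('') only works if
the processor can re-read the stylesheet from its own URI.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:coll="my.collating.sequence"
    exclude-result-prefixes="coll">

  <xsl:output method="text" encoding="UTF-8"/>

  <!-- Toy collating table (an assumption, far from complete). -->
  <coll:sequence>
    <char char="A" number="1"/>
    <char char="&#xC4;" number="1.1"/>
    <char char="a" number="1.5"/>
    <char char="B" number="2"/>
    <char char="b" number="2.5"/>
  </coll:sequence>

  <!-- The char entries of the table embedded in this stylesheet. -->
  <xsl:variable name="collseq" select="document('')/*/coll:sequence/char"/>

  <!-- Assumed input: <items><item><name>Ba</name></item> ... </items> -->
  <xsl:template match="/items">
    <xsl:for-each select="item">
      <!-- Primary key: table number of the first character. -->
      <xsl:sort data-type="number"
                select="$collseq[@char = substring(current()/name, 1, 1)]/@number"/>
      <!-- Secondary key: table number of the second character. -->
      <xsl:sort data-type="number"
                select="$collseq[@char = substring(current()/name, 2, 1)]/@number"/>
      <xsl:value-of select="name"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

A name that is shorter than the position being tested, or that contains
a character missing from the table, gives an empty node-set and hence
NaN as the key. Processors generally put NaN before all other numbers,
so "A" should come out ahead of "Ab", but that is worth checking with
the processor you actually use.
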
The caveats are that you had better have a complete collating sequence
(a character missing from the table gets no sort key at all), and that
you shouldn't expect great performance, especially if you add a lot of
sort clauses. You may also run afoul of unexpected character
normalisation issues: users could expect that &#xE4; and
&#x61;&#x0308; are interchangeable (at least I think so).

In XSLT/XPath 2.0, you can have named collating sequences, but you
shouldn't expect that the ones you need will be provided by the
runtime system :-((((
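
A minimal XSLT 2.0 sketch of what that might look like, again assuming
an items/item/name input. The parameter name collation-uri is made up
for this example. Its default is the Unicode code point collation,
which is the only collation you can count on everywhere; any
language-specific URI is processor-defined and has to come from your
processor's documentation. The normalize-unicode() call also takes
care of the precomposed vs. combining-character issue mentioned above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text" encoding="UTF-8"/>

  <!-- Hypothetical parameter: the collation URI is passed in by the
       caller. The default below is the code point collation; replace
       it with a URI your processor documents for the target language. -->
  <xsl:param name="collation-uri"
      select="'http://www.w3.org/2005/xpath-functions/collation/codepoint'"/>

  <!-- Assumed input: <items><item><name>...</name></item> ... </items>,
       with exactly one name per item. -->
  <xsl:template match="/items">
    <xsl:for-each select="item">
      <!-- NFC-normalise the key so that &#xE4; and a + U+0308 compare
           as the same string before the collation is applied. -->
      <xsl:sort select="normalize-unicode(name, 'NFC')"
                collation="{$collation-uri}"/>
      <xsl:value-of select="name"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

With the default URI this still sorts by code point, so it only becomes
useful once the processor offers a collation for the language at hand.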

HTH
J.Pietschmann

 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

