Subject: Re: National Language Collating Sequences and Index Generation
From: Joerg Pietschmann <joerg.pietschmann@xxxxxx>
Date: Fri, 08 Feb 2002 10:39:03 +0100

"W. Eliot Kimber" <eliot@xxxxxxxxxx> wrote:
> I have to generate back-of-the-book indexes for many national languages,
> including Arabic, Hebrew, Thai, Simplified Chinese, Traditional Chinese,
> Korean, and Japanese. I've successfully adapted the Docbook index
> generation code to produce the basic index, but now I'm faced with the
> challenge of both doing correct sorting for these languages and
> generating the appropriate index groups.

That's an interesting topic: a real problem that is widely acknowledged
but, in general, not quite solved.
In XSLT 1.0, xsl:sort sorts strings lexically by Unicode code point
number, IIRC. Localized sorting by a single character should also be
relatively easy to implement if you can get hold of the collating
sequence:

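  <xsl:stylesheet ...
     xmlns:coll="my.collating.sequence">
  <coll:sequence>
    <char char="A" number="1"/>
    <char char="B" number="2"/>
   ...
  </coll:sequence>
  <!-- select the char elements themselves, so that the predicate
       in the sort key can test @char -->
  <xsl:variable name="collseq" select="document('')/*/coll:sequence/char"/>
  ...
    <xsl:for-each select="$items">
      <!-- data-type="number": compare the keys as numbers, not as text -->
      <xsl:sort select="$collseq[@char=substring(current()/name,1,1)]/@number"
                data-type="number"/>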

You can try to add
      <xsl:sort select="$collseq[@char=substring(current()/name,2,1)]/@number"
                data-type="number"/>
and so on for more complete lexical sorting.
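
Put together, the fragments above might look like the following complete
XSLT 1.0 stylesheet. This is only a sketch under some assumptions: the
input shape (an items root holding item elements with a name child) is
made up for illustration, the four-entry collating sequence is a toy, and
document('') only works if the processor can re-read the stylesheet from
its own URI.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:coll="my.collating.sequence"
    exclude-result-prefixes="coll">

  <xsl:output method="text"/>

  <!-- Toy collating sequence: a real one must list every character
       that can occur in the data. -->
  <coll:sequence>
    <char char="A" number="1"/>
    <char char="a" number="2"/>
    <char char="B" number="3"/>
    <char char="b" number="4"/>
  </coll:sequence>

  <!-- The char elements of the table embedded in this stylesheet. -->
  <xsl:variable name="collseq" select="document('')/*/coll:sequence/char"/>

  <!-- Assumed input: <items><item><name>...</name></item>...</items> -->
  <xsl:template match="/items">
    <xsl:for-each select="item">
      <!-- First key: weight of the first character of name. -->
      <xsl:sort select="$collseq[@char=substring(current()/name,1,1)]/@number"
                data-type="number"/>
      <!-- Second key: weight of the second character, used for ties. -->
      <xsl:sort select="$collseq[@char=substring(current()/name,2,1)]/@number"
                data-type="number"/>
      <xsl:value-of select="name"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

With names Ba, ab, AB and b as input, this should print AB, ab, Ba, b.
A character that is missing from the table (or a name that is too short
to have a second character) yields an empty node-set, which becomes NaN
as a number, and NaN sorts before all other values in ascending order.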
It may also be useful that you can define fractional numbers for
the sorting keys:
    <char char="A" number="1"/>
    <char char="&Auml;" number="1.1"/> <!-- sorry for the entity :-) -->
    <char char="a" number="1.5"/>
The caveats are that you had better have a complete collating sequence,
and that you shouldn't expect great performance, especially if you
add a lot of sort clauses. There is also the possibility that you run
afoul of unexpected character normalisation issues: users could expect
that &#xE4; and &#x61;&#x0308; are interchangeable (at least I think so).
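
To make the normalisation caveat concrete, here is a small sketch. Note
that it assumes an XSLT 2.0 processor, because XSLT 1.0 has no
normalisation function; it uses the XPath 2.0 functions
normalize-unicode() and codepoint-equal(). It can be run against any
input document.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- One code point: LATIN SMALL LETTER A WITH DIAERESIS. -->
    <xsl:variable name="precomposed" select="'&#xE4;'"/>
    <!-- Two code points: "a" followed by COMBINING DIAERESIS. -->
    <xsl:variable name="decomposed" select="'&#x61;&#x308;'"/>

    <!-- Compare code point by code point, before and after NFC. -->
    <xsl:text>raw: </xsl:text>
    <xsl:value-of select="codepoint-equal($precomposed, $decomposed)"/>
    <xsl:text>&#10;NFC: </xsl:text>
    <xsl:value-of select="codepoint-equal(normalize-unicode($precomposed, 'NFC'),
                                          normalize-unicode($decomposed, 'NFC'))"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

This should print "raw: false" and "NFC: true". With the table lookup in
XSLT 1.0, substring(name,1,1) would return a plain "a" for the decomposed
form, so the input probably has to be normalised before it reaches the
stylesheet.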

In XSLT/XPath 2.0, you can have named collating sequences, but you
shouldn't expect that the ones you need are provided by the runtime
system :-((((
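
For what it's worth, selecting a named collation might look like the
sketch below, based on the collation attribute of xsl:sort as it appears
in the final XSLT 2.0 Recommendation. The parameter name collation-uri
and the input shape are assumptions for illustration. The only collation
a processor is required to provide is the Unicode code point collation
used as the default here; any language-specific URI is
implementation-defined and has to come from your processor's
documentation.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Hypothetical parameter. The default is the Unicode code point
       collation; override it with a URI that your processor documents
       for the language at hand. -->
  <xsl:param name="collation-uri"
      select="'http://www.w3.org/2005/xpath-functions/collation/codepoint'"/>

  <!-- Assumed input: <items><item><name>...</name></item>...</items> -->
  <xsl:template match="/items">
    <xsl:for-each select="item">
      <!-- collation is an attribute value template, so the URI can be
           supplied at run time. -->
      <xsl:sort select="name" collation="{$collation-uri}"/>
      <xsl:value-of select="name"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

Keeping the URI in a parameter means the same stylesheet can be reused
per language, provided the runtime actually offers a suitable collation.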

HTH
J.Pietschmann

 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
