
Re: XSLT function for title capitalization?

Subject: Re: XSLT function for title capitalization?
From: "Liam R. E. Quin liam@xxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Tue, 10 Apr 2018 06:19:32 -0000
On Mon, 2018-04-09 at 20:52 +0000, David Sewell dsewell@xxxxxxxxxxxx
wrote:
> Wondering if anyone has a serviceable function (preferably in XSLT 2/3
> but v1 is fine if it works) that takes a string as input and returns
> it with title capitalization according to English-language editorial
> practice (for example, Chicago Manual of Style).

I'd use replace() probably, rather than tokenizing, so as to change as
little as possible & facilitate regression tests.

Some test cases should include
* words that do and don't change at the start and at the end of input;
* words like o'clock and don't that include apostrophes, both as '
  and as ’ (it doesn't matter whether they are input as entities,
  literally, or as numeric character references, though, as they all
  end up the same after XML parsing)
* hyphenated proper names like Rees-Mogg
* exceptions like Ladies-in-Waiting
* punctuation such as em dashes, quotes, commas, semicolons

Unfortunately XSLT doesn't give us Perl's wonderful e modifier on
substitution, and neither does XQuery (where it'd be more useful), but
XSLT does give us xsl:analyze-string. I'd start with David Carlisle's
approach and add a lot of test cases and fix the regexp to be something
more like
   (\w)(\w*(?:'\w+)?)
maybe.
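
To see what that regexp does to the apostrophe cases, here is a minimal sketch in Python (not XSLT), where re.sub with a function plays the role of Perl's e modifier or of xsl:matching-substring. The function name is made up for illustration, and accepting ’ (U+2019) alongside ' is an assumption added here, not part of the pattern above.

```python
import re

# The pattern suggested above: one word character, then the rest of the
# word, optionally followed by an apostrophe and more word characters.
# Assumption for this sketch: U+2019 is accepted as an apostrophe too.
WORD = re.compile(r"(\w)(\w*(?:['\u2019]\w+)?)")


def capitalize_words(text: str) -> str:
    """Upper-case the first character of each word and change nothing else."""
    return WORD.sub(lambda m: m.group(1).upper() + m.group(2), text)


if __name__ == "__main__":
    print(capitalize_words("five o'clock and don't"))
    print(capitalize_words("ladies-in-waiting"))
```

This capitalizes every word, so small words (of, the, in) and exceptions such as Ladies-in-Waiting would still need a stop list or an override table on top of it.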

An alternative is to replace (\w)'(\w) with $1E$2 everywhere, where E
is some Unicode upper-case letter or sequence of letters that
definitely doesn't occur in your input, and change it back at the end.
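
A rough Python sketch of that placeholder round trip, assuming U+01B5 (an upper-case letter picked arbitrarily for this example) never occurs in the input; the function name is hypothetical too. The placeholder has to be a letter so that the word regexp treats the hidden apostrophe as part of the word.

```python
import re

# Hypothetical stand-in for "E": an upper-case letter assumed to be absent
# from the input (LATIN CAPITAL LETTER Z WITH STROKE).
MARK = "\u01B5"


def capitalize_with_placeholder(text: str) -> str:
    if MARK in text:
        raise ValueError("placeholder already occurs in the input")
    # Step 1: hide apostrophes that sit between two word characters.
    hidden = re.sub(r"(\w)'(\w)", r"\1" + MARK + r"\2", text)
    # Step 2: a plain word regexp now sees o'clock and don't as one word each.
    capped = re.sub(r"(\w)(\w*)",
                    lambda m: m.group(1).upper() + m.group(2),
                    hidden)
    # Step 3: change the placeholder back into an apostrophe.
    return capped.replace(MARK, "'")
```

One thing to watch: because each match consumes the letters on both sides, a word with two apostrophes separated by a single letter (rock'n'roll) only gets its first apostrophe hidden, so it needs its own test case.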

In XSLT 1 i'd cry for a while and then write something recursive that
split its input using translate() and substring-before() to find where
to split.
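
The shape of that recursion, sketched in Python rather than XSLT 1 so it can be run directly. The separator set and the function name are assumptions for illustration; str.translate stands in for translate() and the index lookup for substring-before().

```python
# Assumed separator set for this sketch; apostrophes are deliberately left
# out so that o'clock and don't stay whole. \u2014 is the em dash.
SEPARATORS = " \t\n-,;:.!?\"()\u2014"

# Like translate($text, $separators, '   ...'): every separator maps to a
# space, padded to the same length so that nothing is deleted.
TO_SPACE = str.maketrans(SEPARATORS, " " * len(SEPARATORS))


def capitalize_recursive(text: str) -> str:
    if not text:
        return ""
    flat = text.translate(TO_SPACE)
    # Like string-length(substring-before(concat($flat, ' '), ' ')).
    word_len = (flat + " ").index(" ")
    if word_len == 0:
        # Leading separator: copy it through unchanged and recurse.
        return text[0] + capitalize_recursive(text[1:])
    # Split the ORIGINAL text at the position found in the flattened copy.
    word = text[:word_len]
    return word[0].upper() + word[1:] + capitalize_recursive(text[word_len:])
```

Each call consumes one word or one separator, which is fine for titles but would hit Python's recursion limit on very long input.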

For https://words.fromoldbooks.org/Chalmers-Biography/ i use Perl, as
the input isn't well-formed XML at first, with a table of manual
overrides, but there are fewer than 10,000 entries i think. Once it's
in XML my script/Makefile for conversion does use XSLT, taking 46
seconds to process 43MBytes of XML into 9771 separate XML files with
Saxon.

Liam


-- 
Liam Quin, W3C, http://www.w3.org/People/Quin/
Staff contact for Verifiable Claims WG, SVG WG, XQuery WG
Improving Web Advertising: https://www.w3.org/community/web-adv/
Personal: awesome vintage art: http://www.fromoldbooks.org/
