Re: Getting the Base Character of Character with Diac

Subject: Re: Getting the Base Character of Character with Diacritic
From: Abel Braaksma <abel.online@xxxxxxxxx>
Date: Tue, 19 Sep 2006 19:59:34 +0200

Sorry for reopening an otherwise closed thread, but I thought this would be of interest to the questioner. I just stumbled across this definition while researching something else. You can find it in the official XML TR at W3C: http://www.w3.org/TR/2006/REC-xml-20060816/#NT-CombiningChar.

In addition, I thought of using the following regular expression, if you find Michael Kay's solution sufficient. It does about the same thing, but I consider it slightly more readable (opinions may vary ;-)

replace($yourtext, '(\p{IsCombiningDiacriticalMarks}|\p{IsCombiningMarksforSymbols})', '')
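
For readers who want to try the idea outside an XSLT processor, here is a minimal Python sketch of the two-step approach discussed in this thread: decompose with NFKD, then delete every character in the two blocks named in the expression above. It is an analogue, not the XPath itself. The ranges U+0300 to U+036F (Combining Diacritical Marks) and U+20D0 to U+20FF (Combining Marks for Symbols) are the Unicode block boundaries, and the function name `strip_marks` is hypothetical.

```python
import re
import unicodedata

# Combining Diacritical Marks (U+0300-U+036F) and
# Combining Marks for Symbols (U+20D0-U+20FF), written as explicit ranges
# because Python's re module has no \p{Is...} block escapes.
COMBINING_BLOCKS = re.compile("[\u0300-\u036F\u20D0-\u20FF]")


def strip_marks(text: str) -> str:
    """Decompose text with NFKD, then drop characters in the two blocks.

    strip_marks is a hypothetical helper name used for illustration.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return COMBINING_BLOCKS.sub("", decomposed)


if __name__ == "__main__":
    print(strip_marks("Crème brûlée"))  # prints "Creme brulee"
```

The regular expression on its own only removes marks that are already separate characters; a precomposed letter such as U+00E9 passes through unchanged unless it is decomposed first, which is why the normalization step is included.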

The following is probably the most complete definition of characters that can be combined with other characters. For the sake of completeness, I copy it here (note: this is Unicode 3.1, because the XML spec only requires Unicode 3.1 conformance; code points added in 3.2 and 4.0, like x034B-034F, are not in this list):

CombiningChar ::= [#x0300-#x0345] | [#x0360-#x0361] | [#x0483-#x0486] | [#x0591-#x05A1] | [#x05A3-#x05B9] | [#x05BB-#x05BD] | #x05BF | [#x05C1-#x05C2] | #x05C4 | [#x064B-#x0652] | #x0670 | [#x06D6-#x06DC] | [#x06DD-#x06DF] | [#x06E0-#x06E4] | [#x06E7-#x06E8] | [#x06EA-#x06ED] | [#x0901-#x0903] | #x093C | [#x093E-#x094C] | #x094D | [#x0951-#x0954] | [#x0962-#x0963] | [#x0981-#x0983] | #x09BC | #x09BE | #x09BF | [#x09C0-#x09C4] | [#x09C7-#x09C8] | [#x09CB-#x09CD] | #x09D7 | [#x09E2-#x09E3] | #x0A02 | #x0A3C | #x0A3E | #x0A3F | [#x0A40-#x0A42] | [#x0A47-#x0A48] | [#x0A4B-#x0A4D] | [#x0A70-#x0A71] | [#x0A81-#x0A83] | #x0ABC | [#x0ABE-#x0AC5] | [#x0AC7-#x0AC9] | [#x0ACB-#x0ACD] | [#x0B01-#x0B03] | #x0B3C | [#x0B3E-#x0B43] | [#x0B47-#x0B48] | [#x0B4B-#x0B4D] | [#x0B56-#x0B57] | [#x0B82-#x0B83] | [#x0BBE-#x0BC2] | [#x0BC6-#x0BC8] | [#x0BCA-#x0BCD] | #x0BD7 | [#x0C01-#x0C03] | [#x0C3E-#x0C44] | [#x0C46-#x0C48] | [#x0C4A-#x0C4D] | [#x0C55-#x0C56] | [#x0C82-#x0C83] | [#x0CBE-#x0CC4] | [#x0CC6-#x0CC8] | [#x0CCA-#x0CCD] | [#x0CD5-#x0CD6] | [#x0D02-#x0D03] | [#x0D3E-#x0D43] | [#x0D46-#x0D48] | [#x0D4A-#x0D4D] | #x0D57 | #x0E31 | [#x0E34-#x0E3A] | [#x0E47-#x0E4E] | #x0EB1 | [#x0EB4-#x0EB9] | [#x0EBB-#x0EBC] | [#x0EC8-#x0ECD] | [#x0F18-#x0F19] | #x0F35 | #x0F37 | #x0F39 | #x0F3E | #x0F3F | [#x0F71-#x0F84] | [#x0F86-#x0F8B] | [#x0F90-#x0F95] | #x0F97 | [#x0F99-#x0FAD] | [#x0FB1-#x0FB7] | #x0FB9 | [#x20D0-#x20DC] | #x20E1 | [#x302A-#x302F] | #x3099 | #x309A
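
To use the production above programmatically, one option is to parse its text into code point ranges and test characters against them. The Python sketch below does that for a short excerpt of the list. The names `EXCERPT`, `parse_production`, `is_combining` and `strip_combining` are hypothetical, and the full right-hand side of the production can be pasted in place of the excerpt.

```python
import re

# A short excerpt of the CombiningChar production quoted above; the full
# right-hand side can be pasted here unchanged.
EXCERPT = "[#x0300-#x0345] | [#x0360-#x0361] | #x05BF | [#x20D0-#x20DC] | #x20E1"

# Matches either a range "[#xAAAA-#xBBBB]" or a single code point "#xAAAA".
TERM = re.compile(r"\[#x([0-9A-Fa-f]+)-#x([0-9A-Fa-f]+)\]|#x([0-9A-Fa-f]+)")


def parse_production(text):
    """Return a list of inclusive (low, high) code point pairs."""
    ranges = []
    for low, high, single in TERM.findall(text):
        if single:
            ranges.append((int(single, 16), int(single, 16)))
        else:
            ranges.append((int(low, 16), int(high, 16)))
    return ranges


RANGES = parse_production(EXCERPT)


def is_combining(ch):
    """True if the single character ch falls in one of the parsed ranges."""
    return any(low <= ord(ch) <= high for low, high in RANGES)


def strip_combining(text):
    """Remove every character matched by the parsed ranges."""
    return "".join(ch for ch in text if not is_combining(ch))
```

As with the regular expressions in this thread, this only removes marks that are present as separate characters, so precomposed letters need an NFKD (or NFD) decomposition first.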

Jeff Sese wrote:

Thanks Abel, Colin and Sir Mike for the suggestions; it was what I wanted. -- Jeff

Michael Kay wrote:
Following up on suggestions from others, if NFKD is supported then the
following should work reasonably well for European languages:
replace(normalize-unicode($in, 'NFKD'), '[&#x0300;-&#x036F;]', '')

or if you prefer
