Re: special character encoding, two problems

Subject: Re: special character encoding, two problems
From: "Graydon graydon@xxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Thu, 23 Oct 2014 21:13:45 -0000

On Thu, Oct 23, 2014 at 08:39:11PM -0000, Jonina Dames jdames@xxxxxxxxx
scripsit:
> Thanks for the advice! The <xsl:value-of
> select="normalize-unicode(replace(normalize-unicode(.,'NFKD'),'\p{Mn}',''),'NFKC')"
> /> function works for most of the entities, but it's still missing a
> couple dozen characters. 

Terminology pedant time --

&#x00e9; is a numeric entity and exactly the same thing as é, just
written differently.

&eacute; is a named entity reference (which had better be defined
somewhere)

Either, as soon as the XML document is parsed, turns into U+00E9 in some
internal representation, and they're not different from each other or
from the representation for é if someone had typed that directly in the
UTF-8 input file.
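
To make that concrete outside XSLT, here is a minimal sketch in Python using the standard-library ElementTree parser. The element name "name" and the internal DTD subset that declares eacute are assumptions made up for the illustration; they are not from the original poster's data.

```python
import xml.etree.ElementTree as ET

# Three spellings of the same character: a numeric character reference,
# a named entity declared in an internal DTD subset, and the literal letter.
numeric = ET.fromstring("<name>&#x00e9;</name>").text
named = ET.fromstring(
    '<!DOCTYPE name [<!ENTITY eacute "&#x00e9;">]><name>&eacute;</name>'
).text
literal = ET.fromstring("<name>\u00e9</name>").text

# After parsing, each one is the single code point U+00E9.
same_after_parsing = numeric == named == literal == "\u00e9"
code_points = [hex(ord(ch)) for ch in numeric]
```

Once the parser has run, nothing downstream can tell which of the three spellings was in the input file.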

So when you say "entity" here I'm getting the nervous feeling that I
don't know what you mean; can you provide some examples?

> Some of the author names still have unicode entities instead of plain
> ascii (for example, several characters with a stroke, several
> ligatures, thorn characters, upper and lowercase). Is there a

Well, examples would be good, but thorn, for example, &#x00FE; which is
the self-same code point as þ, doesn't involve a modifier; it's one
whole letter that doesn't exist inside ASCII.

Stripping the modifiers -- which will give you e from é if you decompose
é first, because then it's e + U+0301 (the combining acute accent), which
you could write &#x0065; + &#x0301; and it would be the same -- doesn't
do anything because there is no modifier there, it's just the single
code-point for thorn.
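
A rough Python equivalent of the quoted normalize-unicode/replace expression shows the difference. This is a sketch, not the XSLT itself: the helper name strip_marks is made up, and because Python's re module has no \p{Mn}, it checks the Unicode general category directly instead.

```python
import unicodedata

def strip_marks(text: str) -> str:
    """Decompose (NFKD), drop nonspacing marks (category Mn), recompose (NFKC)."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return unicodedata.normalize("NFKC", without_marks)

# é decomposes to e + U+0301, so removing the mark leaves a plain e.
e_acute_parts = [hex(ord(ch)) for ch in unicodedata.normalize("NFKD", "\u00e9")]

# Thorn has no decomposition at all, so there is nothing to strip.
thorn_parts = [hex(ord(ch)) for ch in unicodedata.normalize("NFKD", "\u00fe")]
```

Here strip_marks("caf\u00e9") gives "cafe", while thorn (U+00FE) and o-with-stroke (U+00F8) come back unchanged, which matches the "couple dozen characters" that survive the original expression.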

> variation of this function or a parameter that will catch and convert
> ALL of these to plain ascii, as well as the standard acute and cedil
> characters? Or do I need to address these outlying characters with
> something else (not translate, since I can't use a one-to-one
> replacement for ligature entities)?

ASCII, strictly, is seven-bit; there are lots of things you can't
represent in ASCII.  &#x00e9; *is not* ASCII just because those eight
characters all happen to be ASCII characters.
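
The distinction can be checked mechanically. A small Python sketch, in which the variable names are just for illustration:

```python
# The character itself is outside the seven-bit range.
e_acute = "\u00e9"
char_is_ascii = e_acute.isascii()

# The eight-character reference is spelled entirely with ASCII characters...
reference = "&#x00e9;"
reference_is_ascii = reference.isascii()

# ...but it is only an ASCII *spelling* of a non-ASCII code point. Python's
# "xmlcharrefreplace" error handler produces the decimal form of the same thing.
ascii_bytes = e_acute.encode("ascii", errors="xmlcharrefreplace")
```

Any XML parser reading those bytes back hands you U+00E9 again, so the data has not become ASCII; only its serialization has.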

So it sounds like you're trying to (either) map U+00FE, þ, to &thorn; or
something like that (which is not, I cannot stress too much, ASCII; it
might be an ASCII representation of a non-ASCII code-point, but it's
still a non-ASCII code-point) or have þ decompose into t+h or something
of that ilk.  (Which is at least actually ASCII.)

Either way you'd have to use character mappings for those; there aren't
any modifiers to remove.
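
As an illustration of the idea (in Python rather than an XSLT character map), here is a sketch of an explicit one-to-many mapping applied before the mark stripping. The table ASCII_FALLBACKS and the function to_ascii are hypothetical, and the transliterations chosen (þ to th, æ to ae, and so on) are assumptions; the right replacements depend on the language of the names and on what the consumer of the data expects.

```python
import unicodedata

# Hypothetical transliteration table; these choices are illustrative only.
ASCII_FALLBACKS = str.maketrans({
    "\u00fe": "th",  # þ thorn
    "\u00de": "Th",  # Þ capital thorn
    "\u00f0": "d",   # ð eth
    "\u00e6": "ae",  # æ ligature
    "\u00c6": "AE",  # Æ ligature
    "\u00f8": "o",   # ø o with stroke
    "\u00d8": "O",   # Ø capital o with stroke
    "\u0142": "l",   # ł l with stroke
    "\u00df": "ss",  # ß sharp s
})

def to_ascii(text: str) -> str:
    """Apply the explicit mappings, then decompose and drop nonspacing marks."""
    mapped = text.translate(ASCII_FALLBACKS)
    decomposed = unicodedata.normalize("NFKD", mapped)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
```

With this table, to_ascii("S\u00f8ren") gives "Soren" and to_ascii("\u00de\u00f3r\u00f0ur") gives "Thordur". Anything that is neither in the table nor decomposable passes through untouched, so the result is only guaranteed to be ASCII if the table covers every such character in the data.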

Are you really compelled to deliver seven bit ASCII?

And, please, some examples.

-- Graydon

Current Thread

Re: special character encoding, two problems, (continued)
- Jonina Dames jdames@xxxxxxxxx - 23 Oct 2014 20:39:00 -0000
  - Graydon graydon@xxxxxxxxx - 23 Oct 2014 21:13:45 -0000 <=
    - Eliot Kimber ekimber@xxxxxxxxxxxx - 24 Oct 2014 13:11:37 -0000
    - Jonina Dames jdames@xxxxxxxxx - 24 Oct 2014 16:27:05 -0000
    - Michael Kay mike@xxxxxxxxxxxx - 24 Oct 2014 16:54:33 -0000
    - Jonina Dames jdames@xxxxxxxxx - 24 Oct 2014 17:10:43 -0000
