Re: special character encoding, two problems
On Thu, Oct 23, 2014 at 08:39:11PM -0000, Jonina Dames jdames@xxxxxxxxx scripsit:
> Thanks for the advice! The <xsl:value-of
> select="normalize-unicode(replace(normalize-unicode(.,'NFKD'),'\p{Mn}',''),'NFKC')"
> /> function works for most of the entities, but it's still missing a
> couple dozen characters.

Terminology pedant time -- &#xE9; is a numeric character reference and exactly the same thing as é, just written differently. &eacute; is a named entity reference (which had better be defined somewhere). Either one, as soon as the XML document is parsed, turns into U+00E9 in some internal representation; at that point they're no different from each other, or from the representation of é if someone had typed that directly into the UTF-8 input file.

So when you say "entity" here I'm getting the nervous feeling that I don't know what you mean; can you provide some examples?

> Some of the author names still have unicode entities instead of plain
> ascii (for example, several characters with a stroke, several
> ligatures, thorn characters, upper and lowercase). Is there a

Well, examples would be good, but thorn, for example, &#xFE;, which is the self-same code point as þ, doesn't involve a modifier; it's one whole letter that doesn't exist in ASCII. Stripping the modifiers -- which will give you e from é if you decompose é first, because then it's e + U+0301 (the combining acute accent), which you could write e + &#x301; and it would be the same -- doesn't do anything here because there is no modifier, just the single code point for thorn.

> variation of this function or a parameter that will catch and convert
> ALL of these to plain ascii, as well as the standard acute and cedil
> characters? Or do I need to address these outlying characters with
> something else (not translate, since I can't use a one-to-one
> replacement for ligature entities)?

ASCII, strictly, is seven-bit; there are lots of things you can't represent in ASCII. &eacute; *is not* ASCII just because those eight characters all happen to be ASCII characters.

So it sounds like you're trying to (either) map U+00FE, þ, to &thorn; or something like that (which is not, I cannot stress too much, ASCII; it might be an ASCII representation of a non-ASCII code point, but it's still a non-ASCII code point) or have þ decompose into t+h or something of that ilk. (Which is at least actually ASCII.)

Either way you'd have to use character mappings for those; there aren't any modifiers to remove.

Are you really compelled to deliver seven-bit ASCII?

And, please, some examples.

-- Graydon
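
As a rough illustration of the two-step approach discussed above (strip combining marks, then map whatever is left explicitly), here is a minimal Python sketch. Python is used only so the example is easy to run. `strip_marks` mirrors the XPath expression from the quoted message. The `EXPLICIT_MAP` table, the function names, and the particular transliterations (þ to "th", ø to "o", and so on) are assumptions made for illustration, not anything the thread settled on.

```python
import unicodedata

# Hypothetical mapping for letters that have no Unicode decomposition.
# The chosen ASCII spellings are illustrative assumptions only.
EXPLICIT_MAP = {
    "\u00fe": "th",  # þ  LATIN SMALL LETTER THORN
    "\u00de": "Th",  # Þ  LATIN CAPITAL LETTER THORN
    "\u00f8": "o",   # ø  o with stroke
    "\u00d8": "O",   # Ø  O with stroke
    "\u0142": "l",   # ł  l with stroke
    "\u0141": "L",   # Ł  L with stroke
    "\u00e6": "ae",  # æ  ligature
    "\u00c6": "AE",  # Æ  ligature
    "\u0153": "oe",  # œ  ligature
    "\u0152": "OE",  # Œ  ligature
}


def strip_marks(text):
    """Decompose (NFKD), drop nonspacing marks (category Mn), recompose (NFKC)."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return unicodedata.normalize("NFKC", without_marks)


def to_ascii(text):
    """Strip marks, then apply the explicit table to what is left.

    Characters that are neither decomposable nor listed in EXPLICIT_MAP
    pass through unchanged, so the result is only ASCII if the table
    covers every such character in the input.
    """
    stripped = strip_marks(text)
    return "".join(EXPLICIT_MAP.get(ch, ch) for ch in stripped)


if __name__ == "__main__":
    print(strip_marks("\u00e9"))   # é decomposes to e + U+0301, so the mark is dropped
    print(strip_marks("\u00fe"))   # þ has no decomposition, so it survives
    print(to_ascii("Bj\u00f8rn \u00de\u00f3r"))
```

In XSLT 2.0 itself, the counterpart of the explicit table would presumably be an `xsl:character-map` applied at serialization, or a chain of `replace()` calls, since `translate()` can only do one-to-one replacements.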