On Sun, Jul 03, 2016 at 04:13:09PM -0000, Terry Badger
terry_badger@xxxxxxxxx scripsit:
> Graydon, The document.xml I have found and worked with taken from a
> .docx file always have a prolog that has encoding="UTF-8" so I have
> not worried about invalid Unicode characters and can process any text
> in Word using an xsl stylesheet. Do you have a sample where a docx
> file has non Unicode encodings?
Not on hand, and if I did, it wouldn't be my data to share.
I've hit two cases of code point 0x96 -- the codepage 1252 en dash -- in an
XSLT document (which is admittedly not Word) during paid work in the
last couple of weeks, though. It does happen. It won't cause problems
until something checks for UTF-8 encoding specifically, rather than for
the XML character set. It's entirely possible to have the whole XSLT
toolchain completely happy -- as it was in that case -- and something
downstream that checks the encoding not happy at all. I have
certainly hit this problem with the XML versions of Office documents in
the past.
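One way to catch this before a downstream check does is to scan the decoded text for code points in the C1 control range (0x80 to 0x9F), which is where stray codepage 1252 values such as 0x96 end up. The following is only a sketch in Python: the helper names `find_c1_controls` and `repair_cp1252` are made up for illustration, and treating every C1 code point as a misplaced codepage 1252 character is an assumption that will not hold for all data.

```python
# Hypothetical helpers, not part of any library or of the toolchain discussed above.
C1_RANGE = range(0x80, 0xA0)  # C1 control code points, U+0080 through U+009F


def find_c1_controls(text):
    """Return (offset, code point) pairs for every C1 control character in text."""
    return [(i, ord(ch)) for i, ch in enumerate(text) if ord(ch) in C1_RANGE]


def repair_cp1252(text):
    """Reinterpret C1 code points as the codepage 1252 characters they probably were.

    Assumption: a C1 control in the text is a cp1252 byte that was carried over
    as a code point of the same value. Code points with no cp1252 mapping are kept.
    """
    out = []
    for ch in text:
        if ord(ch) in C1_RANGE:
            try:
                out.append(bytes([ord(ch)]).decode("cp1252"))
            except UnicodeDecodeError:
                out.append(ch)  # e.g. 0x81 is undefined in Python's cp1252 codec
        else:
            out.append(ch)
    return "".join(out)


sample = "pages 10\x9620"
print(find_c1_controls(sample))      # offsets and code points of suspect characters
print(repair_cp1252(sample) == "pages 10\u201320")
```

Whether to repair or merely report is a judgment call; reporting is the safer default when the data is not yours to change.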
Before the fifth edition of XML 1.0, it was possible to trust the parser
to tell you if your document wasn't UTF-8, because XML's character set was
a subset of UTF-8. With the fifth edition, that's no longer the case.
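To see what a parser does and does not catch, here is a small sketch in Python using the standard library's `xml.etree.ElementTree` (which sits on expat). It only illustrates one parser's behaviour and says nothing about differences between XML editions. The `parses` helper is a name invented for this example.

```python
import xml.etree.ElementTree as ET


def parses(doc_bytes):
    """Return True if the parser accepts doc_bytes as well-formed XML."""
    try:
        ET.fromstring(doc_bytes)
        return True
    except ET.ParseError:
        return False


header = b'<?xml version="1.0" encoding="UTF-8"?>'

# A raw 0x96 byte is not valid UTF-8, so the parser should reject this one.
raw_byte = header + b"<p>1\x962</p>"

# The same value as a character reference is a legal XML character,
# so the parser should accept it and hand U+0096 downstream.
char_ref = header + b"<p>1&#150;2</p>"

print(parses(raw_byte))
print(parses(char_ref))
print(ascii(ET.fromstring(char_ref).text))
```

The second document is the kind that keeps an XSLT toolchain happy while still carrying a codepage 1252 value that a stricter downstream check may object to.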
-- Graydon
Graydon graydon@xxxxxxxxx - 3 Jul 2016 21:02:00 -0000