Re: Detection of non-Unicode characters

To: Matt Gushee <mgushee@h...>,xml-dev@l...
Subject: Re: Detection of non-Unicode characters
From: Ann Navarro <ann@w...>
Date: Mon, 26 Aug 2002 10:05:13 -0400
In-reply-to: <20020823223630.GA440@s...>
References: <3D66A823.2060109@t...><4DBDB4044ABED31183C000508BA0E97F040ABF38@f...><3D66A823.2060109@t...>

At 04:36 PM 8/23/2002 -0600, Matt Gushee wrote:

>I would bet it's this. Just this past week I have been debugging a
>broken application that is supposed to generate XML from Word documents.
>The main problem I found was that the Word documents are full of
>characters like 0x07, 0x2012-0x2019, and the like. The latter range
>consists of common punctuation symbols like dashes and left and right
>quotes (AKA 'smart quotes'). They appear to be using Code Page 1252
>mapped directly into Unicode.

I just ran into this myself, with a styled apostrophe character. Only XML
Spy 4.4 reported it as a problem, and only upon opening the 1.2MB XML file
(character was: Â (0xC2), ' (0x92)).

All three validators I have (Xerces standalone, XMetal 3.0, and XML Spy
4.4) reported the file valid, but it was failing upon import into a content
management system (with the less-than-helpful error of "no root element
present", when there clearly was one).

A tool that would quickly locate these kinds of things would be enormously 
helpful (I'd certainly buy a copy if it were commercial/shareware).
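
A minimal sketch of what such a checker might look like, in Python. It is an illustration only, not an existing tool: the names `is_xml10_char` and `find_suspect_chars` are made up here, and it assumes the file is meant to be UTF-8. The byte pair 0xC2 0x92 is the UTF-8 encoding of U+0092, a C1 control character, which is what typically results when the Windows-1252 byte 0x92 (the right single quote, U+2019) is read as ISO-8859-1 and then re-encoded. C1 controls are legal in XML 1.0, which may be why validators accept the file, so the sketch flags them separately from characters XML 1.0 forbids outright (such as 0x07).

```python
import sys


def is_xml10_char(cp: int) -> bool:
    """Return True if code point cp matches the XML 1.0 `Char` production."""
    return (
        cp in (0x09, 0x0A, 0x0D)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def find_suspect_chars(data: bytes):
    """Return a list of (line, column, description) for suspicious characters.

    Assumption: the input is meant to be UTF-8. Columns count characters,
    not bytes.
    """
    findings = []
    # Invalid byte sequences become U+FFFD instead of raising an exception.
    text = data.decode("utf-8", errors="replace")
    # Split on "\n" only; str.splitlines() would also split on U+0085 etc.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            cp = ord(ch)
            if cp == 0xFFFD:
                findings.append((line_no, col,
                                 "U+FFFD: invalid UTF-8 bytes (or a literal "
                                 "replacement character)"))
            elif not is_xml10_char(cp):
                findings.append((line_no, col,
                                 f"U+{cp:04X}: not allowed in XML 1.0"))
            elif 0x80 <= cp <= 0x9F:
                # C1 control: legal in XML 1.0 but almost never intended.
                # Show what the same byte value means in Windows-1252.
                guess = bytes([cp]).decode("cp1252", errors="replace")
                findings.append((line_no, col,
                                 f"U+{cp:04X}: C1 control, possibly "
                                 f"Windows-1252 {guess!r}"))
    return findings


if __name__ == "__main__":
    # Usage: pass one or more file paths on the command line.
    for path in sys.argv[1:]:
        with open(path, "rb") as fh:
            for line_no, col, what in find_suspect_chars(fh.read()):
                print(f"{path}:{line_no}:{col}: {what}")
```

Saved under a hypothetical name such as `find_suspect.py`, it would be run as `python find_suspect.py file.xml` and print one `path:line:column` entry per suspect character. It reads the whole file into memory, which should be fine for a file of about 1.2MB.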

Ann
-----
Ann Navarro, WebGeek, Inc.
http://www.webgeek.com

say what? http://www.snorf.net/blog

Follow-Ups:
- Re: Detection of non-Unicode characters
  - From: "Rick Jelliffe" <ricko@a...>

References:
- Re: Detection of non-Unicode characters
  - From: Tim Bray <tbray@t...>
- Detection of non-Unicode characters
  - From: Mark Feblowitz <mfeblowitz@f...>
- Re: Detection of non-Unicode characters
  - From: Matt Gushee <mgushee@h...>
