Re: Detection of non-Unicode characters


To: <xml-dev@l...>
Subject: Re: Detection of non-Unicode characters
From: "Rick Jelliffe" <ricko@a...>
Date: Sat, 24 Aug 2002 07:29:24 +1000
References: <4DBDB4044ABED31183C000508BA0E97F040ABF38@f...>

From: "Mark Feblowitz" <mfeblowitz@f...>

> We've gotten ourselves in a slight muddle. We've copied Word documentation
> into (many) xs:annotation blocks in our UTF-8 .xsd files (there are around
> 300 files). In the process, we have apparently brought along some
> non-Unicode characters. This is not tolerated equally well by all tools.
> 
> Is there a convenient means of scanning .xsd files to locate non-Unicode
> characters? I'm looking for something like a Windows command line filter.
> 
> Any idea where I can find such a beast?

If you have a programmer on tap, you would probably be better off
writing a quick C (or Python or Perl) program to do this.

It is a state machine with two states S1 and S2 and a transition
T1 from S1 to S2, and a transition T2 from S2 to S1.

In S1, read a byte in, write it out unchanged, and append it to a
byte buffer. When the buffer ends with "<xs:annotation", take T1
and clear the byte buffer.

In S2, read a byte in, translate it to UTF-8, write out the
resulting bytes, and append the original byte to the byte buffer.
When the buffer ends with "</xs:annotation", take T2 and clear the
byte buffer.
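
A minimal Python sketch of that two-state scan follows. It is not
a tested tool: the function name, the default source encoding
(cp1252) and the use of bytes.find() to locate the markers, rather
than a byte-at-a-time loop with a buffer, are all my assumptions.
It also assumes the "xs:" prefix, no empty <xs:annotation/>
elements, and that everything inside an annotation is in the legacy
encoding (text there that is already UTF-8 would get double-encoded).

```python
OPEN = b"<xs:annotation"
CLOSE = b"</xs:annotation"

def transcode_annotations(data: bytes, source_encoding: str = "cp1252") -> bytes:
    """Copy bytes outside xs:annotation unchanged; re-encode bytes inside as UTF-8."""
    out = bytearray()
    pos = 0
    inside = False  # False = S1 (outside), True = S2 (inside an annotation)
    while pos < len(data):
        marker = CLOSE if inside else OPEN
        hit = data.find(marker, pos)
        end = len(data) if hit == -1 else hit
        chunk = data[pos:end]
        if inside:
            # S2: treat the bytes as legacy single-byte text and re-encode.
            # Bytes undefined in the source encoding become U+FFFD.
            chunk = chunk.decode(source_encoding, errors="replace").encode("utf-8")
        out += chunk
        if hit == -1:
            break
        out += marker              # the marker itself is plain ASCII
        pos = hit + len(marker)
        inside = not inside        # transition T1 or T2
    return bytes(out)
```

Run it on the raw bytes of a copy of each file (opened in binary
mode), never on the only copy.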

In all probability (unless you have East Asian annotations, or
UTF-16 annotations) your bogus text is encoded in ISO 8859-1 or
MacRoman or CP1252, which are all single-byte encodings.  So that
is quite easy.
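
To see why it matters which of these single-byte encodings you
pick, here is a small Python sketch that decodes the same byte
under each candidate. The function name and the candidate list are
mine; the codec names are the ones Python's standard library uses.

```python
CANDIDATES = ("cp1252", "latin-1", "mac_roman")

def decode_candidates(raw: bytes) -> dict:
    """Decode the same bytes under each candidate single-byte encoding."""
    # errors="replace" keeps going if a byte is undefined in one encoding.
    return {name: raw.decode(name, errors="replace") for name in CANDIDATES}

# 0x93 is the opening curly quote that Word text typically carries in CP1252.
for name, text in decode_candidates(b"\x93").items():
    print(f"{name:10} U+{ord(text):04X}")
```

If only one of the candidates gives sensible text for the bytes in
your annotations, that is probably the encoding to translate from.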

But before doing this, confirm the encodings used in
the XML document and the Word fragments. Under no circumstances
try to read the document in as XML, because it will surely
corrupt the data further, and you may not be able to go back.
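
One way to check, without parsing anything as XML, is to scan the
raw bytes for positions that are not valid UTF-8. A Python sketch
(the function name is mine, and it re-slices the data after each
hit, which is fine for schema-sized files but not for huge ones):

```python
def find_invalid_utf8(data: bytes):
    """Yield (line_number, byte_offset, offending_byte) for each undecodable position."""
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            return                       # the rest of the data is valid UTF-8
        except UnicodeDecodeError as err:
            bad = pos + err.start        # err.start is relative to the slice
            line = data.count(b"\n", 0, bad) + 1
            yield line, bad, data[bad]
            pos = bad + 1                # resume just after the bad byte

if __name__ == "__main__":
    import sys
    for path in sys.argv[1:]:
        with open(path, "rb") as f:      # binary mode: no transcoding on read
            for line, offset, byte in find_invalid_utf8(f.read()):
                print(f"{path}: line {line}, offset {offset}, byte 0x{byte:02X}")
```

Saved under some name of your choosing, it can be run from a
Windows command line against a list of .xsd files.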

If you use Java, read everything as bytes, not as characters,
because reading in the characters will cause transcoding
and therefore corruption.

If the data is already sitting in a data structure in a program, then
serialize it out so that each xs:annotation is in a different
entity with the appropriate encoding header.  External entities
can all be in different encodings.  Then just parse
the document as normal XML and the parser will take care
of this for you.  XML already provides these facilities to
cope. If your XML parser does not handle external entity
references properly, get rid of it and switch to professional
quality tools.
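
As a rough Python sketch of that serialization step: each entity
file begins with a text declaration naming its own encoding, and
the main document declares the entities in its DOCTYPE. The
function names, entity names and file names are hypothetical, and
the annotation text must already be well-formed (any "&" and "<"
in the content escaped).

```python
def entity_file(raw_text: bytes, encoding: str = "windows-1252") -> bytes:
    """Prefix legacy-encoded text with an XML text declaration naming its encoding."""
    decl = f'<?xml version="1.0" encoding="{encoding}"?>'.encode("ascii")
    return decl + raw_text

def doctype_for(entities: dict) -> str:
    """Build a DOCTYPE whose internal subset declares one external entity per file."""
    decls = "\n".join(
        f'  <!ENTITY {name} SYSTEM "{path}">' for name, path in entities.items()
    )
    return f"<!DOCTYPE xs:schema [\n{decls}\n]>"

# Hypothetical usage: one annotation goes to ann1.ent, and the schema
# refers to it as &ann1; where the annotation used to be.
print(doctype_for({"ann1": "ann1.ent"}))
```

The parser then reads each entity in the encoding its declaration
names, provided it supports that encoding and actually resolves
external entities.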

Finally, if you don't have a programmer on tap, then use
the same tool you used to plonk the Word documentation
into the xs:annotations and cut and paste them into their
own entities, with the correct encoding headers.  This
is tedious but low-tech. 

Cheers
Rick Jelliffe

References:
- Detection of non-Unicode characters
  - From: Mark Feblowitz <mfeblowitz@f...>
