Re: Detection of non-Unicode characters
From: "Mark Feblowitz" <mfeblowitz@f...>

> We've gotten ourselves in a slight muddle. We've copied Word documentation
> into (many) xs:annotation blocks in our UTF-8 .xsd files (there are around
> 300 files). In the process, we have apparently brought along some
> non-Unicode characters. This is not tolerated equally well by all tools.
>
> Is there a convenient means of scanning .xsd files to locate non-Unicode
> characters? I'm looking for something like a Windows command line filter.
>
> Any idea where I can find such a beast?

If you have a programmer on tap, you would probably be better off writing a quick C (or Python or Perl) program to do this.

It is a state machine with two states, S1 (outside an annotation) and S2 (inside one), a transition T1 from S1 to S2, and a transition T2 from S2 to S1. In S1, read a byte in, write it out unchanged, and append it to a byte buffer that you use to spot the start tag. When you find "<xs:annotation", go T1 and clear the byte buffer. In S2, read a byte in, translate it to UTF-8, then write out the resulting bytes. When you find "</xs:annotation", go T2 and clear the byte buffer.

In all probability (unless you have East Asian annotations, or UTF-16 annotations) your bogus text is encoded in 8859-1 or MacRoman or CP1252, which are all single-byte encodings. So that is quite easy. But before doing this, confirm the encodings used in the XML document and the Word fragments.

Under no circumstances try to read the document in as XML, because it will surely corrupt the data further, and you may not be able to go back. If you use Java, read everything as bytes, not as characters, because reading in characters will cause transcoding and therefore corruption.

If the data is already sitting in a data structure in a program, then serialize it out so that each xs:annotation is in a different entity with the appropriate encoding header. External entities can all be in different encodings. Then just parse the document as normal XML and the parser will take care of this for you. XML already provides these facilities to cope with mixed encodings.
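
The two-state filter described above might look roughly like this in Python. It is a sketch under stated assumptions, not a finished tool: the function name `transcode_annotations`, the literal `xs:` prefix, and the default source encoding `cp1252` are all assumptions, and it locates the tag markers with `bytes.find` instead of reading one byte at a time. As a precaution that is not part of the description above, an annotation that already decodes as UTF-8 is left untouched so that it is not double-encoded.

```python
OPEN = b"<xs:annotation"    # assumes the schema prefix is literally "xs:"
CLOSE = b"</xs:annotation"


def _to_utf8(chunk, source_encoding):
    """Return chunk as UTF-8 bytes, leaving already-valid UTF-8 alone."""
    try:
        chunk.decode("utf-8")
        return chunk  # valid UTF-8 (this includes plain ASCII)
    except UnicodeDecodeError:
        # Bytes undefined in the source encoding become U+FFFD.
        return chunk.decode(source_encoding, errors="replace").encode("utf-8")


def transcode_annotations(data, source_encoding="cp1252"):
    """Copy bytes outside xs:annotation verbatim; re-encode bytes inside it."""
    out = bytearray()
    pos = 0
    inside = False  # False is state S1 (outside), True is state S2 (inside)
    while pos < len(data):
        marker = CLOSE if inside else OPEN
        nxt = data.find(marker, pos)
        end = len(data) if nxt == -1 else nxt
        chunk = data[pos:end]
        out += _to_utf8(chunk, source_encoding) if inside else chunk
        if nxt == -1:
            break
        out += marker             # the markers themselves are ASCII
        pos = nxt + len(marker)
        inside = not inside       # transition T1 or T2
    return bytes(out)
```

To use it, read each file with `open(path, "rb")`, pass the bytes through `transcode_annotations`, and write the result to a copy in binary mode, keeping the originals until the output has been checked.
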
If your XML parser does not handle external entity references properly, get rid of it and switch to professional quality tools.

Finally, if you don't have a programmer on tap, then use the same tool you used to plonk the Word documentation into the xs:annotations, and cut and paste them into their own entities, with the correct encoding headers. This is tedious but low-tech.

Cheers
Rick Jelliffe
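
For the original question of locating the offending bytes, and as an aid to confirming the encodings before any conversion, a byte-level scan is enough: in a UTF-8 file, the "non-Unicode characters" are byte sequences that are not valid UTF-8. The following Python sketch reports them without ever parsing the files as XML. The names `find_invalid_utf8` and `scan_tree` and the error-handler name are hypothetical, chosen only for illustration.

```python
import codecs
from pathlib import Path


def find_invalid_utf8(data):
    """Return (offset, line, bad_bytes) for each byte run that is not valid UTF-8."""
    hits = []

    def record(err):
        # err.start and err.end delimit the undecodable bytes within data.
        line = data.count(b"\n", 0, err.start) + 1
        hits.append((err.start, line, data[err.start:err.end]))
        return ("\ufffd", err.end)  # substitute, then resume after the bad bytes

    codecs.register_error("record_invalid_utf8", record)
    data.decode("utf-8", errors="record_invalid_utf8")
    return hits


def scan_tree(root, pattern="*.xsd"):
    """Print every invalid byte run in files matching pattern under root."""
    for path in sorted(Path(root).rglob(pattern)):
        raw = path.read_bytes()  # bytes only: never open these files in text mode
        for offset, line, bad in find_invalid_utf8(raw):
            print(f"{path}: line {line}, byte offset {offset}: {bad!r}")
```

Calling `scan_tree` on the directory that holds the schemas lists each suspect byte with its file and line. Bytes in the range 0x80 to 0x9F, such as 0x93 and 0x94 for curly quotes, are a strong hint that the pasted text is CP1252.
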