Re: Need to remove unusual character in source

Subject: Re: Need to remove unusual character in source
From: Abel Braaksma <abel.online@xxxxxxxxx>
Date: Wed, 27 Sep 2006 01:02:52 +0200
Mario Madunic wrote:
the character is and it's a control character

0x18 CAN

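Unfortunately, that says it all. XML 1.0 does not allow most control characters, whatever the encoding: below x20, only x09 (tab), x0A (line feed) and x0D (carriage return) are legal. UTF-8 itself can encode x18 without trouble, but a document that contains it is not well-formed XML.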


the error message I receive is
SXXP0003: Error reported by XML parser: Illegal XML character: &#x18;.

This is indeed illegal. The other day I accidentally used &#x08;, which is also illegal (I had it mistaken for a tab character, x09, which *is* legal).
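
To see in advance which characters a parser will reject, a quick check along these lines may help. It is only a sketch: the constant and method names are made up for illustration, and it assumes the input is ASCII or UTF-8 text.

```ruby
# Hypothetical helper, not part of any library.
# XML 1.0 permits only x09, x0A and x0D below x20, so everything
# else in that range is reported as illegal.
ILLEGAL_XML_CONTROL = /[\x00-\x08\x0B\x0C\x0E-\x1F]/

def illegal_xml_chars(text)
  # scan returns every match; show each one as a readable hex code
  text.scan(ILLEGAL_XML_CONTROL).map { |c| format('x%02X', c.ord) }.uniq
end

p illegal_xml_chars("ok\ttab\x18 and \x08")   # => ["x18", "x08"]
```

The tab in the sample string is not reported, because x09 is one of the three legal control characters.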


I've tried using ANT to clean it out but with no luck using native2ascii or
escapeunicode

That won't help either: escaping these characters does not make them legal, because in XML 1.0 a character reference such as &#x18; is just as illegal as the raw character. But you are on the right track: use a filter to remove this character, or replace it with something useful. I use such a filter to read Microsoft *.msg files, which contain some useful lines, while the rest is control characters and other illegal data. Here's what it might look like if you resort to Ruby (you can call it from Ant if you like), see www.ruby-lang.org.


(spoiler warning: this is off-topic and only marginally related to xslt)


# create working dir
Dir.mkdir('trimmed') unless File.exist?('trimmed')

Dir.entries(".").each do |fn|
  if fn =~ /\.yourextension\z/
    # read complete file contents in binary mode
    contents = File.open(fn, 'rb') { |file| file.read }
    # write everything except the x18 characters;
    # print (not puts) so no newline is added where x18 was
    File.open("trimmed/#{fn}.txt", 'wb') do |newfile|
      contents.scan(/[^\x18]+/m) do |found|
        newfile.print(found)
      end
    end
  end
end



Just replace "yourextension" with the extension of your files and "trimmed" with an output directory name of your choice. Replace ".txt" with whatever extension you like. The script runs through the current directory and copies all matching files to the "trimmed" directory, with one change: the x18 character is removed.
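
If the data may contain other stray control characters besides x18, the same idea can be widened to everything XML 1.0 forbids below x20. A minimal sketch, assuming the text is ASCII or UTF-8 and fits in memory; `strip_illegal_xml_chars` is a name made up here, not an existing library method:

```ruby
# Hypothetical helper: removes every control character below x20
# except tab (x09), line feed (x0A) and carriage return (x0D),
# which are the only ones XML 1.0 allows in that range.
def strip_illegal_xml_chars(text)
  text.gsub(/[\x00-\x08\x0B\x0C\x0E-\x1F]/, '')
end

p strip_illegal_xml_chars("a\x18b\tc")   # => "ab\tc"
```

Whether to delete such characters or replace them with something visible (a space, say) depends on what they meant in the client's data, so it is worth checking a few samples first.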


Of course, you can use Perl, a DOS Batch file (takes some practice), Bash, VBScript, PHP, Grep, Awk or any other tool you'd prefer.

HTH,

Cheers,
Abel Braaksma
http://abelleba.metacarpus.com



Can this be done or do I need to ask the client to remove it from their data,
which might not be an option?

Any help or insight would be greatly appreciated.

Marijan Madunic
