converting character entities to us-ascii /equivalents/

To: XML Developers List <xml-dev@l...>
Subject: converting character entities to us-ascii /equivalents/
From: Robert Koberg <rob@k...>
Date: Wed, 06 Oct 2004 14:55:58 -0700
User-agent: Mozilla Thunderbird 0.7 (Macintosh/20040616)

Hi,

I need to output several versions of a page (through XSL 
transformations), one of which is us-ascii (for email). But, the content 
might contain some characters that are not supported by us-ascii (like 
em dash - &#151;).

I want the character entities to remain in the content. When 
transforming to us-ascii, I want to replace the entities with some ascii 
text equivalent: For example, '&#151;' would get converted to '--'.
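
The replacement described here can be sketched as a plain lookup from 
non-ASCII characters to ASCII text. The class name AsciiFallback, the 
method toAscii, and the table entries are all hypothetical and only 
illustrate the idea; the table is not a complete list. One assumption is 
worth flagging: in XML, &#151; refers to U+0097 (a C1 control character, 
which is an em dash only in the Windows-1252 byte encoding), whereas the 
Unicode em dash is &#8212; (U+2014), so the sketch maps both to '--'.

```java
import java.util.HashMap;
import java.util.Map;

public class AsciiFallback {
    // Hypothetical, illustrative table of ASCII equivalents (not exhaustive).
    private static final Map<Character, String> REPLACEMENTS = new HashMap<>();
    static {
        REPLACEMENTS.put('\u2014', "--");                     // em dash
        REPLACEMENTS.put('\u0097', "--");                     // what &#151; actually denotes in XML
        REPLACEMENTS.put('\u2013', "-");                      // en dash
        REPLACEMENTS.put('\u00AE', "(Registered Trademark)"); // registered sign
        REPLACEMENTS.put('\u00A9', "(c)");                    // copyright sign
    }

    /** Returns a copy of input in which every non-ASCII char is replaced. */
    public static String toAscii(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c < 128) {
                out.append(c);                        // already us-ascii
            } else {
                String replacement = REPLACEMENTS.get(c);
                // Unmapped chars fall back to '?' (one '?' per UTF-16 char).
                out.append(replacement != null ? replacement : "?");
            }
        }
        return out.toString();
    }
}
```

For instance, AsciiFallback.toAscii("a\u2014b") yields "a--b". The same 
lookup could equally live in the stylesheet or in a SAX filter; where to 
apply it is a separate design choice.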

The XML is pulled into the transformation through the document function 
using a custom URIResolver.

Is there an existing solution to this?

Do Apache's FOP and its text renderer handle this type of thing?

I have tried to set a ContentHandler (actually a DefaultHandler) on the 
XMLReader and tried to replace a character entity, but I am doing 
something wrong and am confused about how to proceed. Using the code 
below, I get a recoverable error with saxon/aelfred and a failure with 
saxon/xerces.

Here is a snippet from the URIResolver:


InputSource in = new InputSource(file.getAbsolutePath());
SAXSource source = new SAXSource(in);
XMLReader reader = null;
try {
   reader = XMLReaderFactory.createXMLReader("com.icl.saxon.aelfred.SAXDriver");
   //reader = XMLReaderFactory.createXMLReader("org.apache.xerces.parsers.SAXParser");
} catch (SAXException e) {
   System.err.println(e.getMessage());
}

reader.setContentHandler(new AsciiHandler());

source.setXMLReader(reader);

return source;



And the DefaultHandler has one method:


public void characters(char[] text, int start, int length) {

   String str = new String(text, start, length);
   if (str.indexOf(174) > -1) {
    str.replaceAll("\u00AE", "(Registered Trademark)");
   }
   text = str.toCharArray();
}

How can I do this? Is there a better way to handle this type of thing?

thanks,
-Rob
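
A possible direction, offered as a sketch and not as the answer given in 
the thread: the characters() method above cannot have any effect, because 
String.replaceAll returns a new string (the result is discarded) and 
assigning to the parameter 'text' changes only the local variable. In 
addition, a ContentHandler set directly on the XMLReader is likely to be 
replaced when the transformer installs its own handler before parsing. 
An org.xml.sax.helpers.XMLFilterImpl placed between the parser and the 
transformer avoids both problems by forwarding modified events 
downstream. The class name AsciiFilter, the demo method, and the choice 
of replacements are hypothetical; the JAXP default parser stands in for 
aelfred or xerces, and only text nodes (not attribute values) are 
handled.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

public class AsciiFilter extends XMLFilterImpl {

    public AsciiFilter(XMLReader parent) {
        super(parent);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        StringBuilder out = new StringBuilder(length);
        for (int i = start; i < start + length; i++) {
            char c = ch[i];
            switch (c) {
                case '\u2014': out.append("--"); break;                     // em dash
                case '\u00AE': out.append("(Registered Trademark)"); break; // registered sign
                default:       out.append(c < 128 ? c : '?');               // illustrative fallback
            }
        }
        // Forward a NEW buffer downstream; the parser's buffer is left untouched.
        char[] replaced = out.toString().toCharArray();
        super.characters(replaced, 0, replaced.length);
    }

    /** Hypothetical helper: parses xml through the filter and returns the text seen downstream. */
    public static String demo(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader parser = factory.newSAXParser().getXMLReader();
            AsciiFilter filter = new AsciiFilter(parser);
            final StringBuilder seen = new StringBuilder();
            filter.setContentHandler(new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    seen.append(ch, start, length);
                }
            });
            filter.parse(new InputSource(new StringReader(xml)));
            return seen.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

In the URIResolver, the filter would take the place of the bare reader, 
for example: return new SAXSource(new AsciiFilter(reader), in); the 
transformer then sets its own ContentHandler on the filter and receives 
the already-converted text.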

Follow-Ups:
- Re: converting character entities to us-ascii /equivalents/
  - From: Alexander Savenkov <savenkov@x...>
- RE: converting character entities to us-ascii /equivalents/
  - From: "Michael Kay" <michael.h.kay@n...>
