
Re: A simple guy with a simple problem

  • From: Joe English <jenglish@f...>
  • To: xml-dev@l...
  • Date: Wed, 14 Mar 2001 09:46:24 -0800

Sean McGrath, er, "Bob" wrote:

> Hello, my name is Bob and I'm a programmer.

Hi Bob!

> I work for a B2B company. My task today is to process
> incoming XML documents that are known to be valid against the foo
> DTD and change all occurrences of the word "STUFF" to "stuff".
> I need to leave the documents otherwise unchanged in all material
> respects as they are going on to a third company in a B2B chain.

The requirements are a bit fuzzy: for example,
if the input contains "ASDFSTUFFQWERTY", does that
count as an occurrence of the word "STUFF"?
Also, the precise definition of "all material respects" 
is unclear (I gather that this is the real question).

At any rate, your best bet is to use sed (or the moral equivalent):

    sed -e 's/\<STUFF\>/stuff/g'

if STUFF should only be matched as a complete word, or

    sed -e 's/STUFF/stuff/g'

if the character sequence 'STUFF' should be matched anywhere.
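
For anyone without sed handy, here is a rough Python rendering of the same two choices. It is only a sketch: the function names are made up for illustration, and Python's \b is an approximation of the \< and \> word-boundary operators (a GNU sed extension), not an exact equivalent.

```python
import re


def replace_whole_word(text: str) -> str:
    # \b matches at a word boundary, so "ASDFSTUFFQWERTY" is left alone.
    return re.sub(r"\bSTUFF\b", "stuff", text)


def replace_anywhere(text: str) -> str:
    # Plain substring replacement: every "STUFF" is rewritten.
    return text.replace("STUFF", "stuff")


sample = "STUFF and ASDFSTUFFQWERTY"
print(replace_whole_word(sample))  # stuff and ASDFSTUFFQWERTY
print(replace_anywhere(sample))    # stuff and ASDFstuffQWERTY
```

Like the sed commands, both functions are blind to markup; they treat the document as a flat string.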

This is guaranteed not to disturb any of the markup [*],
since fortunately the DTD:

>          <!ELEMENT foo (lit)*>
>          <!ELEMENT lit (#PCDATA)>
>          <!ATTLIST lit text CDATA "STUFF">

doesn't use "STUFF" as an element or attribute name.
If it did, you'd have a harder task.

[*] Actually, this is a lie: it will break if the document starts
with <!DOCTYPE foo SYSTEM "http://www.baz.com/STUFF/...">, or
if it uses an internal general entity named "stuff"
and another named "STUFF".  (If it *only* contains one
called "STUFF", it's unclear whether renaming this
to "stuff" constitutes an unacceptable material change.)
There may be other corner cases as well.
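
To make the footnote concrete, here is a small Python sketch of what blind substitution does to such a document. The example.com URL, the entity values, and the sample document are all invented for illustration.

```python
# A hypothetical document that hits both corner cases from the footnote:
# "STUFF" in the system identifier, and two entities differing only in case.
doc = (
    '<!DOCTYPE foo SYSTEM "http://example.com/STUFF/foo.dtd" [\n'
    '  <!ENTITY stuff "lower">\n'
    '  <!ENTITY STUFF "upper">\n'
    ']>\n'
    '<foo><lit>&STUFF; and &stuff;</lit></foo>'
)

# The moral equivalent of sed -e 's/STUFF/stuff/g'.
blind = doc.replace("STUFF", "stuff")

print(blind)
# The system identifier now points somewhere else, the two entity
# declarations share one name, and both references read &stuff;.
```

The whole-word variant fares no better here, since "/", "&" and ";" all count as word boundaries.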

To solve the harder task, there are three possible solutions:
(1) Use an off-the-shelf SAX parser, perform the substitution
on text events and attribute values, and reserialize it;
the output will have the same Infoset as the input, modulo
the required changes.  This approach will only work if
the precise lexical structure is immaterial.  If this
is not the case then I suggest (2) convince management
to change this requirement and implement solution (1).
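
A minimal sketch of solution (1), assuming Python's standard xml.sax module stands in for the off-the-shelf SAX parser. The class and function names are hypothetical, and the substitution is the "match anywhere" variant. One known gap: a SAX parser may deliver a run of text in several characters() calls, so an occurrence that straddles two calls would be missed unless adjacent text events are buffered first.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator
from xml.sax.xmlreader import AttributesImpl


class StuffFilter(XMLFilterBase):
    """Rewrite STUFF in text events and attribute values only."""

    def startElement(self, name, attrs):
        # Element and attribute names pass through; only values change.
        changed = {k: v.replace("STUFF", "stuff") for k, v in attrs.items()}
        super().startElement(name, AttributesImpl(changed))

    def characters(self, content):
        super().characters(content.replace("STUFF", "stuff"))


def rewrite(xml_text: str) -> str:
    out = io.StringIO()
    reader = StuffFilter(xml.sax.make_parser())
    # XMLGenerator reserializes the filtered event stream.
    reader.setContentHandler(XMLGenerator(out, encoding="utf-8"))
    reader.parse(io.BytesIO(xml_text.encode("utf-8")))
    return out.getvalue()


result = rewrite('<foo><lit text="STUFF">some STUFF here</lit></foo>')
print(result)
```

The reserialized output carries a fresh XML declaration, and details such as comments, the DOCTYPE declaration, and the original attribute quoting are not reproduced by this handler.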

If this isn't feasible, you'll need to (3) perform the
transformation and reserialization at the level of
XML lexical tokens.  The Expat parser has an internal
interface that reports individual XML lexemes; you could
base your program on that.  Or you could write your own
tokenizer; it's tedious but not difficult.
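
To give a feel for the hand-written tokenizer, here is a deliberately simplified Python sketch, not a conforming XML lexer. It assumes well-formed input, a DOCTYPE with no ">" or "[" inside quoted literals ahead of the internal subset, and no "]>" inside that subset. All names are hypothetical. It rewrites character data, CDATA section content, and attribute values, and copies every other token through untouched; whether comments should be rewritten is a policy question, and here they are left alone.

```python
import re

# Lexical tokens, tried in order; anything unmatched falls through to "other".
TOKEN = re.compile(
    "|".join([
        r"(?P<comment><!--.*?-->)",
        r"(?P<cdata><!\[CDATA\[.*?\]\]>)",
        r"(?P<pi><\?.*?\?>)",
        r"(?P<doctype><!DOCTYPE[^\[>]*(?:\[.*?\]\s*)?>)",
        r"""(?P<tag><[^<>"']*(?:(?:"[^"]*"|'[^']*')[^<>"']*)*>)""",
        r"(?P<text>[^<]+)",
        r"(?P<other><)",
    ]),
    re.DOTALL,
)
REFERENCE = re.compile(r"(&[^;\s]*;)")          # entity and character references
QUOTED = re.compile(r'"[^"]*"' + "|" + r"'[^']*'")  # attribute value literals


def swap(s: str) -> str:
    """Replace STUFF with stuff, leaving references such as &STUFF; alone."""
    parts = REFERENCE.split(s)  # odd indices hold the references
    return "".join(p if i % 2 else p.replace("STUFF", "stuff")
                   for i, p in enumerate(parts))


def rewrite_tokens(doc: str) -> str:
    out = []
    for m in TOKEN.finditer(doc):
        kind, tok = m.lastgroup, m.group(0)
        if kind == "text":
            tok = swap(tok)
        elif kind == "cdata":
            # Keep the 9-character "<![CDATA[" opener and the "]]>" closer.
            tok = tok[:9] + tok[9:-3].replace("STUFF", "stuff") + tok[-3:]
        elif kind == "tag":
            # Only quoted attribute values change; names are untouched.
            tok = QUOTED.sub(lambda q: swap(q.group(0)), tok)
        out.append(tok)  # comments, PIs, DOCTYPE pass through verbatim
    return "".join(out)


doc = ('<!DOCTYPE foo SYSTEM "http://example.com/STUFF/foo.dtd">\n'
       '<foo><!-- STUFF --><lit text="STUFF">STUFF &STUFF; '
       '<![CDATA[STUFF]]></lit></foo>')
print(rewrite_tokens(doc))
```

Because every token is either copied verbatim or edited in place, whitespace, quoting style, and declaration order come out exactly as they went in.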

Hope this helps,


