Re: A simple guy with a simple problem
Sean McGrath, er, "Bob" wrote:

> Hello, my name is Bob and I'm a programmer.

Hi Bob!

> I work for a B2B company. My task today is to process
> incoming XML documents that are known to be valid against the foo
> DTD and change all occurences of the word "STUFF" to "stuff".
>
> I need to leave the documents otherwise unchanged in all material
> respects as they are going on to a third company in a B2B chain.

The requirements are a bit fuzzy: for example, if the input contains "ASDFSTUFFQWERTY", does that count as an occurrence of the word "STUFF"? Also, the precise definition of "all material respects" is unclear (I gather that this is the real question).

At any rate, your best bet is to use sed (or the moral equivalent):

    sed -e 's/\<STUFF\>/stuff/g'

if STUFF should only be matched as a complete word, or

    sed -e 's/STUFF/stuff/g'

if the character sequence 'STUFF' should be matched anywhere.

This is guaranteed not to disturb any of the markup [*], since fortunately the DTD:

> <!ELEMENT foo (lit)*>
> <!ELEMENT lit (#PCDATA)>
> <!ATTLIST lit text CDATA "STUFF">

doesn't use "STUFF" as an element or attribute name. If it did, you'd have a harder task.

[*] Actually, this is a lie: it will break if the document starts with <!DOCTYPE foo SYSTEM "http://www.baz.com/STUFF/...">, or if it uses an internal general entity named "stuff" and another named "STUFF". (If it *only* contains one called "STUFF", it's unclear whether renaming this to "stuff" constitutes an unacceptable material change.) There may be other corner cases as well.

To solve the harder task, there are three possible solutions:

(1) Use an off-the-shelf SAX parser, perform the substitution on text events and attribute values, and reserialize it; the output will have the same Infoset as the input, modulo the required changes.

This approach will only work if the precise lexical structure is immaterial. If this is not the case then I suggest (2) convince management to change this requirement and implement solution (1).
If this isn't feasible, you'll need to (3) perform the transformation and reserialization at the level of XML lexical tokens. The EXPAT parser has an internal interface that reports individual XML lexemes; you could base your program on that. Or you could write your own tokenizer; it's tedious but not difficult.

Hope this helps,

--Joe
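For what it's worth, solution (1) above can be sketched in a few lines. This is only an illustration, assuming Python's standard xml.sax module and its XMLGenerator serializer; the names StuffFilter and rewrite are made up for the example, and the whole-word reading of "STUFF" is an assumption about the requirements.

```python
import io
import re
import xml.sax
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesImpl

# Assumption: "STUFF" is matched only as a complete word, like \<STUFF\> in sed.
WORD = re.compile(r"\bSTUFF\b")


class StuffFilter(XMLGenerator):
    """Hypothetical SAX handler: rewrites text and attribute values, then reserializes."""

    def __init__(self, out):
        super().__init__(out, encoding="utf-8")
        self._pending_text = []

    def _flush_text(self):
        # A parser may deliver one text node in several chunks, so the
        # substitution is applied only once the whole run has been collected.
        if self._pending_text:
            text = "".join(self._pending_text)
            self._pending_text = []
            super().characters(WORD.sub("stuff", text))

    def characters(self, content):
        self._pending_text.append(content)

    def startElement(self, name, attrs):
        self._flush_text()
        fixed = {key: WORD.sub("stuff", value) for key, value in attrs.items()}
        super().startElement(name, AttributesImpl(fixed))

    def endElement(self, name):
        self._flush_text()
        super().endElement(name)

    def processingInstruction(self, target, data):
        self._flush_text()
        super().processingInstruction(target, data)


def rewrite(xml_bytes):
    """Parse xml_bytes and return the reserialized document as a string."""
    out = io.StringIO()
    xml.sax.parseString(xml_bytes, StuffFilter(out))
    return out.getvalue()


if __name__ == "__main__":
    sample = b'<foo><lit text="STUFF">some STUFF, not ASDFSTUFFQWERTY</lit></foo>'
    print(rewrite(sample))
```

Note what the sketch does not preserve: XMLGenerator writes its own XML declaration and does not reproduce the DOCTYPE or comments, and attribute values defaulted from the DTD may come out as explicit attributes. That is the "precise lexical structure is immaterial" caveat in practice, and keeping comments would additionally need a lexical handler.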