Re: Tool converts records to XML
Roger L Costello <costello@mitre.org> writes:

> Michael Kay wrote:
>
>> the "Barnes & Noble" problem. The number #1 blunder
>> when writing XML is not to bother escaping `<` and `&`
>> if they happen to occur in your input.
>
> Ouch!
>
> You are right Michael.
>
> Upon reflection, I realized that there is an even nastier problem
> lurking than the problem of converting & and < in the input record
> data into &amp; and &lt; in the output XML.
>
> ...
>
> To implement the character conversions in AWK would be a monumental task.
>
> Eeeeeeek!
>
> Lesson Learned: Don't use AWK to convert records to XML.

Well, you may be right, and I believe many on this list share my preference for performing such conversions in XSLT and/or XQuery, but I have to say that the lesson you suggest seems a slightly broader conclusion than is warranted by the experience you describe.

A couple points of detail:

- Your downstream tools are likely to be somewhat happier if you convert the data to UTF-8 or UTF-16, but unless I am mistaken you are not in fact required to do so in order to turn the data into XML. XML does allow encoding declarations.

- If you do want to convert the encoding, it would surprise me a bit if awk had no constructs suitable for the work. It would surprise me even more if a system with awk did not have the iconv utility for converting textual data from one encoding to another.

    iconv --from-code=WINDOWS-1252 --to-code=UTF-8 < myinput > output.utf8

- Your note sounds as if you found it difficult to contemplate the horrifying task of escaping occurrences of & and < -- I don't see what you regard as so difficult. Of all the text formats I have worked with, XML is among the simplest as regards the number and nature of its rules, and especially the number of its magic characters. It has two, count 'em, two magic characters in ordinary textual content: ampersand and left angle bracket. (Add the magic string ']]>' if for some reason you choose to generate CDATA marked sections. Add the escaping of the delimiters if you are generating attribute values.)

This contrasts favorably, in my experience, with the number of characters you need to take care to escape if you are generating other formats, for example TeX. I have never seen the first cut at a TeX to XML conversion in which the programmer remembered to unescape all the escaped characters; I have never seen my first cut at a TeX document remember to escape all the characters that need escaping.

You will perhaps be saying now that tab- and comma-delimited formats are simpler. But even tab- or comma-delimited formats are likely to have at least two magic characters and maybe more: they need to escape at least their main delimiter (tab or comma), and then also to escape whatever mechanism is used to escape tab or comma: if some values are quoted, there will need to be ways to escape quotation marks within quoted values; if backslash escaping is used, backslash must also be escaped.

One reason I have come to despise CSV is that I have come across so many pieces of software which claim to accept CSV but whose authors have botched the parsing. (One of them had no way to allow commas in strings.)

I wonder if the lesson to be learned might more accurately be formulated as "When writing any program, pay attention to the formal definitions of your input and your output; if you don't, you are likely to produce output that is not in the specified output format."

If you are producing XML, you have at least the advantage that your downstream data consumers are likely to tell you what's wrong, instead of accepting the bad data and silently producing bad results.

--
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
http://blackmesatech.com
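For readers who want to see how small the escaping task described in the message above is, here is a minimal sketch in Python (not AWK, and not code from the thread). The function names escape_text, escape_attr, cdata and survives_parse are hypothetical, chosen only for illustration. It assumes attribute values are always written with double-quote delimiters, and the way cdata splits ']]>' across two marked sections is one common workaround, not something the message prescribes.

```python
import xml.etree.ElementTree as ET


def escape_text(s: str) -> str:
    """Escape the two magic characters of ordinary textual content."""
    # Ampersand goes first, so the '&' of '&lt;' is not escaped a second time.
    return s.replace("&", "&amp;").replace("<", "&lt;")


def escape_attr(s: str) -> str:
    """Escape a value for an attribute delimited by double quotes (assumption)."""
    return escape_text(s).replace('"', "&quot;")


def cdata(s: str) -> str:
    """Wrap s in a CDATA marked section, splitting any ']]>' it contains."""
    # ']]>' would end the section early, so close after ']]' and reopen before '>'.
    return "<![CDATA[" + s.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def survives_parse(s: str) -> bool:
    """Check that a parser hands back exactly the (non-empty) string we escaped."""
    doc = '<r a="{}">{}</r>'.format(escape_attr(s), escape_text(s))
    root = ET.fromstring(doc)
    return root.get("a") == s and root.text == s
```

Calling survives_parse('Barnes & Noble <"x">') parses the generated document with the standard-library parser and compares the result with the input, which is the "downstream consumer will tell you what's wrong" check in miniature.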