If XML is too hard for regular expressions, perhaps he'd be better off w
At 07:40 PM 3/29/2003 -0500, Elliotte Rusty Harold wrote:
>No, validation doesn't help because it has absolutely nothing to say about
>comments, processing instructions, CDATA sections, white space in tags, or
>character entities and very little to say about entity references. It's
>just too hard to tell what is and isn't the string you're looking for
>without using a genuine parser.

This is what I had thought most people would expect - regular expressions are not normally what you use to parse something described by a BNF. And it also agrees with other things I have read, e.g. [1]:

    This is a example of how NOT to process XML using Perl. Please don't
    use regular expressions on XML, in the very short run you will be
    bitten. This was by far the most painful example to write and although
    it does the job it will break for the next version of the RFC. Entity
    resolution especially is much easier if you use a parser.

And of course, with entities, default values, namespace prefixes, and CDATA sections, it's really quite difficult to interpret a document based only on textual patterns in the document instance itself. Many people here will remember the following example from [2], where these two elements must be treated as identical:

<item xmlns:dc="http://purl.org/dc/elements/1.1/">
  <title>MetaData</title>
  <dc:date>2003-01-12T00:18:05-05:00</dc:date>
  <link>http://bitworking.org/news/8</link>
  <description>Upon waking, the dinosaur...</description>
</item>

<root:item xmlns:bc="http://purl.org/dc/elements/1.1/" xmlns:root="">
  <root:title>MetaData</root:title>
  <bc:date>2003-01-12T00:18:05-05:00</bc:date>
  <root:link>http://bitworking.org/news/8</root:link>
  <description>Upon waking, the dinosaur...</description>
</root:item>

Of course, Joe wanted to solve this problem with an XML subset based on the following rules:

1. All namespace declarations must be done in the root element.
2. Never a declaration for the "" namespace. I.e. if an element sits in the "" namespace then the element name will never have a namespace qualifier.
3. No CDATA sections.
4. No DTDs.

Those rules are fine if you control the production of the XML as well as the consumption. If you don't, you need to be able to interpret whatever XML someone throws at you, and that may be too hard for regular expressions.

Isn't the lesson simply that you need a parser to interpret XML? And if so, why is that a problem? Most languages I use require a parser...

Jonathan

[1] Ways to Rome: Processing XML with Perl
    Original version by Ingo Macherius, <macherius@g...>
    Maintained by Michel Rodriguez, <m.v.rodriguez@i...>
    Version: 2.1: 2002-09-17
    http://xmltwig.cx/perl_survey/perl_survey.html

[2] Regex-able XML: Is there a Regex-able subset of XML?
    Joe Gregorio
    http://bitworking.org/news/40
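As a rough illustration of the point about the two item elements from [2], here is a minimal sketch in Python using the standard library's xml.etree.ElementTree. It is a cut-down version of the example and not the original documents: the names doc_a, doc_b, date_by_regex and fields_by_parser are made up for this sketch, only the title and date children are kept, a CDATA section is added to the second document, and the xmlns:root="" declaration is left out because, as far as I know, a namespace-aware parser is allowed to reject a prefix bound to the empty string.

```python
import re
import xml.etree.ElementTree as ET

# Namespace URI taken from the example in [2].
DC = "http://purl.org/dc/elements/1.1/"

# Hypothetical, cut-down variants of the two item elements.
doc_a = (
    '<item xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<title>MetaData</title>"
    "<dc:date>2003-01-12T00:18:05-05:00</dc:date>"
    "</item>"
)
doc_b = (
    '<item xmlns:bc="http://purl.org/dc/elements/1.1/">'
    "<title><![CDATA[MetaData]]></title>"
    "<bc:date>2003-01-12T00:18:05-05:00</bc:date>"
    "</item>"
)

# A textual pattern that hard-codes the "dc" prefix.
DATE_PATTERN = re.compile(r"<dc:date>(.*?)</dc:date>")


def date_by_regex(xml_text):
    """Return the date found by the prefix-specific pattern, or None."""
    match = DATE_PATTERN.search(xml_text)
    return match.group(1) if match else None


def fields_by_parser(xml_text):
    """Return (title, date), looking the date up by namespace URI."""
    root = ET.fromstring(xml_text)
    return (root.findtext("title"), root.findtext(f"{{{DC}}}date"))


print(date_by_regex(doc_a), date_by_regex(doc_b))
print(fields_by_parser(doc_a) == fields_by_parser(doc_b))
```

The pattern finds the date only in the document that happens to use the dc prefix, while the parser-based lookup goes by namespace URI and resolves the CDATA section, so both documents yield the same title and date.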






