
If XML is too hard for regular expressions, perhaps he'd be better off w


At 07:40 PM 3/29/2003 -0500, Elliotte Rusty Harold wrote:
>No, validation doesn't help because it has absolutely nothing to say about 
>comments, processing instructions, CDATA sections, white space in tags, or 
>character entities and very little to say about entity references. It's 
>just too hard to tell what is and isn't the string you're looking for 
>without using a genuine parser.

This is what I had thought most people would expect - regular expressions 
are not normally what you use to parse something described by a BNF. And it 
also agrees with other things I have read, e.g. [1]:

         This is a example of how NOT to process XML using Perl. Please
         don't use regular expressions on XML, in the very short run
         you will be bitten. This was by far the most painful example
         to write and although it does the job it will break for the
         next version of the RFC. Entity resolution especially is much
         easier if you use a parser.

And of course, with entities, default values, namespace prefixes, and CDATA 
sections, it's really quite difficult to interpret a document based only on 
textual patterns in the document instance itself. Many people here will 
remember the following example from [2], where these two elements must be 
treated as identical:

<item xmlns:dc="http://purl.org/dc/elements/1.1/">
   <title>MetaData</title>
   <dc:date>2003-01-12T00:18:05-05:00</dc:date>
   <link>http://bitworking.org/news/8</link>
   <description>Upon waking, the dinosaur...</description>
</item>

<root:item xmlns:bc="http://purl.org/dc/elements/1.1/" xmlns:root="" >
   <root:title>MetaData</root:title>
   <bc:date>2003-01-12T00:18:05-05:00</bc:date>
   <root:link>http://bitworking.org/news/8</root:link>
   <description>Upon waking, the dinosaur...</description>
</root:item>
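
As a rough illustration of the prefix point (a sketch of my own, not something from Joe's article), a namespace-aware parser reports the same expanded name for dc:date and bc:date, while a pattern written against the literal text only finds the prefix it was written for. The snippet uses Python's standard xml.etree.ElementTree. The two cut-down documents and the names DOC_A, DOC_B, NAIVE and date_values are made up for the example, and only the date element is kept from the items above.

```python
import re
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"

# Two hypothetical fragments: same namespace URI, different prefixes.
DOC_A = '<item xmlns:dc="%s"><dc:date>2003-01-12T00:18:05-05:00</dc:date></item>' % DC
DOC_B = '<item xmlns:bc="%s"><bc:date>2003-01-12T00:18:05-05:00</bc:date></item>' % DC


def date_values(xml_text):
    """Return the text of every date element in the Dublin Core namespace."""
    root = ET.fromstring(xml_text)
    # ElementTree exposes names in {uri}local form, so the prefix is irrelevant.
    return [el.text for el in root.iter("{%s}date" % DC)]


# A naive textual pattern tied to one particular prefix.
NAIVE = re.compile(r"<dc:date>(.*?)</dc:date>")

# The parser treats both documents alike.
print(date_values(DOC_A) == date_values(DOC_B))
# The pattern matches DOC_A but finds nothing in DOC_B.
print(NAIVE.findall(DOC_A), NAIVE.findall(DOC_B))
```

The pattern could be widened to accept any prefix, but it would then have to track which prefix is bound to which URI at each point in the document, which is the job the parser already does.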

Of course, Joe wanted to solve this problem with an XML subset based on the 
following rules:

1. All namespace declarations must be done in the root element.

2. Never a declaration for the "" namespace. I.e. if an element sits in
the "" namespace then the element name will never have a namespace
qualifier.

3. No CDATA sections.

4. No DTDs.

Those rules are fine if you control the production of the XML as well as 
the consumption. If you don't, you need to be able to interpret whatever 
XML someone throws at you, and that may be too hard for regular expressions.
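
To make "whatever XML someone throws at you" concrete, here is a small sketch, again in Python with xml.etree.ElementTree. The document in SAMPLE is invented for the example (as are the names NAIVE and parsed_title) and combines a commented-out element, white space inside a start tag, a CDATA section, and an entity reference. The regular expression is a deliberately simple one, not anybody's recommended pattern.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical input: legal XML that does not look the way a pattern expects.
SAMPLE = """<item>
  <!-- <title>Old title</title> -->
  <title
  >Meta<![CDATA[Data]]> &amp; more</title>
</item>"""

# A naive pattern that assumes tags are written in one fixed way.
NAIVE = re.compile(r"<title>(.*?)</title>")


def parsed_title(xml_text):
    """Return the title text as a parser sees it."""
    root = ET.fromstring(xml_text)
    # The default parser skips the comment, merges the CDATA section into
    # the surrounding text, and expands the &amp; reference.
    return root.find("title").text


# The pattern picks up the commented-out title and misses the real one.
print(NAIVE.findall(SAMPLE))
# The parser returns the actual content of the title element.
print(parsed_title(SAMPLE))
```

Each of these cases can be patched into the pattern one at a time, but the patches add up to reimplementing a good part of a parser.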

Isn't the lesson simply that you need a parser to interpret XML? And if so, 
why is that a problem? Most languages I use require a parser...

Jonathan

[1] Ways to Rome: Processing XML with Perl
     Original version by Ingo Macherius, <macherius@g...>
     Maintained by Michel Rodriguez, <m.v.rodriguez@i...>
     Version: 2.1: 2002-09-17
     http://xmltwig.cx/perl_survey/perl_survey.html

[2] Regex-able XML: Is there a Regex-able subset of XML?
     Joe Gregorio
     http://bitworking.org/news/40

