Re: XML Parsing using DOM parser


>At 10:29 AM 02/01/02 -0800, Deepa Venkatesan wrote:
> >           Kindly share your thoughts. I recently
> >wrote code to parse an XML file containing catalog
> >content (as big as 10MB) using a DOM parser. The
> >performance has been miserable, particularly as the
> >XML file size increased. The problem is that using a
> >SAX parser (the only other alternative that strikes
> >me) I would have to rewrite the complete XML, and the
> >code for this would be really elaborate. My final
> >objective in parsing is to change 2 lines for every
> >catalog item (the XML file has as many as 3000 catalog
> >items).
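
For what it's worth, a streaming rewrite need not be elaborate. Below is a
minimal sketch in Python, assuming the standard library's xml.sax module:
it echoes every SAX event back out unchanged and replaces only the text of
one element. The element name "price", the replacement value "0.00" and
the helper name rewrite() are hypothetical stand-ins, since the original
post does not say which two lines of each catalog item need changing.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator


class PriceRewriter(XMLGenerator):
    """Echo every SAX event unchanged, except the text inside <price>."""

    def __init__(self, out):
        super().__init__(out, encoding="utf-8")
        self._in_price = False

    def startElement(self, name, attrs):
        super().startElement(name, attrs)
        if name == "price":  # hypothetical element name
            self._in_price = True
            # Emit the replacement text once (hypothetical value).
            super().characters("0.00")

    def endElement(self, name):
        if name == "price":
            self._in_price = False
        super().endElement(name)

    def characters(self, content):
        # Drop the original text of <price>; pass everything else through.
        if not self._in_price:
            super().characters(content)


def rewrite(xml_bytes):
    """Return the rewritten document as a string."""
    out = io.StringIO()
    xml.sax.parseString(xml_bytes, PriceRewriter(out))
    return out.getvalue()


if __name__ == "__main__":
    sample = b"<catalog><item><name>A</name><price>9.99</price></item></catalog>"
    print(rewrite(sample))
```

Because the input is handled as a stream of events, memory use should not
grow with the number of catalog items the way a DOM tree does. Comments
and the DOCTYPE are not echoed by this sketch, so it is only a starting
point.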

[Tim Bray]

>This may be a job for perl or python.  Both have XML parsers;
>in perl and I assume python these can be set up with a bit of work
>to pass everything through and let you fiddle with just the
>pieces you want.  If the incoming data was generated by a
>machine it's quite likely sufficiently regular that you don't
>even need to use the XML parser, just pattern-match for the
>tags you care about. This will run faster and be less work
>to write. -Tim

...with the caveat that fully XML 1.0 compliant input, whether
innocent or malevolently crafted, may blow your application out
of the water if you bypass well-formedness (WF) parsing in this way.

Let's say your pattern matcher is triggering on
<invoice> start-tags. Likely candidates for problems
include:
         comments
         CDATA sections
         General Entity Refs


Comments:
         <!-- this ain't no <invoice> start-tag -->

CDATA sections:
         <![CDATA[
         this ain't no <invoice> start-tag
         ]]>
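
The comment and CDATA cases can be checked with a small Python sketch
using only the standard library. The sample document (including the
<note> wrapper element) is made up for illustration: it contains one real
<invoice> element plus the two decoys shown above.

```python
import re
import xml.etree.ElementTree as ET

# Made-up document: one real <invoice>, plus a comment and a CDATA decoy.
DOC = """<foo>
<!-- this ain't no <invoice> start-tag -->
<note><![CDATA[
this ain't no <invoice> start-tag
]]></note>
<invoice>the real one</invoice>
</foo>"""

# Naive pattern matching counts every textual occurrence of the tag.
naive_hits = len(re.findall(r"<invoice>", DOC))

# A real parser reports only genuine element start-tags.
parsed_hits = len(ET.fromstring(DOC).findall(".//invoice"))

print(naive_hits, parsed_hits)  # prints: 3 1
```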

General Entity Refs:
         <!DOCTYPE foo [
         <!ENTITY bar SYSTEM "bar.xml">
         ]>
         <foo>
         <!-- lots of invoices in here but your pattern-matcher will never see them -->
         &bar;
         </foo>
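
A rough Python illustration of the entity problem follows. To keep it
self-contained it swaps the external entity above for an internal one
(an assumption made for the example only), declared once and referenced
twice. With a truly external bar.xml the pattern matcher would find
nothing at all in the main document, and the parser would have to be
configured to fetch the entity.

```python
import re
import xml.etree.ElementTree as ET

# Internal entity standing in for the external bar.xml (illustrative only).
DOC = """<!DOCTYPE foo [
<!ENTITY bar "<invoice>hidden</invoice>">
]>
<foo>
&bar;
&bar;
</foo>"""

# The pattern matcher sees only the literal text of the declaration.
naive_hits = len(re.findall(r"<invoice>", DOC))

# The parser expands each &bar; reference into a real <invoice> element.
parsed_hits = len(ET.fromstring(DOC).findall("invoice"))

print("pattern matcher:", naive_hits, "parser:", parsed_hits)
```

The two counts disagree because the number of invoices in the document
depends on how often the entity is referenced, which the pattern matcher
has no way of knowing.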

Oh, and by the way, if your app needs to trigger on namespace-qualified
tags, pattern matching gets you into deep trouble if there are
default namespace declarations around.
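
A small Python sketch of the namespace trap, again using only the standard
library. The namespace URI, the "b" prefix and the element names are
hypothetical. Both documents mean the same thing to a namespace-aware
parser, but a matcher keyed on the prefixed spelling misses the one that
uses a default namespace declaration.

```python
import re
import xml.etree.ElementTree as ET

NS = "urn:example:billing"  # hypothetical namespace URI

# Same vocabulary, two spellings: explicit prefix vs. default namespace.
PREFIXED = f'<b:batch xmlns:b="{NS}"><b:invoice/></b:batch>'
DEFAULTED = f'<batch xmlns="{NS}"><invoice/></batch>'


def naive_count(text):
    """Count start-tags by their literal, prefixed spelling."""
    return len(re.findall(r"<b:invoice\b", text))


def parsed_count(text):
    """Count invoice elements by namespace URI plus local name."""
    return len(ET.fromstring(text).findall(f".//{{{NS}}}invoice"))


print(naive_count(PREFIXED), naive_count(DEFAULTED))    # prints: 1 0
print(parsed_count(PREFIXED), parsed_count(DEFAULTED))  # prints: 1 1
```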

In my opinion, skipping WF parsing is too dangerous to countenance in
all but "throwaway" apps where you can live with the gotchas. For all other
cases, I'd advocate using a parser, and/or being more specific than saying
"use XML" when tying down interchange notations.

I go on about this periodically [1] and am delighted to have this opportunity
to re-wind my broken record so early in 2002:-)

regards,
Sean

[1] http://lists.xml.org/archives/xml-dev/200002/msg00232.html


