
Re: Exploiting multi-core CPUs during XML parsing

  • To: xml-dev@l...
  • Subject: Re: Exploiting multi-core CPUs during XML parsing
  • From: Tatu Saloranta <cowtowncoder@y...>
  • Date: Sat, 1 Apr 2006 14:23:52 -0800 (PST)
  • In-reply-to: <442E2A2A.1070003@p...>

--- Sean McGrath <sean.mcgrath@p...> wrote:

> I have sketched out an algorithm for fast XML WF
> parsing utilising two 
> threads that each start at opposite ends of the
> octet-stream and meet in 
> the middle. The algorithm hinges on the fact that
> start- and end-tags 
> are balanced. i.e. as one thread reads forward
> looking for foo 
> start-tag, the other thread is reading backwards
> looking for foo end-tag.
> 
> This also has the nice side effect of giving you
> accurate error messages 
> quickly. i.e. as soon as a mismatched tag is found,
> it can be reported. 

Keep in mind that in general you have lots of
subtrees, so you cannot really assume that one thread
only matches start tags and the other only end tags.
So you don't really know which start tag matches
which end tag before parsing the whole file.
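
To illustrate the subtree point, here is a minimal sketch (not
from the original thread). It assumes a toy subset of XML with
plain start and end tags only: no attributes, empty-element
tags, comments, CDATA or PIs. The function names are made up
for this example. It compares a naive "i-th start tag from the
front pairs with the i-th end tag from the back" check against
an ordinary forward stack check:

```python
import re

# Toy tokenizer: matches <name> and </name> only (assumption: no
# attributes, no empty-element tags, no comments/CDATA/PIs).
TAG = re.compile(r"<(/?)([A-Za-z_][\w.-]*)\s*>")


def naive_two_ended_check(xml):
    """Pair the i-th start tag (from the front) with the i-th end tag
    (from the back), as two opposite-direction readers might."""
    starts = [m.group(2) for m in TAG.finditer(xml) if not m.group(1)]
    ends = [m.group(2) for m in TAG.finditer(xml) if m.group(1)]
    return starts == ends[::-1]


def stack_check(xml):
    """Conventional forward well-formedness check using a stack."""
    stack = []
    for m in TAG.finditer(xml):
        closing, name = m.group(1), m.group(2)
        if not closing:
            stack.append(name)
        elif not stack or stack.pop() != name:
            return False
    return not stack


nested = "<a><b><c></c></b></a>"
siblings = "<r><a></a><b></b></r>"

print(naive_two_ended_check(nested), stack_check(nested))      # True True
print(naive_two_ended_check(siblings), stack_check(siblings))  # False True
```

The naive pairing only works for a single chain of nested
elements. As soon as there are sibling subtrees, it rejects a
document that is in fact well-formed.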

> This is particularly useful with recursive element
> types.

This could work for a limited subset of XML, but there
are a few gotchas. Some problems you may face are:

* Namespace resolution pretty much has to be done in
document order
* Entity expansion is tricky to do; you need to
backtrack (from the reverse reader) when hitting a '&'.
Also, when resolving external entities, you may have
to read the whole external entity into memory first, to
find the end (from the reverse reader).
* CDATA sections need special handling. Fortunately,
]]> is not allowed anywhere in textual content, so you
can match that. However, a more serious problem is how
to match the opening delimiter, since that is allowed
to be repeated inside a CDATA section, like:
"<![CDATA[<![CDATA[<![CDATA ...]]>"
* Processing instructions (like CDATA) are a pain to
parse, since they can have as many start markers as
they want ("<? proc instr <? <? <? <? ... .?>").
* Comments are quite easy, fortunately, as they
cannot contain '--'.
* Handling of the internal subset probably has to be
done in forward (non-reverse) order. Not a huge issue
(since it's near the start), but something to keep in
mind.
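
The CDATA item in the list above can be made concrete with a
small sketch (again not from the original thread, with made-up
function names). It assumes a document containing exactly one
CDATA section and no comments or PIs. A forward reader takes
the first opening delimiter; a naive reverse reader that walks
back from "]]>" to the nearest "<![CDATA[" stops too early:

```python
OPEN, CLOSE = "<![CDATA[", "]]>"


def cdata_forward(doc):
    """Forward reader: the first OPEN starts the section and the
    first CLOSE after it ends the section."""
    start = doc.find(OPEN)
    if start < 0:
        return None
    end = doc.find(CLOSE, start + len(OPEN))
    if end < 0:
        return None
    return doc[start + len(OPEN):end]


def cdata_reverse_naive(doc):
    """Naive reverse reader: find the last CLOSE, then walk back to
    the nearest OPEN before it. This is wrong whenever OPEN is
    repeated inside the section."""
    end = doc.rfind(CLOSE)
    if end < 0:
        return None
    start = doc.rfind(OPEN, 0, end)
    if start < 0:
        return None
    return doc[start + len(OPEN):end]


doc = "<a><![CDATA[x <![CDATA[ y]]></a>"
print(repr(cdata_forward(doc)))        # 'x <![CDATA[ y'
print(repr(cdata_reverse_naive(doc)))  # ' y'
```

The reverse reader cannot tell from the nearest opening
delimiter alone whether it is the real one; it has to keep
scanning backwards, which is the same difficulty the
processing-instruction item describes for "<?".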

Also, another question is how are you planning to
combine the results? I assume you'd build a DOM
(-like) result tree, since it's not easy to think of a
useful stream abstraction.

Anyway, it may still be a useful exercise in figuring
out how to do things, although chances are there may
not be significant speedup in the end ;-)

-+ Tatu +-

