Re: Exploiting multi-core CPUs during XML parsing

To: xml-dev@l...
Subject: Re: Exploiting multi-core CPUs during XML parsing
From: Tatu Saloranta <cowtowncoder@y...>
Date: Sat, 1 Apr 2006 14:23:52 -0800 (PST)
Domainkey-signature: a=rsa-sha1; q=dns; c=nofws; s=s1024; d=yahoo.com; h=Message-ID:Received:Date:From:Subject:To:In-Reply-To:MIME-Version:Content-Type:Content-Transfer-Encoding; b=j4PpXQoNO6c/ePCS65EZJQcxYukAAi01isFOqVAWm2OVoimD8qJJ30d6oinDDWN/oSC8eXsJoWaQD3UN0o0Td3o73RqJdIZnwFBdSvq44bUFsdzEWD3gj2i+K8B3qmVzPVIlTmVAZC6DbUe9hHuC1Cvq+MAxnWunEOsSERfrxGk= ;
In-reply-to: <442E2A2A.1070003@p...>

Play the video

--- Sean McGrath <sean.mcgrath@p...> wrote:

> I have sketched out an algorithm for fast XML WF
> parsing utilising two 
> threads that each start at opposite ends of the
> octet-stream and meet in 
> the middle. The algorithm hinges on the fact that
> start- and end-tags 
> are balanced. i.e. as one thread reads forward
> looking for foo 
> start-tag, the other thread is reading backwards
> looking for foo end-tag.
> 
> This also has the nice side effect of giving you
> accurate error messages 
> quickly. i.e. as soon as a mismatched tag is found,
> it can be reported. 

Keep in mind that in general you have lots of
subtrees, so you can not really assume that one thread
only matches start elements, and the other end
elements. So you don't really know which start tag
matches which end tag, before parsing the whole file.

> This is particularly useful with recursive element
> types.

This could work for a limited subset of XML, but there
are a few gotchas. Some problems you may face are:

* Namespace resolution pretty much has to be done in
document order
* Entity expansion is tricky to do; you need to
backtrack (from reverse reader) when hitting a '&'.
 Also, when resolving external entities, you may have
to read the whole external entity in-memory first, to
find the end (from reverse reader).
* CDATA sections need special handling. Fortunately,
]]> is not allowed anywhere in textual content, so you
can match that. However, more serious problem is how
to match the opening delimited, since that is allowed
to be repeated in CDATA, like:
"<![CDATA[<![CDATA[<![CDATA ...]]>"
* Processing instructions (like CDATA) are a pain to
parse, since they can have as many start markers as
they want ("<? proc instr <? <? <? <? ... .?>").
* Comments are quite easy, fortunately, as they can
not contain '--'.
* Handling of internal subset probably has to be done
in forward (non-reverse) order. Not a huge issue
(since it's near the start, but something to keep in
mind.

Also, another question is how are you planning to
combine the results? I assume you'd build a DOM
(-like) result tree, since it's not easy to think of a
useful stream abstraction.

Anyway, it may still be a useful exercise in figuring
out how to do things, although chances are there may
not be significant speedup in the end ;-)

-+ Tatu +-

__________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com

References:
- Exploiting multi-core CPUs during XML parsing
  - From: Sean McGrath <sean.mcgrath@p...>

Prev by Date: Re: Exploiting multi-core CPUs during XML parsing
Next by Date: Re: Exploiting multi-core CPUs during XML parsing
Previous by thread: RE: Exploiting multi-core CPUs during XML parsing
Next by thread: Re: Exploiting multi-core CPUs during XML parsing
Index(es):
- Date
- Thread

PURCHASE STYLUS STUDIO ONLINE TODAY!

Purchasing Stylus Studio from our online shop is Easy, Secure and Value Priced!

Download The World's Best XML IDE!

Accelerate XML development with our award-winning XML IDE - Download a free trial today!

Subscribe in XML format

RSS 2.0
Atom 0.3

Stylus Studio has published XML-DEV in RSS and ATOM formats, enabling users to easily subcribe to the list from their preferred news reader application.

Stylus Studio Sponsored Links are added links designed to provide related and additional information to the visitors of this website. they were not included by the author in the initial post. To view the content without the Sponsor Links please click here.

XML Editor - Download a 15 Day Free Trial Now >

See What's New in Stylus Studio >

Buy Stylus Studio - XML Editor - Now >