[XML-DEV Mailing List Archive Home] [By Thread] [By Date] [Recent Entries] [Reply To This Message] Re: Exploiting multi-core CPUs during XML parsing
--- Sean McGrath <sean.mcgrath@p...> wrote: > I have sketched out an algorithm for fast XML WF > parsing utilising two > threads that each start at opposite ends of the > octet-stream and meet in > the middle. The algorithm hinges on the fact that > start- and end-tags > are balanced. i.e. as one thread reads forward > looking for foo > start-tag, the other thread is reading backwards > looking for foo end-tag. > > This also has the nice side effect of giving you > accurate error messages > quickly. i.e. as soon as a mismatched tag is found, > it can be reported. Keep in mind that in general you have lots of subtrees, so you can not really assume that one thread only matches start elements, and the other end elements. So you don't really know which start tag matches which end tag, before parsing the whole file. > This is particularly useful with recursive element > types. This could work for a limited subset of XML, but there are a few gotchas. Some problems you may face are: * Namespace resolution pretty much has to be done in document order * Entity expansion is tricky to do; you need to backtrack (from reverse reader) when hitting a '&'. Also, when resolving external entities, you may have to read the whole external entity in-memory first, to find the end (from reverse reader). * CDATA sections need special handling. Fortunately, ]]> is not allowed anywhere in textual content, so you can match that. However, more serious problem is how to match the opening delimited, since that is allowed to be repeated in CDATA, like: "<![CDATA[<![CDATA[<![CDATA ...]]>" * Processing instructions (like CDATA) are a pain to parse, since they can have as many start markers as they want ("<? proc instr <? <? <? <? ... .?>"). * Comments are quite easy, fortunately, as they can not contain '--'. * Handling of internal subset probably has to be done in forward (non-reverse) order. Not a huge issue (since it's near the start, but something to keep in mind. Also, another question is how are you planning to combine the results? I assume you'd build a DOM (-like) result tree, since it's not easy to think of a useful stream abstraction. Anyway, it may still be a useful exercise in figuring out how to do things, although chances are there may not be significant speedup in the end ;-) -+ Tatu +- __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com
|
PURCHASE STYLUS STUDIO ONLINE TODAY!Purchasing Stylus Studio from our online shop is Easy, Secure and Value Priced! Download The World's Best XML IDE!Accelerate XML development with our award-winning XML IDE - Download a free trial today! Subscribe in XML format
|