[XML-DEV Mailing List Archive Home] [By Thread] [By Date] [Recent Entries] [Reply To This Message] Re: XML processing experiments
Tim Bray wrote: > The following are candidates for why a program like Lark or MSXML > might run slower. > - works with Java char rather than byte variables > - does a method dispatch (or at least a few conditionals) per > character processed for at least two reasons: to manage the entity > stack, Given XML's requirements that entity references in the instance are synchronous, I would have thought that the overhead of an entity stack could be avoided for parsing the instance. The parser passes the application an entity reference event, and the application can then, if it chooses, recursively invoke the parser to parse the referenced entity. > and to have a place to put the different character encoding > processing modules. The issue of how to deal with multiple encodings is an interesting one. The straightforward approach is to abstract an encoding as a process that converts from bytes to characters (and vice-versa) and perform this conversion process before parsing. This involves a significant performance hit. This is particularily the case if you want to get correct byte offsets when using a variable width encoding (such as UTF-8); it's hard to do this without a method call per character. I think it ought to be possible to abstract an encoding in a way that avoids this. Instead of having a two-stage process -- convert a stream of bytes to a stream of characters and then divide the stream of characters into tokens -- there would be a one stage process that converted a stream of bytes into a stream of characters already split up into tokens. I haven't worked this through yet. > - does quite a bit more work upon recognizing some markup > constructs; in particular for a start tag it pulls > appart the attribute list and packages up the element type > & attributes in a nice structure convenient for an API user One ought to be able to do at least some of this work lazily. For example, your API can say here's the start-tag, and can then provide a separate function that pulls apart the start-tag if the user needs it. This could be useful in reducing the heap usage when building trees for large documents (MSXML used about 25Mb on the ot.xml file): instead of building some complex data structure representing the element type and attributes for each element, you simply store a pointer to the element's start tag, and then parse the start-tag to extract the element type and attributes if and when the user requests them. This would work particularily well with file mapping (CreateFileMapping/mmap). Is there any way to get at this sort of OS functionality in Java? > One other experiment would be useful, that might shed light from > a different angle. James, how about doing element counts per type; > i.e. actually *using* some of the info come back from the tokenizer, > nothing fancy, just use a java.util.Hashtable or some such; should be > able to run very similar code on Lark and your TokenStream thing; I > wonder if it would change the numbers. I've done this. Here are the timings I got: [0. Using bytes, counting total number of elements: 1.5s] 1. Using chars instead of bytes and Reader instead of InputStream; use the standard InputStreamReader: 2.3s 2. Same as 1, but roll my own XmlInputStreamReader that simply throws an exception on non-ASCII characters: 1.9s 3. Same as 2, but extract element type name on end tag as a String: 2.3s 4. Same as 3, but count element types using a Hashtable: 2.8s 5. Same as 5, but use custom hash table to avoid allocating String when element type name is already in the table: 2.4s James xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@i... Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To (un)subscribe, mailto:majordomo@i... the following message; (un)subscribe xml-dev To subscribe to the digests, mailto:majordomo@i... the following message; subscribe xml-dev-digest List coordinator, Henry Rzepa (mailto:rzepa@i...)
|
PURCHASE STYLUS STUDIO ONLINE TODAY!Purchasing Stylus Studio from our online shop is Easy, Secure and Value Priced! Download The World's Best XML IDE!Accelerate XML development with our award-winning XML IDE - Download a free trial today! Subscribe in XML format
|