|
[XML-DEV Mailing List Archive Home] [By Thread] [By Date] [Recent Entries] [Reply To This Message] Re: Fast text output from SAX?
Elliotte Rusty Harold wrote: > At 9:37 PM -0400 4/13/04, Stephen D. Williams wrote: > ... > >> input document, delta, perform read/update/delete, insert, append >> output new delta > > In practice input and output are the only things that matter. Compared > to this all manipulations of XML in memory are in the noise by several > orders of magnitude. Optimizing read, update, delete, insert, etc. is > pointless. These things all happen in effectively zero time compared > to parsing, serialization, and sometimes the program's unique > algorithms. (If the input is an XML document containing 5000 cities > for which to solve the travelling salesman problem, no XML > optimization will help you.) But the XML manipulations and traversal > are trivial. If this were strictly true, then esXML would be the only solution from an efficiency perspective. In pure esXML/esDOM processing, there is never any parsing or serialization or object creation or object population and the memory allocation needed is very minimal. (There can be an optional step of 'condensation' on 'output' to fully optimize space use, not needed in most cases I think.) The fact is that creating, populating, and manipulating a data model has costs. This is true of DOM, SAX (where the data model is managed by the application), esXML (where the data model is also the 'serialized' format so all costs are manipulation), and all other applications that involve internal and external data (Corba, DCOM, ONC-RPC, ASN.1/xER, etc.). It's not fair to ignore part of the processing cycle for a format (esXML) that trades some manipulation overhead for all parsing/serialization/object creation/object population overhead. The performance equation for esXML in 'native' mode is: XML - parsing - serialization - memory allocation - object creation - object population - extra memory use (5x for DOM, right?) + per-element processing overhead - better locality of reference factor Additionally, the whole parsing etc. stream for XML must be completely performed, in DOM cases and many SAX cases, for every element of a document/object. With esXML, if a 3000 element document/object were read in and 5 elements manipulated, you only spend 5*element-manipulation-overhead. Are there pathological cases where conversion-to-native vs. managed-data is better because of the magnitude of processing? Certainly, and the traveling salesman problem may be a good example (if it's a single-process algorithm). I am trying to push the tradeoff crossover point far past the point where that is true for most business-type applications. Certainly a distributed N-tier SOA processing environment with many hops through specialized service steps should show large gains. For anyone who doesn't think that something like esXML is possible, there is a simple 'works as if' design that, although suboptimal in space and other ways, achieves the avoidance of parsing and serialization: Imagine a DOM library where the document object was a growable buffer, or growable array of blocks. Each node that you create, rather than being on the global heap (i.e. malloced/new) is instead created as an allocation of part of the data space of the document block. Each node, instead of being connected to other nodes by pointers/references, was instead connected by network-byte-order relative pointers/offsets. The data is stored as part of a node in XML-compatible character encoding (by default), rather than a pointer/reference to another native object (like a Java string). Traversing, reading, updating, and deleting are all done similarly to a regular DOM data structure, but at any time, you can perform a simple write of the data space 'on the wire' (i.e. over the network), to a file, to shared memory, etc. Once read in again, no preprocessing is necessary to begin access. Some data formats have been used in a read-in-and-traverse model, but few or none have been both general purpose (a la XML) and allowed in-place update. If you cared nothing about space use and other semantics and constraints that I do care about, this would work. It is not just a 'serialized DOM', but a directly accessible and manipulatable DOM. esXML does not use a DOM-based linked node model as above, but rather a packed tokenized length-counted format with an underlying data structure, 'elastic memory', that minimizes data movement for updates. Optional tokenization tables and element indexes, along with a low-level delta method add to efficiency aspects. I've been writing the code, but I'm slow as this is still a personal project so I can retain control and have the main version be open source while my professional life is all-consuming. I'm sure that community development will change things a lot, but I want to get it close to right to begin with. sdw -- swilliams@h... http://www.hpti.com Per: sdw@l... http://sdw.st Stephen D. Williams 703-724-0118W 703-995-0407Fax 20147-4622 AIM: sdw begin:vcard fn:Stephen Williams n:Williams;Stephen email;internet:sdw@l... tel;work:703-724-0118 tel;fax:703-995-0407 tel;pager:sdwpage@l... tel;home:703-729-5405 tel;cell:703-371-9362 x-mozilla-html:TRUE version:2.1 end:vcard
|
PURCHASE STYLUS STUDIO ONLINE TODAY!Purchasing Stylus Studio from our online shop is Easy, Secure and Value Priced! Download The World's Best XML IDE!Accelerate XML development with our award-winning XML IDE - Download a free trial today! Subscribe in XML format
|
|||||||||

Cart








