re: Request for info about parser construction details.
Jose Luis Sierra Rodriguez writes: > I would like to find some information about technical details > regarding the developing of parsers for SGML and XML. Of course > that the obvious solution is to examine the sources of some > existing parsers ;), but I would like to avoid myself this effort > by reading some higher-level documentation. In particular, I'm > looking for the sort of things that you can find in a general > textbook about compiler construction (for instance, that of Aho, > Sethi and Ullman) but applied to the particular features of > SGML/XML. Any pointer on this subject will be wellcome. Unfortunately, we all end up learning by doing. Here are a few pseudo-random tips that jumped to my fingers: 1. Recursive descent is just as fast as finite state and results in small, easy-to-understand code; however, unlike a finite state parser, a recursive-descent parser cannot easily save its state, return to the caller, and then start parsing again where it was. That means that recursive-descent is suitable for a push interface (like SAX), but not for a pull interface. 2. Too much copying can be brutal; if speed is your concern, try working directly from your input buffer as much as possible rather than copying parts of the buffer into strings or new arrays. 3. A large input buffer makes for faster batch performance, but a small buffer makes for better interactive performance (as always). 4. As a general principle, leave the heap alone and concentrate on the stack. Try to allocate everything you'll need at initialization time, rather than doing it over and over in main loop (since the main loop may go through tens of thousands of iterations in a second or two). This is especially important for garbage-collection languages like Java, since too much GC can grind your parser almost to a halt, but the heap is expensive even in C/C++. 5. If you're doing Namespace processing, you have to read *all* of the attributes before you can perform any processing on the element name or qualified attribute names, since the xmlns* attribute could be the last one. 6. Don't be afraid to place an upper limit on the amount of character-data content you'll process at once. You could end up with the pathological case of an XML document containing a start tag, 1GB of text, and an end tag -- there's nothing wrong with breaking the text up into arbitrary segments. 7. It's a good idea to layer your parser into an event stream (SAX or otherwise) and a tree API (DOM or otherwise), rather than having the low-level parser build the tree directly. Build tiny layers that are easy to maintain and test. 8. Profile, profile, profile. You'll be amazed at where the bottlenecks turn up. Of course, the standard advice -- document heavily, incorporate unit tests into your standard build procedures, etc. -- also applies. All the best, David -- David Megginson david@m... http://www.megginson.com/
PURCHASE STYLUS STUDIO ONLINE TODAY!
Purchasing Stylus Studio from our online shop is Easy, Secure and Value Priced!
Download The World's Best XML IDE!
Accelerate XML development with our award-winning XML IDE - Download a free trial today!
Subscribe in XML format