Re: XML parser using lex & yacc
This note is only for people really interested in low-level parser
implementation details; others, please ignore.

Richard Tobin wrote:
>
> > I want to develop an XML parser in C or maybe C++ for an
> > undergraduate university project.  My approach will be to prototype
> > the parser using flex and bison.
> Using your own lexer may be the best approach

Probably not for an undergraduate project.  Too error-prone and
time-consuming, I'd wager.  And in fact, if XML had to be approached
this way I'd say that it was fundamentally mis-designed.  It would be a
very bad mistake to design a modern markup language this way.  I don't
know if the XML designers thought about these issues, but it is
possible, with a few kludges, to parse it with Flex and Bison.

> Or you might be able to replace the lexer's input functions and change
> its character type to integer (if it isn't already); this would work
> for UTF-16 (the other required encoding) too.

Get the flex source at prep.ai.mit.edu (/pub/gnu/flex or whatever) and
patch the source with James Lauth's Unicode patches:

    ftp://ftp.lauton.com/pub/flex-2.5.4-unicode-patch.tar.gz

Override the default Flex input routine with one that checks the file
format.  All it has to do is parse the first few chars of the XML decl
as per the relevant appendix to the XML spec, then read the entire XML
decl for an encoding decl.  You then rewind the file, store the file's
format and other important information in a lookup table, and use that
lookup table when reading in characters to determine what translation
to use for that file.  Convert everything to UCS-4, or perhaps UTF-16,
internally, so the above Flex patches will work.  You only need to do
the format check once for every file, since thereafter the lookup table
may be consulted.

> The most obvious problem with using yacc/lex type tools for XML is
> that keywords aren't always keywords.  For example, in some places
> in the DTD "SYSTEM" is a keyword and in others it would just be
> a name.

Just make sure that all your keywords can be both keywords and Name
sequences (you'll see what I mean when you read the spec).  Then write
your syntax rules so that wherever you need an Nmtoken sequence in the
parser it will accept a Name or a Nmtoken sequence.  This can easily be
accomplished by having a rule, NameOrNmtoken : Name | Nmtoken, and by
then using NameOrNmtoken wherever you'd be inclined to use Nmtoken.
You'll see what I mean once you start writing the parser.

Would have been a lot easier if XML had introduced the notion of
reserved words.  Would also have been easier if the XML spec had
abandoned the notion of whitespace as a grammatically significant token
inside of markup.  Inside markup it should essentially be ignored at
the parser's (as opposed to lexer's) level.  It's the way virtually all
modern languages are designed.  And I gather (Handbook, 65 [371:16])
that it's largely how SGML should work as well.

--
Richard Goerwitz
PGP key fingerprint: C1 3E F4 23 7C 33 51 8D 3B 88 53 57 56 0D 38 A0
For more info (mail, phone, fax no.): finger richard@g...

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@i...
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To (un)subscribe, mailto:majordomo@i... the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo@i... the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa@i...)
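
The format check that the message suggests for the replacement input
routine can be sketched in C roughly as follows.  This is only a minimal
sketch, not code from the message or from the Flex patches: every name
in it (sniff_encoding, read_encoding_decl, the ENC_* constants) is made
up for illustration, and the byte patterns are the ones I recall from
the XML spec's appendix on autodetection of character encodings, so
check them against the spec before relying on them.  In Flex the usual
hook for a custom input routine is the YY_INPUT macro; wiring the sketch
into it, and the per-file lookup table, are left out here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* All names here are hypothetical; the message defines none of them. */
typedef enum {
    ENC_UCS4_BE, ENC_UCS4_LE, ENC_UTF16_BE, ENC_UTF16_LE, ENC_UTF8,
    ENC_ASCII_COMPATIBLE, /* "<?xm" seen: go on to read the encoding decl */
    ENC_EBCDIC,
    ENC_DEFAULT_UTF8      /* no pattern matched: assume UTF-8 */
} xml_encoding_family;

struct signature {
    unsigned char bytes[4];
    size_t length;
    xml_encoding_family family;
};

/* Longer patterns come first so that FF FE 00 00 (UCS-4) is not taken
 * for the FF FE UTF-16 byte order mark.  Unusual UCS-4 octet orders are
 * left out of this sketch. */
static const struct signature signatures[] = {
    { {0x00, 0x00, 0xFE, 0xFF}, 4, ENC_UCS4_BE },   /* byte order mark */
    { {0xFF, 0xFE, 0x00, 0x00}, 4, ENC_UCS4_LE },   /* byte order mark */
    { {0x00, 0x00, 0x00, 0x3C}, 4, ENC_UCS4_BE },   /* "<" */
    { {0x3C, 0x00, 0x00, 0x00}, 4, ENC_UCS4_LE },   /* "<" */
    { {0x00, 0x3C, 0x00, 0x3F}, 4, ENC_UTF16_BE },  /* "<?" without a mark */
    { {0x3C, 0x00, 0x3F, 0x00}, 4, ENC_UTF16_LE },  /* "<?" without a mark */
    { {0x3C, 0x3F, 0x78, 0x6D}, 4, ENC_ASCII_COMPATIBLE }, /* "<?xm" */
    { {0x4C, 0x6F, 0xA7, 0x94}, 4, ENC_EBCDIC },    /* "<?xm" in EBCDIC */
    { {0xEF, 0xBB, 0xBF, 0x00}, 3, ENC_UTF8 },      /* byte order mark */
    { {0xFE, 0xFF, 0x00, 0x00}, 2, ENC_UTF16_BE },  /* byte order mark */
    { {0xFF, 0xFE, 0x00, 0x00}, 2, ENC_UTF16_LE }   /* byte order mark */
};

/* Guess the encoding family from the first few bytes of a file. */
xml_encoding_family sniff_encoding(const unsigned char *buf, size_t len)
{
    size_t i;

    assert(buf != NULL || len == 0);
    for (i = 0; i < sizeof signatures / sizeof signatures[0]; i++) {
        const struct signature *s = &signatures[i];
        if (len >= s->length && memcmp(buf, s->bytes, s->length) == 0)
            return s->family;
    }
    return ENC_DEFAULT_UTF8;
}

static const char *skip_space(const char *p)
{
    while (*p == ' ' || *p == '\t' || *p == '\r' || *p == '\n')
        p++;
    return p;
}

/* For ASCII-compatible input only: copy the encoding name out of the
 * XML declaration at the start of decl.  Returns 1 on success, 0 if
 * there is no declaration, no encoding decl, or out is too small. */
int read_encoding_decl(const char *decl, char *out, size_t outsize)
{
    const char *end, *p, *start;
    char quote;
    size_t n;

    if (strncmp(decl, "<?xml", 5) != 0)
        return 0;
    end = strstr(decl, "?>");
    if (end == NULL)
        return 0;
    p = strstr(decl, "encoding");
    if (p == NULL || p >= end)
        return 0;               /* "encoding" lies outside the declaration */
    p = skip_space(p + 8);
    if (*p != '=')
        return 0;
    p = skip_space(p + 1);
    quote = *p;
    if (quote != '"' && quote != '\'')
        return 0;
    start = ++p;
    while (p < end && *p != quote)
        p++;
    if (p >= end)
        return 0;               /* closing quote missing */
    n = (size_t)(p - start);
    if (n == 0 || n >= outsize)
        return 0;
    memcpy(out, start, n);
    out[n] = '\0';
    return 1;
}
```

An input routine built on this would read the first four bytes of a
file, call sniff_encoding on them, and only in the ENC_ASCII_COMPATIBLE
case go on to call read_encoding_decl on the buffered declaration.  It
would then rewind the file and record the result once per file, as the
message describes, so that later reads only consult the stored entry.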