Re: Request for info about parser construction details.
From: Jose Luis Sierra Rodriguez <jlsierra@s...>
>> I would like to find some information about technical details regarding the
>> developing of parsers for SGML and XML.

Writing a general parser for SGML is difficult because SGML is more like a compiler-compiler (like YACC) than a language per se.

First there is an "SGML declaration" language, in which you specify which character sets and mappings you are using, allocate characters to various abstract roles (which ones can be separator characters, which can be used in references), define various concrete delimiters (such as start-tag open), and set the general parsing features of the language (if you leave out certain tags, or just have them in reduced form such as <> or </>, what rules does the parser use to fill in the gaps?).

Then there is a second language for markup declarations (DTDs). It tells you not only which elements and attributes can go where, but also which elements can have start-tags or end-tags omitted and at which points "short-reference" maps come into play (where some string of characters you specified in the SGML declaration can be used in place of an entity reference and add some tag). It also declares entities and the attributes that can appear on entities (keyed by the entity notation).

Finally comes the instance language itself, and it is also not straightforward: the entity structure is not synchronous with the element structure, and potentially you can have subdocuments with completely different DTDs nested inside (like namespaces but with their own ID scope). You also have to keep track of more things, such as the current value of attributes marked #CURRENT (the attribute takes the most recent value in document order if it is not specified) and global exclusions (such as that an <a> cannot contain an <a>), which the DTD has special structures for. And in the DTD and the instance there are entity references, which will not necessarily fit in with the simple-minded approach to grammars that a beginner might hope for.
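To make the #CURRENT bookkeeping concrete, here is a rough Python sketch of the state an instance parser has to carry. The function name `resolve_current`, the event shape, and the choice to key the remembered value by (element, attribute) are assumptions made for illustration; a real SGML parser takes this information from the parsed DTD and follows the detailed rules of the standard.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical event shape: (element name, attributes written on the start-tag).
StartTag = Tuple[str, Dict[str, str]]


def resolve_current(
    start_tags: List[StartTag],
    current_attrs: Dict[str, Set[str]],
) -> List[StartTag]:
    """Fill in attributes declared #CURRENT with their most recent value.

    current_attrs maps an element name to the attribute names that the DTD
    (assumed already parsed) declares as #CURRENT for that element.
    """
    last_seen: Dict[Tuple[str, str], str] = {}  # state carried across the instance
    resolved: List[StartTag] = []
    for name, attrs in start_tags:
        effective = dict(attrs)
        for attr in current_attrs.get(name, set()):
            key = (name, attr)
            if attr in attrs:
                # An explicit value becomes the new "current" value.
                last_seen[key] = attrs[attr]
            elif key in last_seen:
                # Omitted: inherit the most recent value in document order.
                effective[attr] = last_seen[key]
            else:
                # Assumption: treat a missing first value as an error.
                raise ValueError(f"no current value yet for {name}/@{attr}")
        resolved.append((name, effective))
    return resolved


if __name__ == "__main__":
    events = [("p", {"lang": "en"}), ("p", {}), ("p", {"lang": "fr"}), ("p", {})]
    for name, attrs in resolve_current(events, {"p": {"lang"}}):
        print(name, attrs)
```

The point is only that the value of an attribute on one element can depend on start-tags seen much earlier, so the instance parser cannot treat each tag in isolation.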
So full SGML really requires three separate parsers, each supplying lots of parameters to the next. This is because full SGML was designed to allow clear description of lots of different kinds of markup languages, not just well-formed ones. (Actually, you do not need to support variations in the SGML declaration to be conforming SGML, as long as you document what you provide in an SGML declaration and support at least the minimum Concrete Reference Syntax, which gives default rules that are closer to HTML's requirements and are too restrictive. An XML system is not a "conforming SGML system" but it is an "SGML system"; these are technical terms of conformance which may bore and confuse people.)

So can you see why XML was invented? Instead of Charles Goldfarb's unhappy and forced starting position that people could never agree on syntaxes (see MS' versions of HTML dumped from recent software, and SML-DEV, for recent evidence of this), Jon Bosak started from the idea "what if we could get everyone to standardize on a particular profile of SGML... then we wouldn't need highly parameterized document description languages (or at least the description would be made once for all by the profile-creators, not by every user) and simple parsers could be written". The breakthrough in XML is not the technology (lots of people have been doing stripped-down SGML for years) but the consensus Jon was able to get up. (Of course, Jon could not have gotten that agreement without there being a lot of lessons learned from full SGML concerning which features are most useful.)

XML says: freeze the SGML declaration (see James Clark's note at W3C for this). Have character encoding handled by the entity manager and adopt Unicode as the document character set. Get rid of any features which require the instance parser to accept parameters from the markup declarations. Make the markup declarations optional to use. Make entities nest with elements. Etc.
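As a rough illustration of what those simplifications buy you: with fixed delimiters and no tag omission, element nesting can be checked with a plain stack and no input from any declarations. The Python sketch below is deliberately minimal and is not an XML parser. The function name `check_nesting` is made up for this example, and it assumes input containing only start-tags, end-tags and empty-element tags (no comments, CDATA sections, processing instructions, entity handling, or ">" inside attribute values).

```python
import re
from typing import List

# Start-tag, end-tag or empty-element tag; attributes are skipped, not parsed.
TAG = re.compile(r"<(/?)([A-Za-z_:][\w.:\-]*)[^<>]*?(/?)>")


def check_nesting(text: str) -> List[str]:
    """Return element names in document order, or raise on bad nesting."""
    stack: List[str] = []   # currently open elements
    seen: List[str] = []    # every element encountered, in order
    for match in TAG.finditer(text):
        closing, name, empty = match.group(1), match.group(2), match.group(3)
        if closing:
            # An end-tag must match the innermost open element.
            if not stack or stack[-1] != name:
                raise ValueError(f"unexpected </{name}>")
            stack.pop()
        else:
            seen.append(name)
            if not empty:
                stack.append(name)
    if stack:
        raise ValueError(f"unclosed <{stack[-1]}>")
    return seen


if __name__ == "__main__":
    print(check_nesting("<doc><p>hi<br/></p></doc>"))
```

Nothing in this check consults a DTD, which is the contrast with SGML, where whether a missing end-tag is an error depends on the markup declarations.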
XML is designed to be straightforward to implement, with little connection between its two languages (declarations and elements). So if you want a one- or two-week project, implement XML. If you want a six-month to one-year project, implement SGML. Don't try to implement SGML unless you have Goldfarb's "The SGML Handbook" (and you will probably understand more of XML with that too).

Cheers
Rick Jelliffe