Specifying formal semantics in XML languages
[I do not, I'm afraid, read xml-dev regularly, so please forgive me if this covers recent discussions or if I am simply out of touch.]

I am struggling with how to continue formalizing the semantics of Chemical Markup Language (CML). The issues are generic and no chemistry is required to understand them. They bear some relationship to "microformats" but involve issues of strong typing and code generation, so "context-free objects" is more descriptive.

Currently CML (in XSD) consists of about 100 elements, 100 attributes and about 100 simpleTypes (e.g. elementTypeType is one of 117 symbols ("H", "He", ...) and angleType is an xsd:double in the range 0-180). The current components (in XSD syntax) are at:

http://cml.cvs.sourceforge.net/cml/schema25/

Chemistry is a largely context-free discipline in that we can locate (say) <molecule> in many places in a document. There are a very large number of ways of using CML components but the major ones in current practice are:

* compound documents (e.g. scientific publications) composed of a range of markup languages (XHTML, SVG, MathML, CML, etc.). Many publishers are now actively starting to adopt this approach. A key feature is that data and text are mixed ("datument") so that we can transmit data in primary publications. Machines can now start to understand scientific publications.
* storage of (fairly) well-defined data objects (molecules, spectra, etc.) in databases
* management of the chemical computational process, including formal semantics of objects in the program.

This is sufficiently broad that it is impossible to create a traditional XSD schema which allows for all uses. Since CML continues to evolve, we cannot guess a complete schema description and then constrain people to use it. However, since almost all CML must be processed by specialist software at some stage, we require conformance to a specification.
Moreover, there are no user communities who require all CML functionality at once, so we assume that particular groups will use subsets of the language.

We have used XSD rather than Relax because the conventional world feels happier with it (sorry), and because we have to provide software support for CML - there are more reusable components generally available for XSD. However we only use a small subset of XSD syntax (basically the stuff I can understand), limited to:

* definition of elements containing explicit complexType and references to element children
* definitions of types
* definition of attributes

There is no single schema, but users can choose which subset of CML elements they wish to use. This is simply done by concatenation of the components (we deliberately do not use xsd:import).

The specification is used for the following:

* validation of documents
* (complete semantic) documentation of the language (IOW the specification should be a machine-understandable description of the language). It is inspired by the ideas of literate programming and will use <appinfo> etc. This is not complete and this mail is to seek guidance.
* generation of code. This is critically important as all elements have to have classes, and all attributes have to have typed accessors and mutators. Although we could use Castor, XMLBeans, etc. for Java, we have to support Python, C++ and FORTRAN, so I have written our own code generator to provide this. Of course there is much chemical functionality that is not provided by a semantic specification and this has to be handcoded on a per-element basis.

At present XSD is used for the specification of CML, although we have also attempted to use Schematron and XSLT-like expressions for some of the constraints that cannot be expressed in XSD. (XSD is good for formal documents such as tax forms but it is poor for the evolution of a scientific language.)
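
As a rough illustration of the "typed accessors and mutators" mentioned above, here is a minimal Python sketch of what generated code for an attribute of type angleType might look like. The class and method names (SimpleType, AngleElement, set_value, get_value) are hypothetical and are not taken from the actual CML generator, and treating the 0-180 range as inclusive at both ends is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SimpleType:
    """Hypothetical in-memory form of an XSD simpleType restricting xsd:double."""
    name: str
    min_inclusive: float  # assumption: lower bound is inclusive
    max_inclusive: float  # assumption: upper bound is inclusive

    def parse(self, text: str) -> float:
        """Convert lexical text to a float and enforce the declared range."""
        value = float(text)  # raises ValueError for non-numeric text
        if not (self.min_inclusive <= value <= self.max_inclusive):
            raise ValueError(
                f"{text!r} is outside the {self.name} range "
                f"{self.min_inclusive}-{self.max_inclusive}"
            )
        return value


# angleType as described in the post: an xsd:double in the range 0-180.
ANGLE_TYPE = SimpleType("angleType", 0.0, 180.0)


class AngleElement:
    """Stand-in for a generated element class with one typed attribute."""

    def __init__(self) -> None:
        self._value: Optional[float] = None

    def get_value(self) -> Optional[float]:
        """Typed accessor: returns a float, or None if the value is unset."""
        return self._value

    def set_value(self, text: str) -> None:
        """Typed mutator: rejects text that does not conform to angleType."""
        self._value = ANGLE_TYPE.parse(text)
```

Calling set_value("109.5") stores the float 109.5, while set_value("181") raises ValueError, so the datatype constraint is enforced at the point where the value is set rather than in a later validation pass.
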
Currently we find:

* most of the datatyping can be done with simpleType and this works well - there is no reason to change most of this
* we find little use (at present) for re-usable complexTypes.
* XSD content models are effectively useless for validation. They rapidly become enormous for some elements and no-one would use them.
* there are many simple relationships that cannot be expressed in XSD.

There are no cases where we insist on the order of child elements (I can never remember the order anyway so it's unfair to require others to). There are also very few cases where the cardinality of children matters (wherever we have tried these we have come up with counterexamples). We forbid mixed content in CML and so elements are of 3 types:

* empty
* one or more element children
* one text child

(If CML requires running marked-up text we use <xhtml:div> or similar.)

Currently the attributes and content models are used to generate code. Thus <propertyList> can have (say) a title attribute, and children such as <metadataList> and <property>. This generates code such as:

    PropertyList.setTitle(String title)
    MetadataList PropertyList.getMetadataList()
    PropertyList.add(Property)

This is enormously valuable when programming as it helps to ensure strong typing and provides prompts and checking when writing code.

Therefore we continue to need a specification that describes the relationship of one element to another and, where appropriate, supports the generation of code. Here are some examples of relationships which I currently need to express and which should, if possible, be enforceable in code.

* element must have a parent from (list...)
* element may have a parent from (list...)
* element must not have a parent from (list...)
* element may have children from (list...) (and this will generate code)
* element must not have children from (list...)
* element may have either a foo attribute or a <foo> child, accessible through a single getFoo() method
* element must have either a foo attribute or a bar attribute.
* Many elements are of the form <foo ref="a1"/>. In this case an element <foo id="a1"/> must occur within the document. We do not use XML-IDs for this as we cannot rely on the documents having unique ids. (Some of our algorithms find the "nearest" element with a given id.)
* Values may be required to be distinct. Thus in <foo refs="a1 a2 a3 a4"/> all values in the list must be distinct. (This sort of thing takes half a page in Schematron.)

(There is also a need for chemical restrictions and validations, but I omit these here.)

I am therefore looking for a way of specifying semantics of this type in <appinfo> elements on some or all elements. It is important that the semantics are not procedural (we cannot assume that the users have Python, etc.). There is currently no requirement for speed, so XSLT is a possible solution, although it is very difficult to evaluate scientific functions in it.

I believe that there could be value in a lightweight declarative language in <appinfo> elements which would support validation *and code generation*. If this already exists that would be wonderful - if it doesn't, I hope the above makes sense.

P.

Peter Murray-Rust
Unilever Centre for Molecular Sciences Informatics
University of Cambridge,
Lensfield Road, Cambridge CB2 1EW, UK
+44-1223-763069