XML-DEV Mailing List Archive
Re: Why datatypes?
From: "Gustaf Liljegren" <gustaf.liljegren@x...>

> Ever since XML Schema started to evolve and the talk about datatypes in XML
> took off, I've been wondering secretly why XML validation needs the concept
> of datatypes at all. XML is a plain text format, so content validation in
> XML should be no different from regular pattern matching. Or why should it?

Well, for a start "XML" has no needs, not being a person! It is users who have needs, so the focus of any question has to be users.

Some users do not need datatypes. For example, people who are just sending strings in graphs to each other in their documents. Publishing tends to this extreme. Or people who are confident that their data values are valid (because the data was validated at capture and the recipient trusts the sender). Or people who have such high transaction rates that they cannot really afford any more checks, or who have checks already built in by subsequent stages.

But many other users do want datatyping, because they want to perform QA on outgoing data or QC on incoming data. For editing, datatype checking can allow friendlier messages, so that problems can be fixed at source rather than requiring technical personnel in the middle of the chain (or worse, at the far end) to make the data right. Programmers will be aware of the big trend towards programming-by-contract (in Bertrand Meyer's nice term), which has seen assertions added to Java 1.4: datatyping (and validation languages in general) is useful for making invariants explicit, and for unit testing.

Datatypes are useful for more than just validation. If the datatype aligns with a "storage type" (e.g. it constrains a number's value space to whole numbers 0-255, which will fit into a byte), it can be used to drive interface builders: for example, to make a schema-specific DOM that stores data very efficiently. If the datatype expresses its semantics (e.g.
"this is a date") then it allows conversion between different lexical forms (e.g. US Gregorian date to Australian Gregorian date) and translation between different value spaces (e.g. between the Gregorian calendar and the Islamic calendar, assuming for the sake of argument that they are different value spaces).

So you can see that there are actually categories of datatyping:

* value-constraining
* storage-aligned
* semantic

and that there is no universal agreement (or reason to expect or want one) on which is better or best or appropriate or wrong. Even the issue of "should these be separate layers or should these be mixed?" has no consensus. For example, in the W3C XML Schema specs we find strings (value-constraining), bytes (storage-aligned) and dates (semantic). But at W3C we also find RDF Schema, which is much more concerned with (a framework for) semantics.

Another aspect of datatyping is whether to express it declaratively or functionally: do you say "this is positiveNumber" or "this is a number > 0"? In the first case, which is more declarative IYKWIM, a system can easily figure out the value constraints, the storage alignment (and perhaps the semantics). You can use Schematron for lots of datatyping, but it is functional, not declarative, in that sense: one of the reasons XML Schemas builds in so many derived simple types is to make it easier to figure out the storage alignment and semantics. (The proof of the pudding is always in the eating, of course.) So Schematron datatyping is good for validation but not much use for figuring out efficient storage structures (of course, this was not a goal!).

It is tempting to conclude from the above that "some people need less datatyping, some people need more; some people need just lexical typing, some people need value typing, some people need storage or semantic datatyping". That is true as far as it goes, but it hides two essential points, which are at the heart of the datatyping problem.
Your answer to these will largely determine many technical choices you make:

1) Should datatyping be proscriptive or descriptive?
2) Is there structure inside data values?

PROSCRIPTIVE or DESCRIPTIVE

The proscriptive approach is exemplified by XML Schemas (though tempered for practicality by its derivation facilities). It says "you can only use one lexical form, and we supply a comprehensive list of built-ins; anything outside that you simulate using regex checking and providing your own validators". People who favour the proscriptive approach tend to feel that users are always shielded from actual XML values by user interfaces, so in a sense a lot of the value comes from everyone standardizing on the same set of types rather than from the completeness of the types themselves.

The descriptive approach says "I have data in a particular preferred lexical form, and I want markup to describe it". In this view, users may edit the XML as text, or have only thin interfaces where the user types the value directly. The documents may well be stored in text files where there is no mediating infrastructure to perform conversions. For example, "You want to send in your American documents <usDate>12/31/02</usDate> and I want to send my Australian documents with <auDate>31/12/02</auDate> and we want the receiving system to validate them both as dates, and allow mixtures and comparison."

NON-XML STRUCTURE

The no-structure approach is exemplified by XML Schemas (though tempered for practicality by lists and unions). The view can be characterized as "We only need to worry about explicit XML structure." In other words, only elements are of interest for validation.

The non-XML structure approach says that the idea that there is only element structure containing atomic types flies in the face of how people actually use (and want to use) XML.
It is a 3rd normal form assumption that can be refuted merely by looking at almost any real DTD (other than a DTD used for data transfer to or from a DBMS): for example, XHTML, SVG, XSL-FO, etc.

In this view, the XML Schemas division between simple types and complex types is weak: there is a missing level of non-XML structures which XML Schemas will either model badly (as strings), or model as if they were simple types (such as gDates, and therefore get into trouble), or not model at all (such as measure="1cm 2inch 3 em").

In this view, there are actually probably very few primitive datatypes [numbers, boolean, string, symbol?] but a variety of tokenizing rules (space-separated, Unicode-block-separated, COBOL-style pictures, punctuation-separated, etc.). This is an area of interest to me: it would be interesting to analyse the attribute values in a spectrum of publishing/scientific languages (SVG, XHTML, XSL, etc.) to see whether there is, in fact, a great variety of tokenizing types or whether there are only a handful of parameterizable types (i.e. to avoid going to regular expressions or parsers).

(The presence of non-tag structure is actually built into ISO SGML: you can, inside an element, declare a map which recognizes certain strings as delimiters that introduce or separate structures. So this is not some fancy wishful thinking, but something that XML gave up for parsing simplicity. I am not trying to reintroduce SHORTREF into XML! But there is data that has structure that we want to validate but not split into different information units: dates and URLs are good examples.)

I hope this is of some use.

Cheers
Rick Jelliffe
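
The <usDate>/<auDate> example in the descriptive approach can be made concrete with a minimal Python sketch, which is an illustration under stated assumptions and not part of the original message. Each element name is tied to its own lexical form, and both forms are mapped into one shared value space so they can be validated and compared. The LEXICAL_FORMS table and the to_value function are hypothetical names invented for this sketch; two-digit years are resolved by Python's strptime convention, which is an assumption about what the senders intend.

```python
from datetime import date, datetime

# Hypothetical table: element name -> strptime pattern for its lexical form.
LEXICAL_FORMS = {
    "usDate": "%m/%d/%y",  # month/day/year, as in <usDate>12/31/02</usDate>
    "auDate": "%d/%m/%y",  # day/month/year, as in <auDate>31/12/02</auDate>
}


def to_value(element_name: str, text: str) -> date:
    """Check text against the element's lexical form and return a date value.

    Raises KeyError for an unknown element name and ValueError when the
    text does not match the declared lexical form.
    """
    pattern = LEXICAL_FORMS[element_name]
    return datetime.strptime(text.strip(), pattern).date()


# Two different lexical forms, one value space: the values can be compared.
us = to_value("usDate", "12/31/02")
au = to_value("auDate", "31/12/02")
print(us == au)
```

Feeding the Australian form to the American element, as in to_value("usDate", "31/12/02"), raises ValueError because 31 is not a valid month, which is the kind of check at source that the descriptive approach asks for.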