XML-DEV Mailing List Archive
Re: Validation
On Thu, Mar 18, 1999 at 01:26:42PM -0600, Paul Prescod wrote:
> Chris Lilley wrote:
> >
> > Unfortunately I came across EBNF long before I came across DTD syntax,
> > so about half an hour after meeting DTDs I was, like, what do you mean
> > it can't express that this attribute is a URL? Why can't it express that
> > this attribute is an ISO standard date?
>
> I can guarantee you today that the XML schema effort will not allow you to
> express everything that EBNF will, so if that's your standard it will fail.
> But even if we use EBNF as our standard: do you know of any programming
> languages expressed entirely in EBNF? Or even entirely in *any formalism*?
>
> > Yes, validation is important - and I mean real validation, with no
> > critical-path human-readable comments in the DTD and multiple utilities
> > to check different aspects of validity (like separate scripts to ensure
> > that an attribute is a valid date or customer number).
>
> It will never be the case that it will be possible to write schemas that
> are so tight that they remove the need for comments that describe
> additional constraints to other human beings. There will always be a need
> not only for multiple schema languages but also for the ultimately
> flexible schema language: prose text.

At the risk of sounding repetitious, an analogy may be of use here. One mark of a good programming language is strong compile-time checking (type safety, pre- and post-conditions, and invariants are typical measures of this). Users of such languages typically characterise them by exclaiming that programs, once compiled, usually work the first time. Of course no-one would argue that this is a bad thing. It would be silly, however, to take this as evidence that such languages are bug free. There will never be a bug-free programming language, for the simple reason that no programming language can guess the desired semantics if you get them wrong.
No language can stop you from trying to implement sort and ending up with reverse sort! All it can do is prevent you from ever exhibiting undefined behaviour (post-conditions can make it painfully difficult to do something so silly, but I doubt they could do so at compile time). Languages will never be capable of fully expressing what we want to achieve in a declarative way. Declarative languages such as SQL do provide a very elegant and expressive mapping between intent and code, but they typically address a very narrow and well-defined problem domain. No such silver bullet has been discovered for programming in the large.

The purpose of this analogy is to illustrate what I believe to be the same situation in the notion of a DTD. A DTD defines the language used to express a certain data domain. This language provides some constraints on what constitutes a legal piece of data. However, just as a programming language can never fully express the intent of the user (i.e. it must always include procedural elements which implicitly rather than explicitly embody the intent), so too can a schema language never express the full set of constraints one might wish to impose on a document.

It is easy to come up with trivial cases that demonstrate this: imagine a document class in which the number of paragraphs inside the nth section must be less than or equal to the nth Fibonacci number; or another in which the content model of a CONTENT element is defined in the PCDATA of the preceding MODEL element; or how about one in which the maximum depth of the element hierarchy is defined by the ASCII value of the 100th character of the file stored at a given URL! Yes, these are pathological examples, but the point of illustrating with extremes is to make obvious the fact that no matter how sophisticated your schema language, it will never be able to handle all contingencies.
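To make the point concrete: even the deliberately pathological Fibonacci constraint above is trivial to check procedurally, in a few lines of general-purpose code, while remaining outside what any fixed schema language can express. A minimal sketch (the element names `section` and `para` are my own invention for illustration):

```python
# Sketch of the "nth section may hold at most the nth Fibonacci
# number of paragraphs" constraint, checked procedurally rather
# than by a schema language. Element names are hypothetical.
import xml.etree.ElementTree as ET

def fib(n):
    """nth Fibonacci number, 1-indexed: 1, 1, 2, 3, 5, ..."""
    a, b = 1, 1
    for _ in range(n - 1):
        a, b = b, a + b
    return a

def check_fibonacci_sections(xml_text):
    """Accept the document only if the nth section contains
    at most fib(n) paragraphs."""
    root = ET.fromstring(xml_text)
    for n, section in enumerate(root.findall("section"), start=1):
        if len(section.findall("para")) > fib(n):
            return False
    return True

# One conforming and one non-conforming document:
ok_doc = "<doc><section><para/></section><section><para/></section></doc>"
bad_doc = "<doc><section><para/><para/></section></doc>"
```

The check is a couple of lines of ordinary code, but capturing it declaratively would require the schema formalism itself to be able to compute an arbitrary numeric sequence, which is exactly the slide towards Turing completeness discussed below.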
There will always be someone, somewhere, for whom the available schemas simply don't express the semantics they need. If you seriously want to address the widest possible set of schema language requirements, EBNF doesn't come close; you would need Turing completeness just as a starting point.

For those who would prefer a more concrete case in point, I adduce the very example Chris Lilley provided, that of validating a customer number. What if a valid customer number is a composite of the customer's priority level, her date of birth, and a sequence number? How would you express that the 3rd to 10th digits constitute a date in the format yyyymmdd? This would require a language with built-in functions for type conversion, substring extraction and date composition. What if people born after 1990 couldn't have a priority level higher than 25? This would require branching constructs.

It is quite common for things like customer codes, site codes, product codes, etc. to have a composite structure, particularly in legacy systems. Often the parts are mutually dependent in non-trivial ways. Sometimes validity can only be checked by consulting an external, volatile source of information, such as a database (e.g. "Customer code is invalid if the first two digits do not appear in the CODE field of a record in the PRIORITY table...").

This is not, despite appearances, an exclamation of hopelessness; rather, the point is that completeness is not necessarily a desirable goal. If you insist on developing a schema language that handles all the validation requirements of any conceivable data domain, you are aiming not only for an impossible goal, but for one which, if it could be attained, would be so complex, so arbitrary and so unwieldy that no-one would want to use it. In real life, multiple subsystems are brought into play when validating and transforming data.
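The customer-number rules above (substring extraction, date conversion, and the post-1990 priority cap) illustrate exactly the kind of check that falls naturally out of a general-purpose language. A hedged sketch, with the field layout assumed purely for illustration (digits 1-2 = priority level, digits 3-10 = date of birth as yyyymmdd, the remainder = sequence number):

```python
# Hypothetical composite customer-number check. The field layout
# and the ">25 priority for post-1990 births" rule are assumptions
# taken from the prose above, not a real system's format.
from datetime import datetime

def validate_customer_number(code):
    """Priority (2 digits) + date of birth yyyymmdd (8 digits)
    + sequence number (remaining digits)."""
    if len(code) < 11 or not code.isdigit():
        return False
    priority = int(code[:2])
    try:
        born = datetime.strptime(code[2:10], "%Y%m%d")
    except ValueError:
        return False  # digits 3-10 do not form a real calendar date
    # Branching constraint: people born after 1990 cannot have
    # a priority level higher than 25.
    if born.year > 1990 and priority > 25:
        return False
    return True
```

Note how much of this (calendar-aware date parsing, integer comparison, conditional logic) sits well outside what a content-model formalism offers; a schema language that subsumed it would be a programming language in all but name.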
No single component can know everything about a piece of data, and certainly not enough to definitively ascertain validity in all its semantic richness. In fact, the full set of constraints may not even exist in one environment, but may be distributed across multiple independent subsystems, or even across multiple hosts. For instance, a timesheet workflow system may operate as follows:

1. Employee fills in an XML timesheet and submits it to the accounts server.
2. The accounts server validates the project number field against projects the employee is permitted to book against, and passes the document on to the personnel server.
3. The personnel server knows that the employee doesn't work on weekends and Tuesdays and checks for this. It then dispatches the document to the repository.
4. The repository performs basic DTD validation, which the other two servers probably did anyway.

The DTD (or XML Schemas) can provide a basic level of validation, but there will always be more to do. And this is not a problem if you accept that there may be multiple stages to the process, probably involving multiple languages and environments.

The complaint might be raised that the examples I have given mostly involve table lookup and therefore belong more properly in the domain of referential integrity maintenance, but this is not necessarily so. For instance, the accounts server may know that the project number and employee number must begin with the same two digits (due to the organisational structure) unless the project number begins with 99 (which represents admin codes). Furthermore, in the case of a complex customer code, validation may involve table lookup, but it is not with a view to ensuring that the customer code refers to an existing record, and hence is not a referential integrity constraint (the constraint could even be revised to: "Customer code is invalid if the first two digits do not appear in the CODE field ...
_and_ the date of birth is after 1990," in which case the record could be valid even though the lookup did not find any matching records).

Quite apart from the problem of intractability, there is the equally important issue of parsimony. For many purposes, a fully expressive language is more than one needs. Consequently, the user is forced to learn a complex environment to perform a simple task. This is why a language like CSS is in no danger of being superseded by XSL. It doesn't express everything XSL (or DSSSL) can, but it is simple. An average hack Web master can come to grips with CSS in a matter of minutes, and can be using it to good effect within half an hour. Not to mention the fact that CSS is just plain easier to read (I hear much debate about whether it is appropriate for humans to edit XML directly, but I haven't heard anyone suggest that XSL should be machine generated; I wonder about this from time to time). For that matter, XSL and DSSSL can't express every conceivable typography requirement either.

Another concrete example comes to mind in the domain of configuration files. I have played around with moving our configuration file format (which is a little ugly at present) to XML. I was horrified at the result and am now looking far more seriously at something like .INI files. The .INI format may not be intrinsically hierarchical (and hence is less expressive), but it is much simpler, and much easier for a human to read and manipulate.

Likewise, DTDs and XML Schemas will offer differing levels of constraint specification, but neither of them (nor any future language) can express every kind of validation rule that people will want to express. Life is simply too complex for that to be possible (more specifically, real life is arbitrarily complex, and hence so are the systems that try to model it).

> Luckily, eliminating all other schema languages is not a goal of the W3C
> schema language effort.
>
> > So what is critically needed is a real, namespace-aware, schema
> > language that can be used to do real validation.
>
> I hear a lot of users saying that. They don't seem to realize that there
> is no such thing as "real validation" there is only "the validation I need
> to do today." Ten years from now, we'll be griping that XMLSchemas don't
> do "real validation" for some other arbitrarily advanced definition of
> "real."

I heartily concur. There is no silver bullet, so it is a waste of time looking for one. The focus should be on developing standards that solve today's problems today, with an eye to leaving room for future wisdom without being prescriptive.

Of course, none of the above discourse will eliminate the need for discussion on what, exactly, is needed and how that need is to be satisfied. As one colleague astutely pointed out to me, I am really transforming the issue from "real validation" to "sufficient validation". It would be a mistake, however, to conclude that this is a trivial transformation in the statement of the problem. It diverts the emphasis of the search markedly away from completeness and towards practicality and usability (of course, completeness remains desirable, it merely ceases to be a central goal).

Cheers,
Marcelo
--
http://www.simdb.com/~marcelo/

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@i...
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To (un)subscribe, mailto:majordomo@i... the following message; (un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo@i... the following message; subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa@i...)