Re: SAX for Binary Encodings (SAD-SAX)
Simon St.Laurent wrote:

> I'm sorry, Alaric, but this is the classic story that's done so much to
> pollute XML and turn what was once a pleasant simplification into an
> industrial-strength nightmare. That it's frequently told by people who
> believe it doesn't do anything to help it.

Ok;

> I wish you could have been at the Extreme Markup Languages conference
> when Jeni Tennison gave a presentation on the impact of typing on XSLT
> and XPath 2.0. As C. Sperberg-McQueen summarized it, "I was watching
> all these faces, all of them asking 'if Jeni Tennison can't deal with
> this, how am I ever going to?'"

I think that the difference in typing between XSLT/XPath 1 and 2 is more about the fact that they ripped out the old XPath type system (it had its integers and strings and booleans and stuff) to replace it with an XML Schema-compatible one, than just *adding* typing. They didn't add stuff - they CHANGED stuff! So the old stuff isn't there any more!

One perspective on this could be that switching from XSLT 1 to XSLT 2 is like turning on an option - the programmer chooses to do so if they want to, otherwise sticks with XSLT 1.0 - but the unfortunate consequence of the difference being in a *version number* rather than an *option flag* is that everyone assumes that 2.0 must be inherently better than 1.0 :-(

> 1) We don't all get to choose. We don't all get to choose our tools,
> and even fewer of us get to choose the data we work with. As these
> things spread across the landscape, they become unavoidable.
>
> All of the tools I write for processing XML now support namespaces.
> That isn't because I think namespaces are a good idea - in fact, I think
> they were the first sign that the people running XML had no clue what
> they were doing. I support them because I have to, both to make my
> tools usable by others and because I have to deal with namespaced
> information. I create it myself sometimes, a habit I got into when
> using other people's tools.

Ok.
In order to see if this same pattern could cause problems with a typed extension to SAX, I'm going to try to map the namespace issues onto it.

Namespaces became everyone's problem because all the important XML vocabularies started using them extensively, right? And because of the transfer syntax of namespaces - with the prefixes - a processor that isn't namespace-aware really can't make much sense of namespaced elements and attributes, since they have this random prefix shoved on them most of the time, and non-namespace-aware applications would be using literal string comparisons between element names and constants such as "first-name" to see what element was what.

Namespaces still aren't a problem for applications dealing with XML vocabularies that don't use namespaces - it's just that there are very few of those.

Part of the problem is down to the syntax used for namespaces; perhaps it would have been better if the Namespaces rec didn't introduce prefixes, but instead worked along the lines of:

1) Attributes don't get namespaces, only elements do
2) The attribute xml-namespace="URI" means that the containing element is in that namespace, and so are all its children unless another declaration states otherwise

That way, a non-namespace-aware application would still be able to rely on the first name element being called "first-name", making it more backwards compatible at the cost of greater verbosity due to repeating that namespace URI every time you switch namespaces (ugly in XSLT for example...).

This is clear in hindsight. I'm sure that if the Namespaces rec authors had thought about backwards compatibility they would have come up with something similar, however. I presume, therefore, that they were not particularly worried about non-namespace-aware applications, for one reason or another.

SO - learning from the mistakes made with Namespaces, what lessons can we take into account when doing a feasibility study of a type-aware SAX?

"Really really think about what life will be like for people who don't want to use your optional extension, *even when they border with systems that DO*."

Now, since this is just an API extension, it will have zero effect on the interchanged bits on the wire, so we needn't worry about issues there. All we need to ensure is that applications that do not need the extension are totally free of any need to change if their SAX parser starts providing the option. Luckily, the SAX people are smart - they use URIs in strings to identify extensions in a way that avoids these issues.

Is there a danger that, like with namespaces, lots of important XML vocabularies might start to depend on this SAX extension in such a way that applications are forced to use it to work with them?

That's more of a potential issue, but the SAX extension just automatically handles something that you're already manually doing anyway - parsing strings to get dates/integers/whatnot. You can still do it manually if you wish, meaning that the optional extension is not the only way to read in dates and so on - so it can't become a dependency if it's trivially removable. Lots of XML specifications already rely on *something* parsing integers, since they represent integers in decimal in XML!

Perhaps the biggest danger is that there might be a slow creeping wave of highly complex syntaxes used in XML content - like SVG path expressions, XPaths and so on - and that everyone gravitates towards writing parsers for these as part of typed SAX parsers. So after a while, to parse XPath, you have the choice of:

1) Use a typed SAX parser, which will return you an abstract syntax tree for the XPath expression - thus indirectly forcing you to use the typed SAX parser for all of your document whether you like it or not!
2) Write your own XPath parser from scratch

Prevention of (1) is why I agree with the original poster's idea of having an option to the SAX engine to make it return *both* the original unmolested text *and* its attempt at parsing it. So you can just not use the parsing part (and ideally prevent it, in one of many ways, from wasting its time parsing stuff you'll ignore) and keep accepting the plain characters for most of your application, while using the parsing part where you need it.

This is a strong argument FOR making this an extension to SAX rather than a new API - if you had to switch to a totally new API for all of your XML reading to parse XPaths, changing bits of your code that really needn't change, that would [expletive deleted]!

> 2) Communicating expectations is harder than communicating data. Good
> documentation and schemas can provide more information, but there's a
> lot of experience behind "loosely coupled" vs. "tightly bound",
> especially where participants are widely distributed.

Yep - that's why I suggested the API handle the lack of type information, or the failure of type information to match what's in the document, by falling back to the existing SAX behaviour, in order to avoid this problem.

> 3) Bad ideas that start in one place frequently wander elsewhere. W3C
> XML Schema is probably the classic example of this. It's widely
> despised, even at conferences - like last summer's Applied XML show -
> where everyone claims to need that kind of tool. Nonetheless, it
> continues to make life difficult for people from Word users to data
> binding implementers to XSLT developers.

Yeah :-( The problem here, of course, is the original badness of the idea combined with the unfortunate fact that it was proposed by a voice of authority.
However, good ideas from voices of authority ALSO tend to spread :-)

> I'm happy to see ASN.1 working to make itself more accessible to
> developers with different expectations, and I'm still happy to see ASN.1
> at work for people who actually want schema-first tightly-coupled
> development. I'm not happy to see ASN.1-flavored proposals for
> revamping XML APIs because they don't fit ASN.1 expectations. Building
> bridges between the two worlds is good, but there's definitely a limit.

Think about the usefulness of typed SAX beyond ASN.1, however - typed SAX events could be generated from an XML document with reference to a schema in the schema language of your choice.

> XML has suffered enough here from types that you might want to pack up
> that circus wagon and find another freak show where it'll be more
> welcome. Please don't tell that bogus story about types being a
> harmless option if you want me to take you seriously.

How has XML suffered from types? As I see it:

1) The official language for attaching types to XML [expletive deleted]
2) This has had knock-on effects, such as the XPath/XSLT type system changing to align with XML Schema

But types in XML are *still* an optional add-on, in ways that namespaces aren't! You only *need* to write code that knows anything about XML schema languages if you're writing a schema validator or an XSLT 2.0 engine, right? You can ignore references to schemas and xsi:type attributes to your heart's content, and your application that reads XML purchase orders and handles them will still be able to work, yes?
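As a rough illustration of that last point, here is a minimal sketch of a purchase-order reader built on Python's standard xml.sax module. The document, its element names (order, item, quantity) and the schema URL are all invented for the example. The handler never looks at the xsi:type or schema location attributes; it converts the quantity by hand, which is exactly what an application without any schema knowledge would do.

```python
import xml.sax

# Hypothetical purchase order; element names and schema URL are made up.
PURCHASE_ORDER = b"""<order xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:noNamespaceSchemaLocation="http://example.org/order.xsd">
  <item sku="A100"><quantity xsi:type="integer">3</quantity></item>
  <item sku="B200"><quantity>12</quantity></item>
</order>"""


class OrderHandler(xml.sax.ContentHandler):
    """Collects (sku, quantity) pairs and ignores every xsi:* attribute."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._sku = None
        self._text = None  # list of text chunks while inside <quantity>

    def startElement(self, name, attrs):
        if name == "item":
            self._sku = attrs.get("sku")
        elif name == "quantity":
            self._text = []  # start buffering character data

    def characters(self, content):
        if self._text is not None:
            self._text.append(content)

    def endElement(self, name):
        if name == "quantity":
            # The application parses the integer itself, by hand.
            self.items.append((self._sku, int("".join(self._text))))
            self._text = None


def read_order(data):
    """Parse a purchase order and return its (sku, quantity) pairs."""
    handler = OrderHandler()
    xml.sax.parseString(data, handler)
    return handler.items
```

Both items are read the same way whether or not the quantity carries an xsi:type, so the type annotation is simply extra information the application is free to disregard.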
A non-type-aware application that encounters <numFingers xsi:type="integer">010</numFingers> (or, equivalently, without the xsi:type and instead with a schemaLocation attribute pointing to a schema saying the same thing) will either:

1) If it has no hardcoded knowledge about the element, just ignore it or pass it through verbatim, as applicable - preserving the leading 0, since it does not know of any interpretation rules concerning the element content, so MUST NOT ATTEMPT TO BE CLEVER.
2) Have hardcoded knowledge from the programmer (who had a copy of the specification for the vocabulary in front of them) that numFingers contains a positive decimal integer, and treat the content as the number ten; the xsi:type is just redundant extra information here.
3) Incorrectly (because it's broken) assume that the content of numFingers is an integer written *backwards* with the least significant digit first, and remove the 'insignificant' zero at the end, and as such do something like output <numFingers xsi:type="integer">01</numFingers>.

Case (3) is the one that people who worry about type-aware systems stripping their 'apparently redundant' information away and breaking things seem to fear. However, only *obviously broken* code does this...

But getting back to the point - typed SAX.

Type-aware interpretation of XML is a fact of life as soon as you start passing anything other than human-language text in XML. As soon as you have something like version="1.0" lurking around, software is going to start doing things like converting that to a pair of integers and performing integer comparisons to see if this is a version it can support.

Typed schema languages (like XML Schema, not so much like DTDs) tend to set out a library of types, and a way of assigning those types to parts of an XML document, in an attempt to formalise this typing.
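To make the idea concrete, here is a minimal sketch of what a typed layer over SAX might look like, written against Python's standard xml.sax module. Nothing here is an existing SAX extension: the TypedHandler class, its type_map argument and the (name, raw, value) event tuples are hypothetical names chosen for illustration, and the type map simply stands in for whatever a schema or xsi:type would supply. The raw text is always reported, and the parsed value is None whenever no type is known or the text does not parse.

```python
import xml.sax


def parse_version(text):
    """Turn '1.0' into the tuple (1, 0) so versions compare numerically."""
    major, minor = text.strip().split(".")
    return int(major), int(minor)


class TypedHandler(xml.sax.ContentHandler):
    """Hypothetical typed layer: records (name, raw, value) per leaf element."""

    def __init__(self, type_map=None):
        super().__init__()
        self.type_map = type_map or {}  # element name -> parser function
        self.events = []
        self._stack = []  # one [text chunks, has_children] entry per open element

    def startElement(self, name, attrs):
        if self._stack:
            self._stack[-1][1] = True  # the parent has element content
        self._stack.append([[], False])

    def characters(self, content):
        if self._stack:
            self._stack[-1][0].append(content)

    def endElement(self, name):
        chunks, has_children = self._stack.pop()
        if has_children:
            return  # only leaf elements carry typed values in this sketch
        raw = "".join(chunks)
        value = None
        parser = self.type_map.get(name)
        if parser is not None:
            try:
                value = parser(raw)
            except ValueError:
                value = None  # fail safe: the raw text is still reported
        self.events.append((name, raw, value))


def typed_events(data, type_map=None):
    """Parse data and return the recorded (name, raw, value) events."""
    handler = TypedHandler(type_map)
    xml.sax.parseString(data, handler)
    return handler.events


# Hypothetical document and type assignments for the usage example.
DOC = b"<doc><version>1.0</version><numFingers>010</numFingers><note>010</note></doc>"
TYPES = {"version": parse_version, "numFingers": int}
```

With TYPES supplied, numFingers is reported as both the raw text "010" and the integer ten, while note, which has no assigned type, keeps its leading zero and gets no parsed value. Calling typed_events(DOC) with no type map at all reports only the raw text for every element. Comparing parsed versions as integer pairs also avoids the string-comparison trap where "1.10" sorts before "1.9".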
Without such schema languages, we would instead say "The version attribute contains the version number", thus informally assigning a type. HTML is strongly typed; some attributes must have a valid URI in them, or an integer (width= and so on).

This is not, in itself, a problem. The problems seem to have arisen in the area of the schema languages.

But typed SAX - although it would DEPEND on some external schema language or something like xsi:type to get its type information in the first place - would not introduce any dependency on that source of type information into the application... and as I have visualised the interface, it would 'fail safe' in the absence of a schema by just reporting character data, thus not introducing a dependency on schemas or whatnot into the documents it processed.

So, I ask, what could go wrong? :-) I might have missed something, some unforeseen consequence... but I think the fundamental nature of this thing (automatically doing something the programmer would otherwise do by hand, but only if explicitly asked to do so, and giving up gracefully if it can't be done automatically) means that it can't possibly cause a problem.

ABS