fwd: metaphorical Web
I don't normally forward newsletters to xml-dev, but this one has a very interesting report on Web Services and questions about things like binary representations of XML infosets. The XSLT 2.0 piece that follows may also be interesting to people with no interest in the Don Box story.

Thanks to Kurt Cagle, both for writing this and for saying it was okay to redistribute it.

-----------------------------------------------------
****************************
Kurt Cagle's Metaphorical Web
****************************
Wednesday, October 16, 2002
http://www.kurtcagle.net
kurt@k...
****************************

==========================================
Out of the Box
Don Box and Microsoft's XML Architecture
==========================================

I had the pleasure last night of listening to Don Box, one of the principal architects of SOAP and, as of January of this year, the Program Director for Microsoft's XML Architecture Group. A tall, energetic man with a salt and pepper beard and owlish glasses, Don held the audience of developers at the Seattle Dot Net Users Group rapt with his discussion of what is usually a deadly-dull topic -- technical standards.

Don is in charge of the group within Microsoft that deals with the pipes and plumbing of Microsoft's .NET/web services strategies. He is, in essence, sitting at the very epicenter of the most profound changes that have taken place within the company since the heady days of the browser wars in 1995 through 1997. The roadmap that he is laying out now will likely end up shaping application development at the software giant for the next five to ten years.

The strategy that Don laid out last night was, to say the least, audacious: push through standards that will rebuild the Internet from the ground (or perhaps more accurately, the sockets) on up, replacing not just the HTTP layer but potentially the TCP/IP infrastructure.
In its place would be a more stateful web, utilizing variable length SOAP messages that would be more conducive to web service architectures than the current unreliable, packet based system.

To get an idea of what that will likely entail (and why it may have such a high payoff for Microsoft ... if they don't fail), it's necessary to understand how sockets currently work. In the early 1980s the Berkeley socket architecture was built to make it possible to stream content between two computers using the Transmission Control Protocol, or TCP, with the data carried in packets of limited size. TCP runs on top of the Internet Protocol (IP), which delivers the individual packets, while TCP controls the reassembly of messages. Most operating systems have integrated the Berkeley socket architecture and have built networks using TCP/IP, to the extent that the older Banyan/Novell IPX architecture is becoming an anachronism.

The WS-Routing specification, in an effort spearheaded by Microsoft and IBM, would break packets along SOAP boundaries rather than at preset lengths, and as such would allow for the efficient transmission of complete SOAP commands, though it would rely upon TCP/IP packets and even HTTP for the transmission of non-XML attachments such as images, sounds or multimedia.

To do this effectively, every single operating system would have to adopt the WS-Routing architecture or be shut out of the process. The danger here is that you would end up for a while with a two tier Internet where much of the world is not on WS-Routing, with the very real consequence that TCP/IP-HTTP bridging solutions would need to be built, actually decreasing the efficiency of the networks over the few years that it would take for such a changeover to occur. It also assumes a willingness to modify or even replace billions of lines of code that have been built to utilize the TCP/IP architecture in order to go to this supposed next stage.
Don talked about a number of the other standards that Microsoft is currently trying to develop, either through their own auspices or in conjunction with IBM, Ariba, and others. These include distributed agreement protocols (WS-Coordination and WS-Transaction) for performing stateful transactions, federated oriented security (which includes an alphabet soup of protocols), and ubiquitous metadata for handling policy data. In some cases (such as with security) these efforts are being coordinated with OASIS, and in others they are being proposed through the WSIA, a standards body that Microsoft co-founded. Significantly, Microsoft is working only grudgingly with the W3C on the base web services specifications of SOAP and WSDL -- ironically the two standards that seem to be the most solid and widely adopted. Whether that is an anomaly or a central datapoint may ultimately determine the fate of Microsoft's .NET efforts.

One other facet of the talk that I think may point to some significant innovation is Don's discussion of the XML "stack". XML actually refers to three different concepts. The first, the one that most people who work with XML are familiar with, is the syntactical expression of "frozen" XML, the angle bracket tag and attribute syntax. Above this are the conceptual underpinnings of XML, the XML Infoset, which is basically the abstraction of a named tree structure with multiple types of nodes. This infoset really doesn't care about the syntactical representation of XML -- it is instead a document object model as represented internally in any number of different but congruent ways between systems (i.e., the ways that Java and .NET represent XML in memory are almost certain to be different, but they are equivalent in terms of the abstract model, the infoset).
The third form he brought up (the Post Schema Validation Infoset) is an infoset representation of XML, but with each item having a specific schema association with it. The idea here is an important one, perhaps even crucial in the realm of programmatic interfaces, though I think there is a danger in thinking that simply because you have an abstract model with intrinsic type associations, this is equivalent to an object that can readily be passed between systems.

Don brought up a goal that has occasionally been floated of having a compact, binary version of XML for intersystem communication, in part because the cost of parsing on the one hand and extracting on the other adds considerably to the total cost of transactions. However, the same arguments that applied three years ago when this idea first arose come out now -- within a homogeneous environment, passing binary objects is generally not a problem, and passing an infoset that has been rendered as a DOM is far more efficient than the parse/deparse mechanism that currently exists for passing XML. The problem is that the internal binary representation of that infoset IS extremely dependent upon the architecture of the host system, and that fact will likely not change any time soon.

On the other hand, it is possible that a binary to binary translation layer might actually prove to be an easier sell than the older COM/CORBA bridge interfaces that (almost) facilitated intersystem communication. With the establishment of a consistent DOM through the W3C, being able to work with a schema-aware infoset between systems has at least a chance to work, provided that there is some effort made to ensure that the bridges are kept open on both ends.

There was a lot more from the talk that I will try to cover in greater detail in subsequent columns. While I don't completely agree with every aspect of what I'm seeing Microsoft do, I can see valid reasons for most of it.
Perhaps as a caution, it's worth noting that there are standards bodies and then there are standards bodies. The fact that many of the application level protocols are running through OASIS is ultimately a good thing, because with an effort as Herculean as this, the more hands you can get to push the boulder up the hill, the more likely you'll reach the top.

============================================
Code: Creating named regexes with XSLT2
============================================

Here's some more exploration of some of the features in XSLT2 and XPath2, specifically the regular expression capabilities. For those of you who are not familiar with them, regular expressions (or regexes for short) use a set of predefined patterns and special characters to attempt to match a whole class of potential strings. They have two principal purposes: validating that a given string does in fact fit a specific profile, and transforming one string into another based upon general pattern matching rather than specific character matches.

Consider phone numbers. Most American phone numbers follow a very distinct sequence: three digits giving the area code (or the toll free code, in some cases), three digits indicating the exchange, and then four digits containing the local code within that exchange. These are critical. The problem is that there are also a number of different ways of grouping these numbers, and when someone enters such a number into a form, it would be nice if you could determine whether the phone number is valid in the permutation provided. For instance, for the phone number with area code 800, exchange 555 and local number 1212, the following are all valid:

    800.555.1212
    800-555-1212
    (800)-555-1212
    (800)555-1212
    (800)555.1212

while

    800.5554.1212

is not, because the exchange has four digits instead of three.
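As a quick sanity check of these examples outside XSLT, here is a minimal sketch in Python, which is not a language this column uses. It assumes the phone pattern that the column develops in the following paragraphs, copied into a Python raw string; the name is_valid_phone is hypothetical and only for illustration.

```python
import re

# The column's phone pattern, written as a Python raw string (assumption:
# Python's re module accepts this pattern unchanged).
PHONE = re.compile(r"^\(?(\d{3})\)?\s?\-?\.?\s?(\d{3})\-?\.?(\d{4})$")


def is_valid_phone(text):
    """Return True when text matches the phone pattern, False otherwise."""
    return PHONE.match(text) is not None


# The groupings listed above, plus the one with a four-digit exchange.
samples = [
    "800.555.1212",
    "800-555-1212",
    "(800)-555-1212",
    "(800)555-1212",
    "(800)555.1212",
    "800.5554.1212",
]

for sample in samples:
    verdict = "valid" if is_valid_phone(sample) else "invalid"
    print(sample, "is", verdict)
```

Only the last sample is reported as invalid, since the pattern allows exactly three digits for the exchange.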
XPath2 provides a number of string manipulation functions that accept regular expressions as arguments, but the two that I wanted to concentrate on are the matches() function and the replace() function. The matches() function takes the string to test and the regular expression to test against, and returns a Boolean value of true() if the expression matches and false() if it does not.

The regular expression for validating phone numbers can be pretty ugly, but here is at least one stab at it:

    ^\(?(\d{3})\)?\s?\-?\.?\s?(\d{3})\-?\.?(\d{4})$        (1)

Without going into a lot of detail, this basically says:

    ^               Match from the start of the string
    \(?             Accept an optional opening parenthesis
    (\d{3})         Find a sequence of three digits (\d) and remember them
    \)?             Accept an optional closing parenthesis
    \s?\-?\.?\s?    Accept optional white space, a dash, a period, and more white space
    (\d{3})         Remember the next sequence of three digits
    \-?\.?          Accept an optional dash or period
    (\d{4})         Remember the final sequence of four digits
    $               The string must terminate at this point

The matches() function would take a string (such as a phone number) and evaluate it against the above regular expression, as follows:

    matches('(800)555-1212','^\(?(\d{3})\)?\s?\-?\.?\s?(\d{3})\-?\.?(\d{4})$')

This would return the Boolean true() because the pattern in regex #1 is satisfied.

Similarly, you can use the replace() function to substitute a new string for the part of an input string that matches a pattern. The replace() function uses the Perl notation of back references -- if an expression in the regex is contained within parentheses, it is remembered in the order that it was encountered. The back references provide a way to retrieve these remembered expressions.
For instance, in

    replace('(800)555-1212','^\(?(\d{3})\)?\s?\-?\.?\s?(\d{3})\-?\.?(\d{4})$','$1.$2.$3')

the first expression to be matched (the area code) is assigned to back reference $1, the second (the exchange) to back reference $2, and the third (the local code) to back reference $3. This in turn will provide the output:

    800.555.1212

Now, I don't know about you, but '^\(?(\d{3})\)?\s?\-?\.?\s?(\d{3})\-?\.?(\d{4})$' doesn't exactly stand up and scream "phone number" to me. This tends to be the case with many regexes - they can be puzzled out with a lot of work, but in general they are far from being intuitive. Consequently, I got to thinking about how I could build a general library of regexes, each of which I could then refer to by name.

As it turns out there are two very different approaches that you can take, each with its own advantages and disadvantages. The first approach places the regexes into an XML file, with each regex being referenceable by name. For instance, the following illustrates just such a regular expression library (regexLib.xml):

    <regularExpressions>
        <regularExpression id="phone">
            <pattern>^\(?(\d{3})\)?\s?\-?\.?\s?(\d{3})\-?\.?(\d{4})$</pattern>
            <replace>($1)$2-$3</replace>
        </regularExpression>
        <regularExpression id="zipcode">
            <pattern>^(\d{5})(-\d{4})?$</pattern>
            <replace>$1$2</replace>
        </regularExpression>
    </regularExpressions>

This document establishes two regular expressions - one for phones, one for zipcodes - along with the standard replacement forms for encoding these.

With this approach, I can define a set of two XSLT functions in their own namespace (re:) called re:isValid() and re:format(). The re:isValid() function takes the string to be validated and tests it against the regular expression named in the second argument. For instance,

    re:isValid('800.555.1212','phone','') => true()

will return the Boolean value true(), indicating that it is a valid phone number.
The third argument is either a local or absolute URL to a library of regular expressions, and should usually be set to the empty string '' to use the default regexLib.xml file.

Meanwhile, the re:format() function takes a valid (but not necessarily conformant) input string and converts it into the standard form given by the <replace> element:

    re:format('800.555.1212','phone','') => '(800)555-1212'

Here is a preliminary regexes.xsl library file, showing how these functions are implemented.

    <xsl:stylesheet version="2.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:re="http://www.solvex.com/schemas/regex"
        exclude-result-prefixes="re">
        <xsl:output method="xml" media-type="text/xml" indent="yes"/>
        <xsl:variable name="regexes" select="document('regexLib.xml')"/>

        <xsl:function name="re:isValid">
            <xsl:param name="str"/>
            <xsl:param name="formatType"/>
            <xsl:param name="regexLibFile"/>
            <xsl:variable name="regexLib" select="if ($regexLibFile) then document($regexLibFile) else $regexes"/>
            <xsl:variable name="re" select="$regexLib//regularExpression[@id=$formatType]"/>
            <xsl:variable name="pattern" select="$re/pattern"/>
            <xsl:variable name="target" select="$re/replace"/>
            <xsl:result select="matches($str,$pattern)"/>
        </xsl:function>

        <xsl:function name="re:format">
            <xsl:param name="str"/>
            <xsl:param name="formatType"/>
            <xsl:param name="regexLibFile"/>
            <xsl:variable name="regexLib" select="if ($regexLibFile) then document($regexLibFile) else $regexes"/>
            <xsl:variable name="re" select="$regexLib//regularExpression[@id=$formatType]"/>
            <xsl:variable name="pattern" select="$re/pattern"/>
            <xsl:variable name="target" select="$re/replace"/>
            <xsl:result select="if (matches($str,$pattern)) then replace($str,$pattern,$target) else ''"/>
        </xsl:function>
    </xsl:stylesheet>

Finally, I wanted to include an xsl file that imported these routines and used them in something approaching a real world basis (regexesTest.xsl):

    <xsl:stylesheet version="2.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:re="http://www.solvex.com/schemas/regex"
        exclude-result-prefixes="re">
        <xsl:import href="regexes.xsl"/>
        <xsl:template match="/">
            <xsl:variable name="phoneNum1" select="'800.555.1212'"/>
            <xsl:variable name="phoneNum2" select="'800-5554-1212'"/>
            <xsl:variable name="zipCode" select="'45221'"/>
            <html>
                <body>
                    <h1>re:isValid</h1>
                    <p>The phone number <xsl:value-of select="$phoneNum1"/> is <xsl:value-of select="if (re:isValid($phoneNum1,'phone','')) then 'valid.' else 'invalid'"/></p>
                    <p>The phone number <xsl:value-of select="$phoneNum2"/> is <xsl:value-of select="if (re:isValid($phoneNum2,'phone','')) then 'valid.' else 'invalid'"/></p>
                    <p>The zipcode <xsl:value-of select="$zipCode"/> is <xsl:value-of select="if (re:isValid($zipCode,'zipcode','')) then 'valid.' else 'invalid'"/></p>
                    <h1>re:format</h1>
                    <p>The properly formatted form of <xsl:value-of select="$phoneNum1"/> is <xsl:value-of select="re:format($phoneNum1,'phone','')"/>.</p>
                    <p>The properly formatted form of <xsl:value-of select="$zipCode"/> is <xsl:value-of select="re:format($zipCode,'zipcode','')"/>.</p>
                    <p>Here is an example of an alternate regex library implementation for <xsl:value-of select="$phoneNum1"/>, returning <xsl:value-of disable-output-escaping="yes" select="re:format($phoneNum1,'phone','regexLibAlt.xml')"/></p>
                    <h1>re:phone</h1>
                    <p>You could also use the re:phone() function directly, returning <xsl:value-of select="re:phone($phoneNum1)"/></p>
                </body>
            </html>
        </xsl:template>
    </xsl:stylesheet>

The line

    <p>Here is an example of an alternate regex library ...</p>

uses an alternate library for performing regexes, regexLibAlt.xml.
The new library itself is significant because it illustrates a way that you can actually generate XML code using the re:format() function (regexLibAlt.xml):

    <regularExpressions>
        <regularExpression id="phone">
            <pattern>^\(?(\d{3})\)?\s?\-?\.?\s?(\d{3})\-?\.?(\d{4})$</pattern>
            <replace><![CDATA[
    <phone>
        <areacode>$1</areacode>
        <exchange>$2</exchange>
        <localcode>$3</localcode>
    </phone>]]></replace>
        </regularExpression>
        <regularExpression id="zipcode">
            <pattern>^(\d{5})(-\d{4})?$</pattern>
            <replace>$1$2</replace>
        </regularExpression>
    </regularExpressions>

Here, I've created a CDATA section that contains the mappings into the XML code:

    <replace><![CDATA[
    <phone>
        <areacode>$1</areacode>
        <exchange>$2</exchange>
        <localcode>$3</localcode>
    </phone>]]></replace>

The $1, $2 and $3 work as they did in the previous example. Normally, when returned through the <xsl:value-of/> statement, the tagged code is "escaped", with "<" and ">" characters converted into the &lt; and &gt; sequences. However, if you set the disable-output-escaping attribute of the <xsl:value-of/> element to "yes", this escaping is disabled when the result is serialized, and you generate pure XML code in the output. Thus, you could use regexes in this manner to build rich XML on the fly.
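The named-library idea can also be sketched outside XSLT. The following is a minimal Python analogue, not the column's implementation: the dictionaries DEFAULT_LIB and ALT_LIB are hypothetical in-memory stand-ins for regexLib.xml and regexLibAlt.xml, and is_valid and format_value are hypothetical counterparts of re:isValid() and re:format(). Python writes back references as \1, \2, \3 where XPath uses $1, $2, $3, and the sketch assumes Python 3.5 or later, where an unmatched optional group is replaced by an empty string.

```python
import re

# Patterns copied from the column's library files.
PHONE_PATTERN = r"^\(?(\d{3})\)?\s?\-?\.?\s?(\d{3})\-?\.?(\d{4})$"
ZIP_PATTERN = r"^(\d{5})(-\d{4})?$"

# Hypothetical stand-in for regexLib.xml: name -> (pattern, replacement).
DEFAULT_LIB = {
    "phone": (PHONE_PATTERN, r"(\1)\2-\3"),
    "zipcode": (ZIP_PATTERN, r"\1\2"),
}

# Hypothetical stand-in for regexLibAlt.xml: the phone replacement emits markup.
ALT_LIB = {
    "phone": (
        PHONE_PATTERN,
        r"<phone><areacode>\1</areacode><exchange>\2</exchange>"
        r"<localcode>\3</localcode></phone>",
    ),
    "zipcode": (ZIP_PATTERN, r"\1\2"),
}


def is_valid(text, name, lib=None):
    """Test text against the named pattern; lib defaults to DEFAULT_LIB."""
    pattern, _ = (lib or DEFAULT_LIB)[name]
    return re.match(pattern, text) is not None


def format_value(text, name, lib=None):
    """Rewrite text into the named standard form, or return '' if it is invalid."""
    pattern, replacement = (lib or DEFAULT_LIB)[name]
    if re.match(pattern, text) is None:
        return ""
    return re.sub(pattern, replacement, text)


print(format_value("800.555.1212", "phone"))
print(format_value("800.555.1212", "phone", ALT_LIB))
```

Passing ALT_LIB plays the role of the third argument to re:format(): the same input string is rewritten into a small XML fragment instead of the (800)555-1212 form.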
The alternative approach would be to create an XSLT named function for each regex and define the code inline:

    <xsl:function name="re:phone">
        <xsl:param name="str"/>
        <xsl:variable name="re" select="'^\(?(\d{3})\)?\s?\-?\.?\s?(\d{3})\-?\.?(\d{4})$'"/>
        <xsl:variable name="replaceStr" select="'($1)$2-$3'"/>
        <xsl:result select="if (matches($str,$re)) then replace($str,$re,$replaceStr) else ''"/>
    </xsl:function>

This would then be called as

    re:phone('888.555.1212') => '(888)555-1212'
    re:phone('888.5554.1212') => ''

Because XPath treats an empty string as false when it is evaluated as a Boolean, you can use this in an if expression to handle both valid and invalid input:

    <xsl:variable name="phoneNum" select="re:phone('888.555.1212')"/>
    The phone number is <xsl:value-of select="if ($phoneNum) then $phoneNum else 'not properly built.'"/>

Just as a side note, if you are not familiar with how to run these examples, you need to use the Saxon 7.2 processor, available from SourceForge at http://saxon.sourceforge.net. Extract the saxon7.jar file into a working directory in your classpath, then you can invoke these routines from the Windows or Unix command line as

    currentDir>java -jar saxon7.jar stub.xml regexesTest.xsl

or

    currentDir>java -jar saxon7.jar -o outputDoc.htm stub.xml regexesTest.xsl

if you wanted to direct the output to the file outputDoc.htm.

============================================
Pass the Word
============================================

I'm heartened and gratified by the number of people who have joined the list (60 and counting in two days). I have directed my current domain http://www.kurtcagle.net so that it now points to the Yahoo site, so you can see source code samples and archived columns for this work.

I have had a couple of questions as to why I'm using Yahoo groups to do this. At the moment, it's a matter of expediency.
My own server is sitting in a storage locker in Portland, Oregon while I look for a job, and until I land somewhere (and I am available, email me at kurt@k... for details) it's just easier to use existing tools. Once relocated, I'll probably move this newsletter on to its own server, if for nothing else than to escape the annoying advertising (and replace it with my own annoying advertising).

I'm doing this newsgroup as a free service. Please, if you like it, pass on the link (http://www.kurtcagle.net) to anyone that you know who might want to keep up with what's going on in my own little corner of the XML world.

Until next time ...

Kurt Cagle

**********************************************
Copyright 2002 Cagle Communications
All Rights Reserved
**********************************************

-------------
Simon St.Laurent - SSL is my TLA
http://simonstl.com may be my URI
http://monasticxml.org may be my ascetic URI
urn:oid:1.3.6.1.4.1.6320 is another possibility altogether