(Fwd) Abbreviated Format for XML
I am forwarding the following to XML-DEV on behalf of the W3C-AF activity.

>Reply-To: <avr@w...>
>From: "A.V.Ril" <avril@w...>
>To: "Peter Murray-Rust" <peter@u...>
>Subject: Abbreviated format
>Date: Thu, 30 Mar 2000 13:48:16 +0100
>Message-ID: <001501bf9a49$b44e8c80$9999a8c0@p300>
>MIME-Version: 1.0
>Content-Type: text/plain;
>charset="iso-8859-1"
>Content-Transfer-Encoding: 7bit
>X-Priority: 3 (Normal)
>X-MSMail-Priority: Normal
>X-Mailer: Microsoft Outlook CWS, Build 9.0.2416 (9.0.2910.0)
>X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2314.1300
>In-Reply-To: <38E276C2.2404C26B@m...>
>Importance: Normal
>Precedence: bulk
>
>Peter
>
> Please could you forward the following to XML-DEV as I am not subscribed.
> Many Thanks
>
> Veronica
>--------------------------8X-------------------------
>
>The W3C's XML Activities include an Abbreviation Format Activity which
>has been preparing a draft specification for XML compression. Normally this
>activity is for members only at this stage, but in response to the discussion on
>XML-DEV we have decided to make the first release of the draft available at:
>
>http://www.w3.org/NOTE/XML-AF-2000-04-01.html
>
> Please mail xml-af-list@w... with comments or queries.
>
> A. Veronica. Ril

I have checked this site but it is still password-protected. I have mailed Veronica but it will take a little while to unprotect so I have her permission to summarise.

Since "XML *is* SGML" it is possible to use SGML minimisation techniques in reducing the number of markup characters in a document. As all XML documents are well-formed, there is no explicit need for end-tags, and these can be replaced by newlines (technically REs or RSs in SGML - they are essentially the same, but subtly different). Because this may make the element nesting ambiguous, a DTD is prepended which defines unambiguous content models for every tag.
Since every end tag is replaced by a newline, all start tags are found at the start of lines and therefore the STAGO and STAGC characters ("pointy brackets") can be removed (remember that GIs cannot contain whitespace). This is an improvement, IMO, towards "XML shall be human-readable and reasonably clear" since there are no angle brackets. Thus a document of the form:

    <greetings>Hello World!</greetings>

is compressed to:

    greetings Hello World!

For a document of the form:

    <foo><bar>bar content</bar></foo>

we transform to:

    <!DOCTYPE foo [
    <!ELEMENT foo (bar)>
    <!ELEMENT bar (#PCDATA)>
    ]>
    foo bar bar content

The nesting is unambiguous because of the content model. It can be proved by forest automata that all documents can be reduced to unambiguous forms and a suitable DTD written.

Software to parse and maximise documents of this sort into WF-XML already exists (nsgmls). Creating the compressed representation is merely a matter of running nsgmls backwards. Since James Clark has made the code OpenSource it is a trivial matter to reorder the code in the reverse direction and recompile to slmgsn. (There is no need to try to *understand* the code, which is beyond mortals like me anyway!)

We also create an "SGML declaration" in a separate file. This is also compact since there are no vowels in it and all strings are <= 6 characters. It is therefore very readable, since vowels are redundant.

The process therefore consists of:

    XML document --> slmgsn --> XML-AF over the wire --> nsgmls --> XML document

The document is therefore compressed automatically (while still remaining as valid SGML) and then reconstituted into its original form by the pre-parser (nsgmls).

Further minimisation is possible. Since SGML forbids duplicated enumerated attribute values, the names of attributes can be omitted. By reverse compiling all attributes into enumerations in the DTD, no names, no equals signs and no quotes need be included, which saves a great deal of traffic.
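As a rough illustration only, the minimisation steps described so far (end-tags become newlines, start-tags lose their angle brackets, attribute names are dropped) might be sketched in Python as follows. The function name `minimise` and the regular expressions are assumptions made for this sketch; they come neither from the XML-AF draft nor from nsgmls. The sketch does not consult a DTD and ignores comments, CDATA sections, empty-element tags and single-quoted attribute values.

```python
import re

# Hypothetical helper for illustration; not part of any XML-AF draft or of nsgmls.
# Matches a start tag with zero or more double-quoted attributes.
START_TAG = re.compile(r'<([A-Za-z_][\w.-]*)((?:\s+[\w.-]+\s*=\s*"[^"]*")*)\s*>')
# Matches an end tag such as </foo>.
END_TAG = re.compile(r'</[A-Za-z_][\w.-]*\s*>')
# Captures the value of one double-quoted attribute.
ATTR_VALUE = re.compile(r'=\s*"([^"]*)"')


def minimise(xml):
    """Return the abbreviated form of a small, well-formed XML fragment."""

    def strip_start_tag(match):
        name, attrs = match.group(1), match.group(2)
        values = ATTR_VALUE.findall(attrs)  # keep values, drop names, '=' and quotes
        return " ".join([name] + values) + " "

    text = END_TAG.sub("\n", xml)               # every end-tag becomes a newline
    return START_TAG.sub(strip_start_tag, text)  # start-tags lose their brackets
```

With these assumptions, `minimise('<foo att2="d">fudge</foo>')` returns `foo d fudge` followed by a newline. Going back the other way would need the content models and attribute enumerations from the DTD, which this sketch does not attempt.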
Thus:

    <foo att2="d">fudge</foo>

can be minimised to:

    <!DOCTYPE foo [
    <!ELEMENT foo (#PCDATA)>
    <!ATTLIST foo
        att1 (a|b) #IMPLIED
        att2 (c|d) #REQUIRED
    ]>
    foo d fudge

again with an increase in human readability. The DTD can be used to differentiate the content from the attribute value. Note that in this case the document has to contain a "foo" element, so the element name can be omitted. The document now looks like:

    d fudge

which is about as short as we can get. The human reader can easily work out the tags and attribute names from the DTD.

When long words occur repeatedly in the text, they can be minimised through entities. Thus a long word like "internationalisation" can be defined as a text entity in the DTD and referred to in the text as &i18;. This is another great saving in transmission and, because of the shorter volume of text, it is clearly more readable.

It may appear that the document has become minimised at the expense of the DTD, but XML-AF have suggested a clever way round this. Common text entities are collected into "entity sets" and these can be pre-distributed with XML parsers, browsers and other client-side software. Various multilingual dictionaries have been engineered in this fashion. Similarly, common DTDs and Schemas will be enhanced as XML-AF and since most of these will be built into the browser anyway, users will only need to send the minimised XML document.

To manage the entity sets, schemas and DTDs XML-AF have suggested the concept of a "catalog". This catalog can use URIs or FPIs to reference the entities to be used. By careful use of FPIs the actual entity sets need only be referenced, not distributed.

Some documents, especially purchase orders, will become very common. In this case substantial parts of the purchase order will be "boilerplate" XML and will be invariant. These can be defined as larger entities and pre-installed on clients.
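A minimal Python sketch of the entity idea follows, assuming a hypothetical helper `abbreviate` and a caller-supplied table mapping entity names to the long words (or boilerplate) they stand for. Neither the helper nor the table format comes from the XML-AF draft; only the entity name `i18` is taken from the example above.

```python
import re


def abbreviate(text, entities):
    """Replace each listed word with an entity reference.

    `entities` maps an entity name to its replacement text (an assumed
    format for this sketch). Returns (declarations, abbreviated_text),
    where `declarations` is the block of <!ENTITY ...> lines that would
    go into the DTD or a pre-distributed entity set.
    """
    declarations = "\n".join(
        '<!ENTITY {} "{}">'.format(name, word) for name, word in entities.items()
    )
    for name, word in entities.items():
        reference = "&{};".format(name)
        # Whole-word match only; a function replacement avoids any
        # special handling of backslashes in the replacement string.
        text = re.sub(
            r"\b{}\b".format(re.escape(word)),
            lambda _match, ref=reference: ref,
            text,
        )
    return declarations, text
```

For example, calling `abbreviate("internationalisation is hard", {"i18": "internationalisation"})` yields the declaration `<!ENTITY i18 "internationalisation">` and the text `&i18; is hard`. A receiving parser that already holds the same entity set would expand the reference without the declaration being sent.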
In this way many documents will consist of a few entity references, enormously cutting down on traffic. Indeed, for repeat orders it is only necessary to send the URL for the DTD and a single entity reference.

I shall certainly be developing a CML version of this. Rather than transmitting complete molecules over the WWW, I now only need to transmit an entity reference on the assumption that every client will have (or be able to download) a DTD describing that molecule as an entity.

Note that a purchase order might now look like:

    Thomas Pynchon 3 Acme 01/02/03

It combines extreme terseness - no characters are wasted - with complete human readability. Everyone will be using the same schema for purchase orders (the one in the current schema-0 document, since no one has yet managed to work out how to write other ones). The document above is therefore unambiguous.

A really exciting possibility is that schemas themselves can be similarly compressed. Since a schema *is* XML and since XML *is* SGML, schemas will be compressed to human-readable length.

This is, of course, only suggested as a compressed transfer format. However its other virtues (readability and compatibility with SGML) mean that it may even start replacing XML V1.0 in critical places. Note, of course, that conventional compression techniques (ZIP, LZW, etc.) can still be applied to the result, which will normally be only a few bytes.

I commend the work of the XML-AF activity and look forward to seeing implementations.

P.

***************************************************************************
This is xml-dev, the mailing list for XML developers.
To unsubscribe, mailto:majordomo@x...&BODY=unsubscribe%20xml-dev
List archives are available at http://xml.org/archives/xml-dev/
***************************************************************************