[XML-DEV Mailing List Archive Home] [By Thread] [By Date] [Recent Entries] [Reply To This Message] Re: STAX Re: Abbreviated Tag Names
This inspired the following idea for a human readable (with some counting) compression scheme: Background: 1. What we are trying to compress is tag names and attribute names but no data. (PCDATA or attribute values) 2. Names in XML can not start with digits. [5] Name ::= (Letter | '_' | ':') (NameChar)* When writing the document to a stream the following algorithm is used: Every time a Name is to be written, except for end tags wich are always written as </>, the following rules apply in order: 1. If it is the same name as the last name that was written, use the empty string. otherwise continue to [2] 2. If the name has been written earlier in the document, replace the name with the number for the 'order of appearance' of the name. ie, the first name to appear in the document, typically the element name of the document element will get the number 0. (zero based indexing is assumed) 3. If it is the first time the Name is used, write the whole name unchanged and add it to the table of used names together with it's number of appearance, and increase the appearance number counter. This goes for both element and attribute names. When parsing the compressed document the following is done: Each time a valid Name is parsed (not starting with a digit) add it to an array of names Each time a number is parsed in the place of a name, look the name up in the array using the number as the index My experience is that the number of different Names in an xml document is usually not that large and I think that the implementation should be quite simple. As an example the following example taken from the XML Schema Primer is compressed: <?xml version="1.0"?> <purchaseOrder orderDate="1999-10-20"> <shipTo country="US"> <name>Alice Smith</name> <street>123 Maple Street</street> <city>Mill Valley</city> <state>CA</state> <zip>90952</zip> </shipTo> <billTo country="US"> <name>Robert Smith</name> <street>8 Oak Avenue</street> <city>Old Town</city> <state>PA</state> <zip>95819</zip> </billTo> <comment>Hurry, my lawn is going wild!</comment> <items> <item partNum="872-AA"> <productName>Lawnmower</productName> <quantity>1</quantity> <USPrice>148.95</USPrice> <comment>Confirm this is electric</comment> </item> <item partNum="926-AA"> <productName>Baby Monitor</productName> <quantity>1</quantity> <USPrice>39.98</USPrice> <shipDate>1999-05-21</shipDate> </item> </items> </purchaseOrder> The following names were found in order of appearance: purchaseOrder 0 orderDate 1 shipTo 2 country 3 name 4 street 5 city 6 state 7 zip 8 billTo 9 comment 10 items 11 item 12 partNum 13 productName 14 quantity 15 USPrice 16 shipDate 17 The compressed document: <?xml version="1.0"?> <?staxx?> <purchaseOrder orderDate="1999-10-20"> <shipTo country="US"> <name>Alice Smith</> <street>123 Maple Street</> <city>Mill Valley</> <state>CA</> <zip>90952</> </> <billTo 3="US"> <4>Robert Smith</> <5>8 Oak Avenue</> <6>Old Town</> <7>PA</> <8>95819</> </> <comment>Hurry, my lawn is going wild!</> <items> <item partNum="872-AA"> <productName>Lawnmower</> <quantity>1</> <USPrice>148.95</> <10>Confirm this is electric</> </> <12 13="926-AA"> <14>Baby Monitor</> <15>1</> <16>39.98</> <shipDate>1999-05-21</> </> </> </> By compression we're saving the following text from being written: name, street, city, state, zip, shipTo, country, name, name, street, street, city, city, state, state, zip, zip, billTo, comment, productName, quantity, USPrice, comment, comment, item, partNum, productName, productName, quantity, quantity, USPrice, USPrice, shipDate, item, items, purchaseOrder [229 characters] 18 characters for the numbers were added, saving a total of 211 characters out of originally 722. the processing instruction <?staxx?> was also added, 9 chars The transfer was reduced with about 30% in this case. Compared to the original compression scheme this should compress a bit better, especially for longer documents where there are more than a few occurences of the same Name. It does however require a stack for handling the end tags, as well as dynamic lookup tables during writing and parsing. I don't know if it is still SGML compliant, but I guess that there are enough SGML expertise on xml-dev to help with that issue. Please comment on anything I might have missed (It seemed a bit too easy to actually work) Rick Jelliffe wrote: > > If anyone is interested in taking this further, I think it would be good. > And it is probably the kind of small infrastructure upgrades that could be > fun and doable for open-source and collaborative development. > > Cheers > Rick Jelliffe -- Fredrik Lindgren Goyada AB
|
PURCHASE STYLUS STUDIO ONLINE TODAY!Purchasing Stylus Studio from our online shop is Easy, Secure and Value Priced! Download The World's Best XML IDE!Accelerate XML development with our award-winning XML IDE - Download a free trial today! Subscribe in XML format
|