RE: binxml proposal
Btw, I guess "Binary content of XML specification" is already given by the W3C. This is what is used in WML!

regards,
Gopi

-----Original Message-----
From: owner-xml-dev@x... [mailto:owner-xml-dev@x...] On Behalf Of Wayne Steele
Sent: Saturday, April 01, 2000 10:27 AM
To: xml-dev@x...
Subject: binxml proposal

There's been some discussion lately about a binary representation for XML documents. None of the binary-xml proposals I've seen so far look that useful to me, so let me present one that I think would make sense. If anyone finds this interesting, perhaps we can move forward and implement it.

As a placeholder for a real name, I call this 'binxml'.

Binxml is a compression format for XML documents. A well-formed XML Document (A) is mapped into binxml, which is stored or transmitted to another application. At a future point, the binxml is mapped back into an XML Document (B). Documents A and B should be identical for any significant purpose.

People may disagree about what is significant and what is not. I have preserved all the obvious things, as well as the Internal DTD subset, and prolog and suffix PIs and comments.

I have NOT preserved Whitespace in these places:
    Inside of the DTD
    Outside of the Document Element
    Between attributes

I have also not preserved the exact placement of namespace nodes, but I have allowed you to keep the prefixes. This might be a problem for some DTDs.

I'm assuming the document is well-formed to begin with. It should remain equally valid or invalid, except for the possible changes in namespace nodes.

I have not created an encoding for external DTD subsets. I don't see the same needs for compression wrt external DTDs. Just exchange them in plain text, like you do now.

The code points 0x00 - 0x08, 0x0B, 0x0C, 0x0E - 0x1F have been declared to be illegal in XML documents, so I have used these as binxml tokens.

You can use whatever unicode encoding you want, as long as it doesn't use the listed code points for special purposes. Binxml preserves it.
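As a minimal sketch of the token range just described (the names BINXML_TOKEN_CODE_POINTS and is_binxml_token are made up for illustration, they are not part of the proposal), the reserved code points can be checked like this. The gaps at 0x09, 0x0A and 0x0D are tab, line feed and carriage return, which XML 1.0 allows as ordinary characters.

```python
# Hypothetical helper names; only the code point ranges come from the proposal.
# 0x00-0x08, 0x0B, 0x0C and 0x0E-0x1F are not legal XML 1.0 characters,
# so the proposal is free to use them as binxml tokens.
BINXML_TOKEN_CODE_POINTS = frozenset(
    list(range(0x00, 0x09)) + [0x0B, 0x0C] + list(range(0x0E, 0x20))
)


def is_binxml_token(ch: str) -> bool:
    """Return True if the single character ch is in the reserved token range."""
    return ord(ch) in BINXML_TOKEN_CODE_POINTS


def split_at_first_token(text: str) -> tuple[str, str]:
    """Split text into the plain characters before the first token and the rest."""
    for i, ch in enumerate(text):
        if is_binxml_token(ch):
            return text[:i], text[i:]
    return text, ""
```

For example, split_at_first_token("abc\x06rest") separates the plain text "abc" from the part that starts with the 0x06 token.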
Here's the actual mapping:

A binxml file (or stream, or whatever) looks like this:

 1. byte-order-mark; if you're using UTF-16
 2. a "magic" string; Everybody else seems to be doing it. Actual value TBD.
 3. XMLDECL; A single token in lieu of the XML Declaration
 4. encoding string; Optional. The document's EncName.
 5. String Table Section; Mandatory.
 6. Prolog PIs, Comments; If present. The XML Decl is not included.
 7. DocTypeDecl and DTD; If present.
 8. Prolog PIs, Comments; If present.
 9. Document Element and contents; Mandatory.
10. Suffix PIs and Comments; If present.

1. Byte order mark.
This is just like XML. Because binxml tokens are defined as unicode code points, the encoding needs to be determined up front. If there is no BOM, UTF-8 will be assumed, until the end of the encoding string.

2. "magic" string.
This is just an additional check that you've got the right file type. How many characters is about right for this? Three? How about: "bx0" for binxml version zero.

3. XMLDECL
The XML Declaration in the original document is mapped to this one token. I am assuming XML version "1.0". If another one comes out, we can just add new codes here. There are three possibilities for the standalone declaration: yes, no, and not present. The most common encoding declarations are 'UTF-8' and 'UTF-16', so I have made special allowance for them. If the document has no encoding declaration, use an entry that says 'encoding follows', but omit section 4. If the document has no XMLDECL, use 0x9.

Values:
    0x1   standalone="yes"; encoding="utf-8"
    0x2   standalone="no"; encoding="utf-8"
    0x3   standalone is unspecified; encoding="utf-8"
    0x4   standalone="yes"; encoding="utf-16"
    0x5   standalone="no"; encoding="utf-16"
    0x6   standalone is unspecified; encoding="utf-16"
    0x7   standalone="yes"; encoding follows
    0x8   standalone="no"; encoding follows
    0x9   standalone is unspecified; encoding follows

4. Encoding String.
This section may only be present if the XMLDECL token is 0x7, 0x8, or 0x9.
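To make the XMLDECL values from section 3 concrete, here is a rough sketch of the table as a lookup. The names XMLDECL_TOKENS, xmldecl_token and encoding_may_follow are invented for this example. The sketch assumes encoding names are compared case-insensitively, and it does not try to tell "no encoding declaration" apart from "no XMLDECL at all", since both use an 'encoding follows' entry with section 4 omitted.

```python
from typing import Optional

# token -> (standalone, encoding)
# standalone None means "unspecified"; encoding None means "encoding follows".
XMLDECL_TOKENS = {
    0x1: ("yes", "utf-8"),
    0x2: ("no", "utf-8"),
    0x3: (None, "utf-8"),
    0x4: ("yes", "utf-16"),
    0x5: ("no", "utf-16"),
    0x6: (None, "utf-16"),
    0x7: ("yes", None),
    0x8: ("no", None),
    0x9: (None, None),
}


def xmldecl_token(standalone: Optional[str], encoding: Optional[str]) -> int:
    """Pick the XMLDECL token for a standalone value and an encoding name."""
    enc = encoding.lower() if encoding else None
    if enc not in ("utf-8", "utf-16"):
        enc = None  # any other (or missing) encoding uses an "encoding follows" entry
    for token, (sa, e) in XMLDECL_TOKENS.items():
        if sa == standalone and e == enc:
            return token
    raise ValueError(f"unsupported standalone value: {standalone!r}")


def encoding_may_follow(token: int) -> bool:
    """Section 4 (the encoding string) is only allowed after these tokens."""
    return token in (0x7, 0x8, 0x9)
```

A document declared with standalone="no" and encoding="ISO-8859-1" would map to 0x8 here, with the encoding name itself written out as the encoding string in section 4.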
Valid characters are [a-zA-Z0-9_.:] and '-'. The encoding takes effect (and ends section 4) with the first character outside of this range. The next character should be a binxml token, and they are all outside this range. Optionally, you may follow the Encoding string with a NUL (0x00). This might be needed to mark where the encoding begins for some really weird ones.

5. String Table Section.
Each entry is sequentially numbered, starting with one. There are five entry types. When you see [index], it means a reference to one of these entries. [index] is the size of one unicode code point, so it can be as large as 0x10FFFF, if you use surrogates. I'm hoping this will be enough for everyone's documents. This section ends when you hit a binxml token other than 0x0 - 0x4.

NamespaceEntry (no prefix specified)
    0x1, followed by the text of the namespace URI
    When decoded, any prefix may be used for the namespace declaration in the final document. Elements and attributes in this namespace will of course use that prefix.

NamespaceEntry (prefix specified)
    0x1, followed by the text of the namespace URI, 0x0, text of prefix
    When decoded, the same prefix must be used in the output document. Personally, I frown upon giving special meaning to prefixes, but XSLT seems to need this.

NameEntry
    0x2, followed by the text of the Name

QNameEntry
    0x3, [index], followed by the text of the BaseName
    The [index] here is for the corresponding namespace to qualify this QName.

CDataEntry
    0x4, followed by the text
    If the text needs to have an Entity Reference in it, you may include it with two characters: 0x0, followed by the [index].

EntityReference
    0x0, [index]
    [index] is the Name for this EntRef.

6. Prolog PIs, Comments
If there are Processing Instructions and/or Comments in the document before any DocType declaration, they go here. Do NOT put the XML Declaration here. It is addressed in section 3.
This section ends when you hit either 0x7 (a DocType declaration), or 0x8 or 0xB (the Document Element).

PI
    0x5, [index], text content of the PI
    The [index] is for the Name or QName that is the target of the PI. It is possible for there to be no text content.

Comment
    0x6, followed by the content of the comment

7. DocType Declaration and DTD
This section (if present) always starts with a DocType declaration. This may be followed by a PUBID and SYSID (in any order), if these are present in the document. Next are any declarations in the Internal DTD Subset (if any). This section ends with 0x5, 0x6 (a PI or Comment following the DTD, go to section 8), or 0x8, 0xB (Document Element).

DocType Declaration
    0x7, followed by the name of the doctype

PUBID
    0x1, followed by the text of the formal public identifier

SYSID
    0x2, followed by the URI for the System ID

I'm going to skip the internal DTD subset, and come back to it later.

8. Prolog PIs, Comments
This is just like section 6, except it can't be followed by a DocType declaration. This is for PIs and Comments that follow the DTD, but precede the Document Element.

9. Document Element and Contents
This is, of course, the meat of the XML Document. In most binxml, this will immediately follow the String Table. Everything in this section is represented in the same order it appears in the source document. Attributes immediately follow their containing element. The two different Attribute types may be freely interchanged. Attributes that declare namespaces (ie, namespace nodes) are not represented. This section ends at the end of the first element.

ElementStart
    0x8, [index]
    [index] is for the Name or QName of this element. Any Attributes must follow next. Everything else following, until an EndElement token is reached, is contained by this element.

EmptyElementStart
    0xB, [index]
    Like ElementStart, except this element has no child elements or other content - attributes only.
    Any element start token immediately following this one is a sibling, not a child.

EndElement
    0x6

AttributeInterned
    0xC, [index], [index2]
    [index] is the Name or QName of this attribute. Only use a QName if the document had this attribute EXPLICITLY qualified (ie, a global attribute). [index2] is the entry for the value of this attribute. It does not have to be a CDataEntry - it may be any other kind as well.

AttributeLiteral
    0x7, [index], text value of attribute
    This attribute has the value inline instead of in the String Table. If you need an Entity Reference inside the attribute value, you may include it.

EntityReferenceInsideAttribute
    0x0, [index]

The other tokens can be present in any order inside the content of an element. If text exists without a starting token, it is just a regular text node.

CData
    0x4, text inside the CData Section

PI
    0x11, [index], text inside the PI

Comment
    0x10, text of the comment

EntityReference
    0x5, [index]

Text
    0x3, the text itself
    This token is only used when a text node immediately follows a comment, a PI, a CDATA Section, or a literal attribute value. Otherwise text identifies itself without any token.

Interned Cdata
    0x2, [index]
    The index is to a String Table entry of any type. The contents of that Entry are copy/pasted right here. This may appear inside of Text, a Comment, PI, or literal Attribute Value.

10. Suffix PIs and Comments
If you have any PIs or Comments after the Document Element that you care about, put them here. This is just like sections 6 or 8.

DTDs, which I said I would come back to.
After the DocType declaration (section 7), may follow any number of these DTD Tokens, in (mostly) document order. There will be no Marked sections or Parameter Entities, as they aren't allowed inside the internal subset. Attlist declarations are folded into the element they go with. A different token is used for an element declaration depending on the content type.
ElementDecl, Content Type 'EMPTY'
    0x3, [index]

ElementDecl, Content Type 'ANY'
    0x4, [index]

ElementDecl, Detailed Content Type Specified
    0x6, [index], followed by Content Stuff

Content Stuff, in any order:
    one of the characters "(),|?+*", or
    0x7 followed by [index], or
    0x0 (meaning #PCDATA)

Any Attributes for this Element must be declared next. A different token or token-pair is used depending on the type of the attribute. There are forty attribute types: the cross section of {REQUIRED, IMPLIED, default value, fixed default value} and {CDATA, ID, IDREF, IDREFS, ENTITY, ENTITIES, NMTOKEN, NMTOKENS, enumerated, enumerated notations}. I have tried to optimize it so the most commonly used declarations just take one token, where the most obscure ones take two.

Any Fixed, Default, or enumerated attribute values must be in the String Table. The indexes for these below are shown as [fixed index] or [default index]. Enumerated types may have any number of index entries, terminated by a 0x0. For fixed or Default enumerated types, the first one listed is the default.
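The count of forty is just the cross product of the two sets above. A small sketch (the list names are made up here; the REQUIRED_CDATA style of naming is the one the proposal uses for its attribute declaration entries):

```python
from itertools import product

# The four default kinds and ten attribute types named in the proposal.
DEFAULT_KINDS = ["REQUIRED", "IMPLIED", "FIXED", "DEFAULT"]
ATTRIBUTE_TYPES = [
    "CDATA", "ID", "IDREF", "IDREFS", "ENTITY", "ENTITIES",
    "NMTOKEN", "NMTOKENS", "ENUM", "NOTATIONENUM",
]

# One declaration name per (type, kind) pair, grouped by attribute type.
ATTRIBUTE_DECL_NAMES = [
    f"{kind}_{atype}" for atype, kind in product(ATTRIBUTE_TYPES, DEFAULT_KINDS)
]

print(len(ATTRIBUTE_DECL_NAMES))  # 4 kinds x 10 types = 40
```

Each of these forty names needs its own token or token-pair, which is why the less common ones are pushed into two-token encodings.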
REQUIRED_CDATA          0x17, [index]
IMPLIED_CDATA           0x18, [index]
FIXED_CDATA             0x19, [index], [fixed index]
DEFAULT_CDATA           0x1A, [index], [default index]
REQUIRED_ID             0xC, 0x1, [index]
IMPLIED_ID              0x1B, [index]
FIXED_ID                0xC, 0x2, [index], [fixed index]
DEFAULT_ID              0xC, 0x3, [index], [default index]
REQUIRED_IDREF          0xC, 0x4, [index]
IMPLIED_IDREF           0x1C, [index]
FIXED_IDREF             0xC, 0x5, [index], [fixed index]
DEFAULT_IDREF           0xC, 0x6, [index], [default index]
REQUIRED_IDREFS         0xC, 0x7, [index]
IMPLIED_IDREFS          0x1D, [index]
FIXED_IDREFS            0xC, 0x8, [index], [fixed index]
DEFAULT_IDREFS          0xC, 0x9, [index], [default index]
REQUIRED_ENTITY         0xC, 0xa, [index]
IMPLIED_ENTITY          0xC, 0xb, [index]
FIXED_ENTITY            0xC, 0xc, [index], [fixed index]
DEFAULT_ENTITY          0xC, 0xd, [index], [default index]
REQUIRED_ENTITIES       0xC, 0xe, [index]
IMPLIED_ENTITIES        0xC, 0xf, [index]
FIXED_ENTITIES          0xC, 0x10, [index], [fixed index]
DEFAULT_ENTITIES        0xC, 0x11, [index], [default index]
REQUIRED_NMTOKEN        0xC, 0x12, [index]
IMPLIED_NMTOKEN         0xC, 0x13, [index]
FIXED_NMTOKEN           0xC, 0x14, [index], [fixed index]
DEFAULT_NMTOKEN         0xC, 0x15, [index], [default index]
REQUIRED_NMTOKENS       0xC, 0x16, [index]
IMPLIED_NMTOKENS        0xC, 0x17, [index]
FIXED_NMTOKENS          0xC, 0x18, [index], [fixed index]
DEFAULT_NMTOKENS        0xC, 0x19, [index], [default index]
REQUIRED_ENUM           0x1E, [index], [value index 1] ... [value index n], 0x00
IMPLIED_ENUM            0x1F, [index], [value index 1] ... [value index n], 0x00
FIXED_ENUM              0x1, [index], [value index 1] ... [value index n], 0x00
DEFAULT_ENUM            0x2, [index], [default index], [value index 1] ... [value index n], 0x00
REQUIRED_NOTATIONENUM   0xC, 0x1a, [index], [value index 1] ... [value index n], 0x00
IMPLIED_NOTATIONENUM    0xC, 0x1b, [index], [value index 1] ... [value index n], 0x00
FIXED_NOTATIONENUM      0xC, 0x1c, [index], [fixed index], [value index 1] ... [value index n], 0x00
DEFAULT_NOTATIONENUM    0xC, 0x1d, [index], [default index], [value index 1] ... [value index n], 0x00

Other things you might see in the Internal DTD Subset:
PUBID and SYSID are just like in section 7, both are optional, and may occur in either order.

NotationDeclaration
    0x14, [index], PUBID?, SYSID?

PI
    0x15, [index], content

Comment
    0x16, content

Internal Entity Decl
    0x12, [index], replacement text
    If you need to embed another entity reference in the replacement text, stick in ( 0x13, [index] )

Entity Reference inside an Entity Decl
    0x13, [index]

Parsed External Entity Decl
    0xF, [index], PUBID?, SYSID?

Unparsed External Entity Decl
    0xE, [index], [ndata index], PUBID?, SYSID?

Interned Cdata
    0x2, [index]
    This may only appear (in the DTD) inside of the content of a comment or PI

Whew! Not that complicated, but kind of tedious. I hope there are no tokens which would be ambiguous - if there are, it's an error of mine.

Open Questions:
    Should further compression be done for text content?
    Should it be allowed for the string table to be sprinkled throughout the document, to make it easier to stream-encode XML?

Feel free to tell me if you think this is crap, I can take it. Constructive comments are even more welcome.

-Wayne Steele

***************************************************************************
This is xml-dev, the mailing list for XML developers.
To unsubscribe, mailto:majordomo@x...&BODY=unsubscribe%20xml-dev
List archives are available at http://xml.org/archives/xml-dev/
***************************************************************************