Paul,

I've been bothered by the "format" problem for a while now. Here's a draft article + notes I started to write a bit ago, but haven't touched in months:

http://www.mystartmenu.com/streamforms/concept.html
http://www.mystartmenu.com/streamforms/outline.txt

> The TypeURI is a type identifier in URI rather than MIME syntax.

YES. I haven't seen anything like this published, and I would be glad to do an in-depth analysis/critique if you are serious about pursuing this.

- Chris

-----Original Message-----
From: Paul Prescod [mailto:paul@p...]
Sent: Thursday, March 20, 2003 2:01 PM
To: 'xml-dev'
Subject: Opinions

I'm curious whether anyone has proposed something like this before. I don't recall stumbling upon it. It just came to me during a bout of insomnia. Don't sweat the details...these are late night ramblings.

===

Abstract:

The Extensible Data Header is a standardized way for text documents to self-identify their text encoding, MIME type and other metadata.

Problem Statement:

One of the most persistently annoying issues in data management is keeping metadata with the data it describes. The most difficult (and important) sort of data to track is the "format" (encoding and media type) of files. There are a variety of platform-specific ways to solve parts of the problem (file extensions, filesystem attributes, shebang lines) but none of them survive the various mechanisms for transmitting data entities, from FTP to HTTP to Jabber.

XML has demonstrated the wide applicability of a solution: transmit the metadata as part of the same stream as the data. Furthermore, XML defines (explicitly and implicitly) a bootstrapping process whereby you can detect the fact that the data is XML through its XML declaration, its XML version through its version declaration, its encoding through its encoding declaration and its vocabulary through a DOCTYPE or namespace declaration. This series of bootstraps has been wildly successful.
With XML 1.1, it is possible for a PalmOS-based XML parser to reliably detect and decode an SVG document encoded in EBCDIC and using Macintosh newline conventions (if Macintosh newline conventions are possible in EBCDIC??). XDH aims to extend this level of self-descriptiveness to other data formats.

Examples:

<?text/rtf version="1.5" encoding="ASCII" DocURI="http://www.biblioscape.com/rtf15_spec.htm"?>
\rtf\....

<?application/zip version="1.0" encoding="ASCII" dataEncoding="binary" DocURI="http://www.pkware.com/products/enterprise/white_papers/appnote.html"?>

Definitions:

An XDH Document is a stream of bytes starting with a region of text known as a Header.

document ::= (header | extendedHeader) separator Body

A header is a stream of bytes in some Unicode encoding (including historical national encodings such as ASCII, Shift-JIS, etc.). The algorithm for auto-detecting the encoding is the same as that for XML. The production for header describes the post-decoding character sequence.

header ::= typeDeclaration metadata?
typeDeclaration ::= '<?' TypeDecl? VersionInfo? EncodingDecl? DocURI? DataEncodingDecl? XMLVersion? '?>'
TypeDecl ::= mimeType | TypeURI
TypeURI ::= URI
DocURI ::= URI
metadata ::= a single element with element type "xml:meta"

The MimeType is a mime type.

The TypeURI is a type identifier in URI rather than MIME syntax. Ideally, it can be dereferenced to return information that could be both human and machine readable. Two media types with different TypeURIs are presumed to be different for the purposes of this specification (just as if they were declared with two distinct MIME types).

The DocURI is a pointer to human or machine readable documentation about the data format and type. It is distinguished from the TypeURI in that it is not considered an identifier. You could point to one URI for information about the ZIP file format and I could point to another.

VersionInfo is any string that meets the XML production of the same name.
Its meaning is designed to be defined by the description of the MIME type.

The Encoding declaration is as defined in XML. It has the same defaults as XML.

The DataEncodingDecl is a pseudo-attribute named "dataEncoding". It defines the Unicode encoding not for the header but for the Body. The value "binary" is used to indicate that no Unicode decoding should be attempted for the Body. If the DataEncodingDecl is omitted, it defaults to the same encoding as the header.

The XmlVersionDecl declares what version of XML is in use. It defaults to 1.1 (???).

The metadata is just an XML element with arbitrary children and attributes. Each child element and attribute must have an XML namespace and processors should ignore elements or attributes in namespaces they are not programmed to recognize.

If the Body is in a different encoding than the header (especially binary) then the separator must be the character sequence FF, SUB, EOT (aka "^L^Z^D" aka "FORM FEED", "SUBSTITUTE", "END OF TRANSMISSION") which should serve to visually separate the text from the binary data in the terminal programs of most computers.

If the Body is in the same encoding as the header then the first line of the Body is either the line immediately following the "xml:meta" element or (if there is no such element) the line immediately following the typeDeclaration. If the Data begins with text of the form "<xml:meta" then the metadata element defined by this specification may not be omitted.

The Extended Header

The extended header is designed to support pre-existing uses for the first lines of files. It basically defines syntactic variations of the base header that are allowed for file formats designed before XDH (for instance programming language files).

extendedHeader ::= shebangLine? CCommentStart? header CCommentEnd?
shebangLine ::= "#!" Char* #xA
CCommentStart ::= S? "/*" S?
CCommentEnd ::= S? "*/" S?

In an extended header, any line may begin with a shellComment or CPlusComment.
If so, the comment prefix is ignored and the line is treated as if the prefix did not exist.

shellComment ::= S? ("#" S?)+
CPlusComment ::= S? ("//" S?)+

For example:

#!/usr/bin/python2.3
# <?application/x-python version="2.3"?>
import x
import y
print "z"

Backwards Compatibility

This specification does not change the definition of any pre-existing media types. They should be interpreted as per their various specifications. For example, most Unix systems will not support UCS-2 shell scripts even though this specification might allow such a declaration. The specification does, however, allow the addition of metadata to those media types for software applications that understand this specification. It is anticipated that new specifications will make normative references to this one so that this mechanism can replace the various ad hoc mechanisms for self-description and inline metadata.

-----------------------------------------------------------------
The xml-dev list is sponsored by XML.org <http://www.xml.org>, an initiative of OASIS <http://www.oasis-open.org>

The list archives are at http://lists.xml.org/archives/xml-dev/

To subscribe or unsubscribe from this list use the subscription manager: <http://lists.xml.org/ob/adm.pl>
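As a rough illustration of the header productions in the draft above, here is a minimal Python sketch of how a reader might pull the type declaration out of a file that uses either the plain or the extended header. It is not part of the proposal: the function and constant names are hypothetical, the regular expressions are a simplification of the grammar (a type is required, and a TypeURI containing "?" would not be recognized), and the separator is treated as the three bytes FF, SUB, EOT on the assumption that the header is in an ASCII-compatible encoding.

```python
import re

# Hypothetical names throughout; the patterns only approximate the draft grammar.
COMMENT_PREFIX = re.compile(r"^\s*(?:(?:#|//)\s*)+")  # shellComment | CPlusComment
DECLARATION = re.compile(r"<\?\s*(?P<type>[^\s?]+)(?P<attrs>.*?)\?>")
PSEUDO_ATTR = re.compile(r"""([A-Za-z_][\w.-]*)\s*=\s*(?:"([^"]*)"|'([^']*)')""")
SEPARATOR = b"\x0c\x1a\x04"  # FF, SUB, EOT (assumes an ASCII-compatible header)


def parse_type_declaration(text):
    """Return (type, pseudo_attributes) for a leading declaration, else None."""
    for index, line in enumerate(text.splitlines()):
        if index == 0 and line.startswith("#!"):
            continue  # shebangLine is allowed before the header
        candidate = COMMENT_PREFIX.sub("", line).strip()
        if candidate.startswith("/*"):
            candidate = candidate[2:].strip()  # CCommentStart
        if not candidate:
            continue  # skipping blank lines is a leniency, not in the draft
        match = DECLARATION.match(candidate)
        if match is None:
            return None  # first meaningful line is not a declaration
        attrs = {}
        for name, double_quoted, single_quoted in PSEUDO_ATTR.findall(match.group("attrs")):
            attrs[name] = double_quoted or single_quoted
        return match.group("type"), attrs
    return None


def split_binary_body(data):
    """Split raw bytes into (header_bytes, body_bytes) at the FF SUB EOT separator."""
    header, found, body = data.partition(SEPARATOR)
    return (header, body) if found else (data, b"")
```

Applied to the Python example from the draft, parse_type_declaration would skip the shebang line, strip the leading "# " and report the type application/x-python with a version pseudo-attribute of "2.3". Encoding auto-detection and the "xml:meta" element are left out of this sketch.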