
Opinions

  • To: "'xml-dev'" <xml-dev@l...>
  • Subject: Opinions
  • From: Paul Prescod <paul@p...>
  • Date: Thu, 20 Mar 2003 11:01:26 -0800
  • User-agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-US; rv:1.3a) Gecko/20021212

I'm curious whether anyone has proposed something like this before. I 
don't recall stumbling upon it. It just came to me during a bout of 
insomnia. Don't sweat the details...these are late night ramblings.

===

Abstract:

The Extensible Data Header is a standardized way for text documents to
self-identify their text encoding, MIME type and other metadata.

Problem Statement:

One of the most persistently annoying issues in data management is
keeping metadata with the data it describes. The most difficult (and
important) sort of data to track is the "format" (encoding and media
type) of files. There are a variety of platform specific ways to solve
parts of the problem (file extensions, filesystem attributes, shebang
lines) but none of them survive the various mechanisms for transmitting
data entities, from FTP to HTTP to Jabber.

XML has demonstrated the wide applicability of a solution: transmit the
metadata as part of the same stream as the data. Furthermore, XML
defines (explicitly and implicitly) a bootstrapping process whereby you
can detect the fact that the data is XML through its XML declaration,
its XML version through its version declaration, its encoding through
its encoding declaration and its vocabulary through a DOCTYPE or
namespace declaration. This series of bootstraps has been wildly
successful. With XML 1.1, it is possible for a PalmOS-based XML parser
to reliably detect and decode an SVG document encoded in EBCDIC and
using Macintosh newline conventions (if Macintosh newline conventions
are even possible in EBCDIC?). XDH aims to extend this level of
self-descriptiveness to other data formats.

Examples:

<?text/rtf version="1.5" encoding="ASCII"
	DocURI="http://www.biblioscape.com/rtf15_spec.htm"?>
\rtf\....

<?application/zip version="1.0" encoding="ASCII" dataEncoding="binary"
	DocURI="http://www.pkware.com/products/enterprise/white_papers/appnote.html"?>


Definitions:

An XDH Document is a stream of bytes starting with a region of text
known as a Header.

document ::= (header | extendedHeader) separator Body

A header is a stream of bytes in some Unicode encoding (including
historical national encodings such as ASCII, Shift-JIS, etc.). The
algorithm for auto-detecting the encoding is the same as that for XML.
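
As a rough illustration of that bootstrapping step, here is a minimal Python
sketch modelled on the (non-normative) autodetection appendix of the XML
spec: look for a byte-order mark first, then look at how the leading "<?" is
encoded. The function name sniff_header_encoding is made up for this example,
cp037 stands in for "some flavor of EBCDIC", and the UTF-8 fallback is an
assumption. It only handles a base header that starts at byte zero, not the
extended header.

```python
import codecs

# Byte-order marks, longest first so UTF-32 LE is not mistaken for UTF-16 LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Without a BOM, guess from the byte pattern of the leading "<?".
# cp037 is one EBCDIC code page, chosen here purely as an assumption.
_PATTERNS = [
    ("<?".encode(enc), enc)
    for enc in ("utf-32-le", "utf-32-be", "utf-16-le", "utf-16-be", "cp037")
]


def sniff_header_encoding(data: bytes) -> str:
    """Return a best-guess codec name for the header at the start of data."""
    for bom, enc in _BOMS:
        if data.startswith(bom):
            return enc
    for pattern, enc in _PATTERNS:
        if data.startswith(pattern):
            return enc
    # ASCII-compatible input (or anything unrecognized): assume UTF-8 and
    # let the encoding pseudo-attribute refine the guess later.
    return "utf-8"
```

The guess is only good enough to read the type declaration; the encoding
pseudo-attribute inside it is what names the real encoding.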

The production for header describes the post-decoding character
sequence.

header ::= typeDeclaration metadata?

typeDeclaration ::= '<?' TypeDecl?
		VersionInfo?
		EncodingDecl?
		DocURI?
		DataEncodingDecl?
		XMLVersion?
		 '?>'

TypeDecl ::= mimeType | TypeURI

TypeURI ::= URI

DocURI ::= URI

metadata ::= a single element with element type "xml:meta"

The mimeType is a MIME type.

The TypeURI is a type identifier in URI rather than MIME syntax.
Ideally, it can be dereferenced to return information that could be both
human and machine readable. Two media types with different TypeURIs are
presumed to be different for the purposes of this specification (just as
if they were declared with two distinct MIME types).

The DocURI is a pointer to human or machine readable documentation about
the data format and type. It is distinguished from the TypeURI in that
it is not considered an identifier. You could point to one URI for
information about the ZIP file format and I could point to another.

VersionInfo is any string that meets the XML production of the same
name. Its meaning is designed to be defined by the description of the
MIME type.

The Encoding declaration is as defined in XML. It has the same defaults
as XML.

The DataEncodingDecl is a pseudo-attribute named "dataEncoding". It
defines the Unicode encoding not for the header but for the Body. The
value "binary" is used to indicate that no Unicode decoding should be
attempted for the Body. If the DataEncodingDecl is omitted, it defaults
to the same encoding as the header.

The XMLVersion declaration declares what version of XML is in use. It 
defaults to 1.1 (???).
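
To make the pseudo-attribute rules concrete, here is a hedged Python sketch
that parses an already-decoded type declaration and applies the dataEncoding
default described above. The name parse_type_declaration, the returned dict
layout and the "a colon means TypeURI" test are all assumptions of this
example, as is the UTF-8 default for a missing encoding. It also assumes the
type token contains no whitespace, "?" or "=", so a TypeURI with a query
string would need a smarter tokenizer. The XML version pseudo-attribute is
left out because the draft does not name it.

```python
import re

# "<?", an optional type token, the pseudo-attributes, then "?>".
_DECL = re.compile(r"<\?(?P<type>[^\s?=]+(?=[\s?]))?(?P<attrs>.*?)\?>", re.DOTALL)
# name="value" or name='value'
_ATTR = re.compile(r"""(\w+)\s*=\s*(?:"([^"]*)"|'([^']*)')""")


def parse_type_declaration(text: str) -> dict:
    """Parse a decoded type declaration found at the start of text."""
    m = _DECL.match(text)
    if m is None:
        raise ValueError("no type declaration at start of text")
    attrs = {name: dq or sq for name, dq, sq in _ATTR.findall(m.group("attrs"))}
    type_token = m.group("type")
    kind = None
    if type_token is not None:
        # Assumption: a colon marks a TypeURI, anything else is a MIME type.
        kind = "uri" if ":" in type_token else "mime"
    encoding = attrs.get("encoding", "UTF-8")  # assumed XML-style default
    return {
        "type": type_token,
        "type_kind": kind,
        "version": attrs.get("version"),
        "encoding": encoding,
        "DocURI": attrs.get("DocURI"),
        # An omitted dataEncoding defaults to the header's encoding.
        "dataEncoding": attrs.get("dataEncoding", encoding),
    }
```

For the RTF example above this would give a "mime" type of text/rtf with
dataEncoding falling back to ASCII, while the ZIP example keeps its explicit
dataEncoding of "binary".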

The metadata is just an XML element with arbitrary children and 
attributes. Each child element and attribute must have an XML namespace 
and processors should ignore elements or attributes in namespaces they 
are not programmed to recognize.

If the Body is in a different encoding than the header (especially 
binary) then the separator must be the character sequence FF, SUB, EOT 
(aka "^L^Z^D" aka "FORM FEED", "SUBSTITUTE", "END OF TRANSMISSION") 
which should serve to visually separate the text from the binary data in 
the terminal programs of most computers.
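
A small Python sketch of splitting a document at that separator follows. It
assumes the three control characters are encoded in the header's encoding
(the text above only says "character sequence", so that is my reading, not a
settled rule), and the name split_at_separator is invented for the example.

```python
# FORM FEED, SUBSTITUTE, END OF TRANSMISSION
SEPARATOR = "\x0c\x1a\x04"


def split_at_separator(data: bytes, header_encoding: str = "ascii"):
    """Return (header_text, body_bytes), split at the first FF SUB EOT."""
    # Assumption: the separator is written in the header's encoding.
    sep = SEPARATOR.encode(header_encoding)
    head, found, body = data.partition(sep)
    if not found:
        raise ValueError("separator not found")
    # The body is returned undecoded; the caller applies dataEncoding
    # (or nothing at all when dataEncoding is "binary").
    return head.decode(header_encoding), body
```

For multi-byte header encodings such as UTF-16, pass an explicit
byte-order codec name ("utf-16-le" or "utf-16-be") so that no byte-order
mark is added to the separator.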

If the Body is in the same encoding as the header then the first line of 
the Body is either the line immediately following the "xml:meta" element 
or (if there is no such element) the line immediately following the 
typeDeclaration. If the Body begins with text of the form "<xml:meta" 
then the metadata element defined by this specification may not be 
omitted (otherwise the start of the Body would be mistaken for the 
metadata).


The Extended Header

The extended header is designed to support pre-existing uses for the 
first lines of files. It basically defines syntactic variations of the 
base header that are allowed for file formats designed before XDH (for 
instance programming language files).

extendedHeader ::= shebangLine? CCommentStart? header CCommentEnd?
shebangLine ::= "#!" Char* #xA
CCommentStart ::= S? "/*" S?
CCommentEnd ::= S? "*/" S?

In an extended header, any line may begin with a shellComment or 
CPlusComment. If so, the comment marker is ignored and the rest of the 
line is treated as if the marker were not there.

shellComment ::= S? ("#" S?)+
CPlusComment ::= S? ("//" S?)+

For example:

	#!/usr/bin/python2.3
	# <?application/x-python version="2.3"?>
	import x
	import y
	print "z"
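
Here is a rough Python sketch of how a processor might dig the type
declaration out of an extended header like the one above: drop the shebang
line, strip leading shell or C++ comment markers from each line, strip an
opening "/*", and then expect "<?" ... "?>". The function name
extract_type_declaration is hypothetical, and the sketch works on
already-decoded text and ignores the xml:meta element.

```python
import re

# shellComment ::= S? ("#" S?)+   CPlusComment ::= S? ("//" S?)+
_COMMENT_PREFIX = re.compile(r"^[ \t]*(?:(?:#|//)[ \t]*)+")
# CCommentStart ::= S? "/*" S?
_C_COMMENT_START = re.compile(r"^\s*/\*\s*")


def extract_type_declaration(text: str):
    """Return the type declaration from an extended header, or None."""
    lines = text.split("\n")
    # An optional shebang line comes first and is skipped entirely.
    if lines and lines[0].startswith("#!"):
        lines = lines[1:]
    # Comment markers at the start of any line are ignored.
    cleaned = "\n".join(_COMMENT_PREFIX.sub("", line) for line in lines)
    # An optional C comment opener may precede the header.
    cleaned = _C_COMMENT_START.sub("", cleaned, count=1)
    if not cleaned.startswith("<?"):
        return None
    end = cleaned.find("?>")
    return cleaned[: end + 2] if end != -1 else None
```

Run on the Python example above (without the mail indentation), this returns
the string <?application/x-python version="2.3"?>, which can then be parsed
like any base header.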



Backwards Compatibility

This specification does not change the definition of any pre-existing 
media types. They should be interpreted as per their various 
specifications. For example, most Unix systems will not support UCS-2 
shell scripts even though this specification might allow such a declaration.

The specification does, however, allow the addition of metadata to those 
media types for software applications that understand this specification.

It is anticipated that new specifications will make normative references 
to this one so that this mechanism can replace the various ad hoc 
mechanisms for self-description and inline metadata.
	

