
Re: [Summary, VERSION #2] Media type (MIME) of XML in MS Word?

  • To: xml-dev@l...
  • Subject: Re: [Summary, VERSION #2] Media type (MIME) of XML in MS Word? in Notepad? when compressed? etc
  • From: Amelia A Lewis <amyzing@t...>
  • Date: Tue, 13 Jun 2006 23:40:58 -0400
  • In-reply-to: <448F513D.8030206@z...>
  • Organization: The mysthical world of Talsever!

Ummm ....

On 2006-06-13 19:58:53 -0400 Rick Marshall <rjm@z...> wrote:
> This is messy in the real world and I applaud your attempt to make sense of 
> it - but I despair of you being successful.


> Note that this is very system dependent and anyone trying to make sense of it 
> either goes gray or loses their hair. I have made some more notes for you....

And the notes are mostly good, too, but ... well, you know, I spent a couple of years writing protocol decodes and reading RFCs and turning into a ridiculously pedantic standards-follower (sort of internalizing the protocol parser, or something), so ... a few more notes?

> Costello, Roger L. wrote:
>> MIME types are metadata. A MIME type is not stored within a resource. It is 
>> not stored as a property of a resource. Heuristics are used for determining 
>> the MIME type of a resource. In other words, a system "guesses" what the 
>> MIME type is.
> In the case of browsers they are told what the mime type is by the http 
> header. However in the case of one popular browser it chooses to ignore the 
> mime type in the http header and instead use the extension of the document 
> being retrieved. Which is really silly when the "document" is a cgi script.
> Generally in email and browsers the client application is told the mime type 
> - it does not guess
> To see why this is so you have to understand some history. In Unix (and now 
> Linux) the dot extension is arbitrary.

All true, but not actually particularly relevant to MIME.

Unix/Linux/BSD/whoever use a variety of heuristics to guess the file types; see file(1).  BeOS had a file system that allowed metadata to be stored in it, and one of the predefined metadata bits was a MIME type.  Apple HFS (and variants) also allows storage of metadata, but used a four-letter type code (this may have been supplemented with MIME types in recent years, for all I know).
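To make "heuristics" concrete, here is a toy content sniffer in Python. It is only a sketch of the general idea (look at the first few bytes, ignore the file name); it is not how file(1) is actually implemented, and the function name and the three signatures it checks are my own choices for illustration.

```python
def sniff(data: bytes) -> str:
    """Guess what a byte string contains from its leading bytes.

    Hypothetical helper: a real tool such as file(1) consults a much
    larger database of signatures than these three checks.
    """
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "PNG image"
    if data.startswith(b"%PDF-"):
        return "PDF document"
    # Skip a UTF-8 byte order mark, if present, before looking for
    # an XML declaration.
    if data.startswith(b"\xef\xbb\xbf"):
        data = data[3:]
    if data.lstrip().startswith(b"<?xml"):
        return "XML document"
    return "data"


xml_guess = sniff(b'<?xml version="1.0"?><doc/>')
pdf_guess = sniff(b"%PDF-1.4 ...")
unknown_guess = sniff(b"hello, world")
```

Note that nothing here produces or consumes a MIME type; the guess is made from the bytes alone.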

But MIME wasn't invented for file systems.  It was invented for protocols.

So ... when an application is opening a file in a file system, it typically won't care what the MIME type associated with that file is.  In most cases, it either isn't given (and can't get) that metadata, or it just *doesn't care*.  Most Windows programs care not at all about MIME types (they care about extensions, as Roger and Rick have noted); likewise programs written for MacOS didn't/don't care (they care about the four-letter type code).

You can *say* that they map to MIME types, if you'd like, and if it's useful conceptually, but in fact, they don't.  They're oblivious to MIME types.

A few programs, browsers especially, do care, and do perform an explicit mapping when they are opening files from the file system rather than resources retrieved by some network protocol.

Now, it becomes important to note that MIME wasn't invented to support HTTP, and in fact isn't a terribly good fit with HTTP.  MIME was invented to extend the Network Virtual Terminal (NVT), particularly the protocols built on NVT to deliver Internet Message Format, most especially SMTP (NNTP was also very important at the time that MIME was being actively developed).  NVT is defined to be seven-bit clean (only), and has defaults that are rational for a seven-bit channel.  For instance, if you are sending text/plain, but using Shift-JIS to encode the bits, how are you going to send it?  MIME is intended to solve that.  It is also intended to solve the problem of sending data that uses the full eight bits over the (merely seven-bit clean) channel.  The application/octet-stream MIME type is a good example of how much MIME can do with very little: it only says "hey, this is some sorta byte stream, dunno nothin' else".  Base64 encoding, popularized by MIME, rapidly replaced UUencoding and ad-hoc hexadecimal encoding.
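A minimal sketch of the Shift-JIS case, in Python, assuming nothing beyond the standard base64 module. The helper names (to_seven_bit, from_seven_bit) are hypothetical, and the charset parsing is deliberately naive; a real mail agent would also wrap the Base64 body at 76 characters per line, which is skipped here.

```python
import base64


def to_seven_bit(text, charset):
    """Return (headers, body) that are safe for a seven-bit channel."""
    raw = text.encode(charset)                      # eight-bit bytes
    body = base64.b64encode(raw).decode("ascii")    # US-ASCII only
    headers = {
        # The charset parameter tells the recipient how to read the bytes.
        "Content-Type": "text/plain; charset=" + charset,
        # The transfer encoding says how the bytes were made seven-bit safe.
        "Content-Transfer-Encoding": "base64",
    }
    return headers, body


def from_seven_bit(headers, body):
    """Reverse to_seven_bit using only the labels in the headers."""
    charset = headers["Content-Type"].split("charset=")[1]
    return base64.b64decode(body).decode(charset)


original = "こんにちは"
headers, body = to_seven_bit(original, "shift_jis")
round_tripped = from_seven_bit(headers, body)
```

The two headers carry the two separate pieces of information: which character encoding produced the bytes, and how those bytes were squeezed through the channel.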

This is a large part of the reason that text/xml is deprecated, btw.  The permitted behaviors of intermediaries and recipients for subtypes in the text main type are more variable than XML permits, especially with regard to whitespace, and there are mismatches with character palette and other things as well.

For all its warts (and they are many, disfiguring, hairy and dangling and terrifying to small children and sane adults (fortunately, there aren't too many of either who are network protocol geeks)), MIME worked, significantly enhancing the range of what could be transmitted by email, and ensuring the success of porn^Wbinary newsgroups.

HTTP, at least by 1.1, was emphatically an eight bit clean protocol (hey, years had passed, the network had gotten a lot more consistent as it got bigger).  It didn't need a lot of the MIME tricks, and the original (and most of the ongoing) developers of the protocol had a heartier appreciation for the failings and drawbacks of internet-message-format-based protocols than for their virtues.  Consequently, HTTP defined some new headers with the MIME-reserved prefix ("Content-"), reused the bits of MIME that they thought were hoopy froods, and utterly violated the MIME specification otherwise (by, for instance, changing the "defaults," what it means when a header does *not* appear), and because of the enormous success of HTTP, effectively created a class of "MIME-alike" protocols.

Which brings us to the present, and if you're wondering why there is no such thing as a generic MIME parser library for [insert your favorite programming language or environment here], it's because of the massive confusion surrounding MIME types, and the fact that a document sent over MIME-compliant protocols is apt to look quite different than the same document sent over MIME-alike protocols (you can't use the same parser, because the absence of certain headers has a different meaning for the two protocol classes, and the defaults for some headers in MIME-compliant protocols are *illegal* in HTTP and some other MIME-alike protocols).
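A small Python sketch of the "different defaults" problem. The function and constant names are made up for illustration. The two default values are the ones I recall from RFC 2045 (a mail entity with no Content-Type is text/plain in US-ASCII) and from RFC 2616 (an HTTP recipient that cannot determine the type should fall back to application/octet-stream); check the specifications before relying on either.

```python
from email.message import Message

# Default assumed from RFC 2045 for MIME-compliant (mail-style) protocols.
MIME_DEFAULT = "text/plain; charset=us-ascii"
# Fallback assumed from RFC 2616 for HTTP when the type stays unknown.
HTTP_FALLBACK = "application/octet-stream"


def effective_content_type(headers, protocol):
    """Return the media type a recipient should assume.

    Hypothetical helper: 'protocol' is either "mime" or "http".
    """
    declared = headers.get("Content-Type")
    if declared is not None:
        return declared
    if protocol == "mime":
        return MIME_DEFAULT
    if protocol == "http":
        return HTTP_FALLBACK
    raise ValueError("unknown protocol class: " + protocol)


no_headers = {}
mime_view = effective_content_type(no_headers, "mime")
http_view = effective_content_type(no_headers, "http")

# Python's own mail-oriented parser bakes in the mail default.
stdlib_default = Message().get_content_type()
```

The same empty header block yields two different answers, which is why a parser written for one protocol class cannot simply be pointed at the other.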

Oh, you weren't wondering about that.  Did I wander off on a tangent again?  Funny old things, tangents.  You never know where they'll lead.  Why, once, I remember ... uh.  No, never mind.  Let's see.  MIME.

XML does not have a MIME type.  There is a MIME type (actually, more than one) for XML, though.  The distinction matters.  You assign a MIME type, when it matters, typically when you're sending something over a MIME-compliant or MIME-alike protocol.  There's a place for the metadata to go, in that case.  When you're not doing MIME or HTTP's MIME-alike stuff, then it's more appropriate to say that there *isn't* a MIME type (exceptions, such as browsers, noted above), but some other roughly-equivalent mechanism.  The MIME type is an externally applied label, pure metadata (similar in a lot of ways to a file "type" extension or a MacOS type code, but *not as persistent* in either case).  In a MIME-compliant or MIME-alike environment, assign the MIME type; in other environments, determine the content using other heuristics.

"More than one" MIME type I said, and I don't see that in your writeup, Roger.  It was pointed out, in the arguments over text/xml versus application/xml, that it is often useful to say *more* about an XML document than merely that it is XML.  Consequently, an extension was invented, which allows subtypes to specify the content/semantic of a particular MIME type *and* specify that this subtype is delivered as XML.  Here's the result of grep xml /etc/mime.types on one of my machines (with the application/vnd.* pseudo-hierarchy elided):

application/rdf+xml                             rdf
application/rss+xml                             rss
application/xhtml+xml                           xhtml xht
application/xml                                 xml xsl
image/svg+xml                                   svg svgz

(those are associated "extensions" on the right)

Note that there are some interesting tricky bits here.  application/xml-dtd is *not* XML!  You want a DTD parser, not an XML parser.  For that matter, application/xml-external-parsed-entity could result in parse errors, if supplied to a parser as an XML document.  SVG has its own MIME type; so do RDF and RSS and XHTML and Docbook, all of which are defined (to some degree) by a schema (of some sort).  BEEP is a protocol ... with a MIME type?  That's either very cool or very scary.
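A sketch, in Python, of how a recipient might decide whether to hand something to an XML parser based on its MIME type. The function name is hypothetical, and the rule (the two generic XML types plus anything ending in +xml) is my reading of the suffix convention described above, not a complete implementation of any specification.

```python
def is_xml_media_type(media_type):
    """Return True if the media type names an XML document."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    essence = media_type.split(";", 1)[0].strip().lower()
    if essence in ("application/xml", "text/xml"):
        return True
    # The "+xml" suffix marks a specific vocabulary delivered as XML.
    return essence.endswith("+xml")


svg_is_xml = is_xml_media_type("image/svg+xml")
xhtml_is_xml = is_xml_media_type("application/xhtml+xml; charset=utf-8")
dtd_is_xml = is_xml_media_type("application/xml-dtd")
entity_is_xml = is_xml_media_type("application/xml-external-parsed-entity")
```

Matching on a prefix like "application/xml" would be the wrong test: it would sweep in application/xml-dtd and application/xml-external-parsed-entity, which is exactly the trap described above.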

Now a word about the mapping of MIME types to file extensions: typically, the browser doesn't do it.  The browser, according to the HTTP specification, just gets a header that identifies the MIME type, and renders accordingly.  However, for the file: URL scheme, the browser has to supply its own metadata (the file: scheme represents a sort of pseudo-protocol with a lot of looseness).  So most browsers are able to do some file extension to MIME type mapping (often varying between browsers and sometimes depending upon the URL scheme), which adds to the confusion.

What the MIME type mappings are *for*, though, is the HTTP *server*.  The server is supplied with the /etc/mime.types (or its equivalent in the windows registry for IIS?  I dunno nuttin' 'bout no IIS) mapping, and is then able to attach the MIME type metadata when asked to return a particular resource, based on the extension of the file.  This also allows you to do some tricky stuff.  For instance, s'pose you got a directory in your website.  It's normally served based on extensions; .xml is delivered to the browser with the header Content-Type: application/xml, and it renders accordingly (FSVO "accordingly"; it may be a tree, a styled page, or the raw XML).  If the directory is also made available, for instance, via WebDAV, it is not uncommon for it to be forced to text/plain when someone pulls it that way.  This is a nice trick (less so for XML than for application/x-httpd-php and friends, perhaps), 'cause it lets you get the stuff pretty much unchanged, which allows you to edit it.
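The server-side lookup can be sketched in a few lines of Python. The table below is just the grep output from earlier turned into a dictionary; the function name, the force parameter (standing in for the WebDAV-style override), and the application/octet-stream fallback for unknown extensions are assumptions for illustration, not the behavior of any particular server.

```python
import os

# Extension -> MIME type, taken from the /etc/mime.types excerpt above.
MIME_TYPES = {
    "rdf": "application/rdf+xml",
    "rss": "application/rss+xml",
    "xhtml": "application/xhtml+xml",
    "xht": "application/xhtml+xml",
    "xml": "application/xml",
    "xsl": "application/xml",
    "svg": "image/svg+xml",
    "svgz": "image/svg+xml",
}


def content_type_for(path, force=None):
    """Pick the Content-Type header value for a file being served.

    'force' models a per-access-method override, e.g. text/plain for
    an editing interface that wants the source unchanged.
    """
    if force is not None:
        return force
    extension = os.path.splitext(path)[1].lstrip(".").lower()
    return MIME_TYPES.get(extension, "application/octet-stream")


normal = content_type_for("/site/feed.xml")
for_editing = content_type_for("/site/feed.xml", force="text/plain")
unknown = content_type_for("/site/README")
```

The same file gets two different labels depending on how it is requested, which is the point: the label belongs to the transfer, not to the file.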

It's not so much that Unix/Linux doesn't need to change because MIME was invented for them--it wasn't; it was invented for the NVT and IMF.  It's that they don't need to change because they don't need to change; they handle opening files based on how they do it in unix, just as Windows does it the way that Windows does it, and MacOS does it the way MacOS does it, and none of them give a tinker's dam about MIME types.  Only sad network protocol geeks care about MIME types, and write enormously lengthy ranting diatribes ... *cough*!

Gosh, are you people still *reading* this?  Aren't you *bored*?  *laugh*  I mean, perma-threads may be tedious, but this isn't even really about XML!

Amelia A. Lewis                    amyzing {at} talsever.com
And now someone's on the telephone, desperate in his pain; 
someone's on the bathroom floor, doing her cocaine; 
someone's got his finger on the button in some room--
no one can convince me we aren't gluttons for our doom.
                -- Indigo Girls

