>-----Original Message-----
>From: Elliotte Rusty Harold [mailto:elharo@m...]
>Sent: Tuesday, January 15, 2002 16:17
>To: xml-dev@l...
>Subject: Re: Xml is _not_ selfdescribing
>
>
>At 2:52 PM +0100 1/15/02, Jens Jakob Andersen, PDI wrote:
>>Hello all
>>
>>I think that it is fair to conclude now, that XML is _not_ any more
>>selfdescribing than e.g. CSV files.
>>
>
>That's ridiculous. XML absolutely is more self-describing than CSV.
>Nothing here has proven otherwise. Your claim is indicative of the
>flawed binary logic that pervades much of the Internet. XML is not
>perfectly self-describing. Therefore it is not self-describing. But
>that's only a syllogism in binary logic. The real world isn't binary.
>It's fuzzy. There are degrees of things, including degrees of
>self-description.
>
>No serious analysis of how XML is actually used vs. how CSV files are
>actually used could possibly deny that XML is more self describing.
>The possibility that XML tag names could be chosen randomly does not
>evade the fact that they are not chosen randomly in the vast majority
>of cases. The evidence that some (though far from all) XML
>applications use extremely opaque tag names does not imply that there
>is no meaning there, or that this meaning cannot be teased out of an
>XML document by a sufficiently determined researcher. The need for
>genuine intelligence to comprehend and make use of this meaning does
>not make it useless.
>
>In reverse, the possibility of using column names in CSV files does
>not help in any way with the large proportion of CSV files that don't
>use column names. That the rows of a CSV file can match the column
>names doesn't help at all when they don't. In the real world, XML is
>simply easier to work with than CSV.

Once again, the problem here is very subtle. Tag names do improve the self-description of XML, in the same way that CSV column names do.
If we consider XML a way to serialise labelled trees, the simplest readable equivalent to column names in a serialised representation of a labelled tree is to write the schema element names in place, like XML tag names or YAML element names. Separating data from meta-data in a header-and-body fashion (a la CSV) could be possible in some cases, but would not be readable at all (try to picture it).

The subtle thing here is that, apart from the hierarchical vs. flat structure difference, there is no semantic "leap" from CSV to XML. Tag names are basically the same thing as column headers. Therefore, there is, I insist, no more self-description of data in XML documents than in CSV files.

You point out the fact that lots of CSV files don't have column headers. Fine. Then let's just create a CSV++ specification that enforces column headers. Et voila! We get a so-called "self-describing" CSV format.

As Bill de Hora says, there is no magical means by which a program can understand XML data better than CSV data. The opposite claim, however, has been heard often enough to justify our reacting against it.

I think that for most people on this list, "self-description" just means that, because the meta-data is embedded with the data, a human reader does not need to refer to separate documentation or a schema to find out what the data means. However, the fact that the meta-data is embedded does not change the nature and meaning of the data for a computer. Data remains basically data, and a human mind is required to interpret it and write programs that manipulate it.

However, for newspapers or IT managers, "self-description" is a term that adds knowledge and intelligence to the data, meaning that a computer program could use the self-description (i.e. the meta-data) to learn and adapt to updated or unknown document types.
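To make the comparison between tag names and column headers concrete, here is a minimal sketch in Python. The sample data and the function names are invented for this illustration and do not come from the thread; it only assumes a flat XML document whose records map one-to-one onto CSV rows with a header line.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data, invented for this illustration.
CSV_TEXT = "name,city\nAlice,Paris\nBob,Oslo\n"
XML_TEXT = (
    "<people>"
    "<person><name>Alice</name><city>Paris</city></person>"
    "<person><name>Bob</name><city>Oslo</city></person>"
    "</people>"
)


def records_from_csv(text):
    """Read rows as dicts keyed by the header line (the column names)."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def records_from_xml(text):
    """Read each child of the root as a dict keyed by its tag names."""
    root = ET.fromstring(text)
    return [{field.tag: field.text for field in record} for record in root]


csv_records = records_from_csv(CSV_TEXT)
xml_records = records_from_xml(XML_TEXT)

# Both parsers hand the program strings keyed by strings; nothing in either
# result tells the program what "name" or "city" actually means.
print(csv_records == xml_records)
```

In both cases the labels end up as plain dictionary keys, and a human still has to decide what to do with them. The sketch deliberately ignores the hierarchical case, where XML can nest records and a flat table cannot.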
Mentioning "self-description" makes the innocent reader believe that programs will use this self-description, which is not true (as mentioned earlier, very, very few programs process data at the meta-data level). This is a corollary of the "XML is the Lingua Franca of IT" claim (whereas it should only be considered one of its alphabets). This concept of "self-description" spawned the idea that programs could make sense of any kind of XML document, hence the "XML as the next computing revolution" hype.

And sooner or later, we'll have to pay the price of arrogance, as more and more people find out that XML is "just another data format". This could be the root of a severe backlash that would throw the baby out with the bathwater and benefit the next "computing revolution".

Regards,
Nicolas

P.S. Instead of CSV, you can read "any kind of tabular data format precise enough not to worry us with character sets, character escaping, etc.". As Mark Seaborne pointed out, CSV is hardly a format.