Re: What is XML For?
On Fri, 25 Oct 2002, Paul Prescod wrote:

> >><!ELEMENT purchaseOrder (buyer, seller, ...)>
> >
> > That agrees nothing!
>
> It agrees that the buyer precedes the seller and both go within the
> purchaseOrder.

It *states* that; it doesn't mean anyone *agrees* :-) But it doesn't say
whether the buyer or seller are denoted as URIs, numeric ids, or string
names, or whatever... the point I'm getting at is that you need more than
just a DTD. It's nice to have a standard way of writing parts of the
standard, but you still need to write up a lot of other stuff about the
meanings of things and so on.

> > Do you deny that groups of people get together to produce things like XHTML,
> > and vertical industry message formats? Because they do. The presence of DTDs
> > and schemas doesn't remove this requirement.
>
> DTDs and schemas give you a structure for _expressing_ your agreement in
> a human and machine-readable manner.

Yep! And that's all there is to it.

> The difference between
>
> a) installing a schema and reading its surrounding semantic
> documentation and
>
> b) reading a spec for a binary file format is massive.
>
> So massive, in fact, that the XML project is within the means of the
> average business programmer and the binary project is not.

No way. Have you ever looked at a spec for a binary file format? For most
of the ones I've dealt with, it has taken a few hours to bang out an
implementation (except TIFF; implementations of TIFF are never
finished...)

> >...
> > The Internet protocols don't have a formal schema notation, they're just
> > defined in English in RFCs. And they're more widespread than XML, I reckon;
> > it hasn't harmed them, has it?
>
> Have you ever tried to deploy a new Internet protocol?

Actually, yes :-)

> It is near impossible. That's why there are so few widely deployed ones.

No it's not... I've got quite a few custom protocols I put together
lurking around my systems.
The implementation of the latest one is quite simple:

SERVER - run from a cronjob at x pm

    pg_dump <details of database> | mcrypt -e <key> | nc -l -p xxxx

CLIENT - run from a cronjob at x:05 pm (to allow for clock skew)

    nc server xxxx | mcrypt -d <key> > database.dump

...but I've also produced a few RPC protocols. Let's see if any of them
are lying around... hmmm... not handy, but take a look in
/usr/include/rpcsvc on a Unix box for a few. I've also put together a
replacement for RMI that's a bit less tightly bound (the default RMI
implementation is somewhat fragile!).

> Now compare that effort to deploying a new XML vocabulary. Sure, non-XML
> formats can become popular, as informally specified protocols can become
> popular. The question is how much effort it takes. This effort greatly
> impacts the _likelihood_ of the format/protocol gaining popularity.

It's not that hard :-) Try it!

> >...
> > Stop and think about that. What differentiates these products, hmm? Do I buy
> > Corel because it uses a different in-memory data structure to xfig?
>
> Insofar as people buy products in part for their performance, the answer
> is definitely YES. In particular, I find it absurd that you would argue
> that SAP and Quickbooks should use the same data structures despite the
> fact that one runs relational-backed enterprises and the other
> Windows-hosted small businesses. SAP _could not_ get away with using the
> same datastructures that QuickBooks does.

Why not? QB could embed a small SQL server - you can get in-memory SQL
servers - and use an identical table layout... if there are enough
differences between the small business and large business *models* then
you're comparing apples and oranges anyway.

> > Yep, just because I know more about bitmap file formats - overall, we are
> > discussing data interchange in general; you brought up vector files as an
> > example, I bring up bitmap files.
>
> You well know that almost nobody proposes to use XML for bitmaps.
No, but that's not the point! It's just an area of file formats that I
happen to know lots about, having implemented most of the common ones.

> Furthermore, vector graphics provide many more opportunities for
> optimization based on intelligent choice of data structures. I have a
> friend who built a commercially successful graphics program around a
> _single_ proprietary vector graphics algorithm/datastructure pair. It
> allowed certain kinds of scaling that were impossible with the more
> traditional algorithms. And in fact this is a very common case in the 3D
> graphics world.

But it's still a list of objects, perhaps with a semantic tree such as an
object grouping / containment hierarchy and maybe with layers. Your
in-memory structure *has* to have that or else it's discarding
information it'll need when it comes to saving the file again (dedicated
readers that know they only need a subset of the information are a
different matter, though). It may overlay that tree with a lookup index,
but that tree will still be there...

> > "...into memory as a character array..."
> >
> > Not a *byte* array!
>
> So the on-disk representation is _different than_ the in-memory
> representation.

Depends what level you look at - sure, most machines use magnetic disks
as opposed to electronic memories these days :-) But in Java it's all
just characters at the programmer's level.

This is drifting off the original point, though; I was not aiming at
*bit* equivalence but *structure* equivalence. You were arguing against
things that automatically map from your XML or binary data to, say, Java
objects, since you thought you'd want a different structure in memory
than on disk; I maintain that you rarely if ever want to do more than add
extra indexing.

Let's look at some examples.

1) Vector graphics, although not my speciality

What is the data model? Usually some kind of space partition tree with
the leaf nodes being one of a set of primitives.
The tree can either be semantically based - the grouping of objects into
larger objects - or a more rigid thing like a BSP tree or an octree that
is used to optimise certain lookup operations. In the former case you'll
need that structure to exist both on disk and in memory, since otherwise
semantically meaningful information will be lost. In memory you might add
a lookup table from object ID to the actual object to avoid having to
walk the tree to find arbitrary objects, perhaps.

2) Plain text

The only model for this other than a list of characters that I've seen is
a list of lines, each of which is a list of characters - and a slightly
funny one involving two lists of characters which is used in some editor
implementations. Either way you still have the same underlying sequence;
you just break it up in various ways into easier to handle chunks.

3) Bitmapped images

These always come down to 'some metadata' and 'a 2D array of pixel
values'. The most variation is in the metadata; applications will tend to
have a native model which is identical to the metadata system of their
favourite format and map other formats in and out of that.

4) A table of information, SQL stylee

There are a fair few indexing schemes that can be applied here, but it's
still a table; an ordered multiset of tuples. (A pure relation in the
mathematical sense is an unordered set of tuples.)

> In my experience it is painful and obfuscatory to use CSV for
> hierarchical or linked information. But if you and your customers enjoy
> it, then I'm glad you're using what works for you.

Not that much information is hierarchical, certainly by bulk... CSV is
fine for links, though, since it's pretty much the SQL data model and
it's easy to have foreign keys.
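As a rough illustration of the point about links, here is a minimal Python sketch of two CSV tables joined through a foreign key, with a lookup index layered on top of the table in memory rather than replacing it. The table contents, column names and helper functions are all made up for the example; they are not taken from any real system discussed in this thread.

```python
import csv
import io

# Hypothetical data: orders.customer_id is a foreign key into customers.id.
CUSTOMERS_CSV = 'id,name\n1,"Acme Ltd"\n2,"Widget Co"\n'
ORDERS_CSV = (
    'order_id,customer_id,item\n'
    '100,2,"Cheese"\n'
    '101,1,"Pizza"\n'
    '102,2,"Yoghurt"\n'
)


def load_table(text):
    """Read CSV text into a list of dicts, keeping the on-disk row order."""
    return list(csv.DictReader(io.StringIO(text)))


def index_by(rows, key):
    """Build a lookup index over a table; the table itself is unchanged."""
    return {row[key]: row for row in rows}


def join_orders(orders, customers_by_id):
    """Resolve each order's foreign key to the customer's name."""
    return [
        (o["order_id"], customers_by_id[o["customer_id"]]["name"], o["item"])
        for o in orders
    ]


customers = load_table(CUSTOMERS_CSV)
orders = load_table(ORDERS_CSV)
joined = join_orders(orders, index_by(customers, "id"))
```

The in-memory form is the same table that sits in the file; the only addition is the dictionary used as an index.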
We're going to add a more hierarchical structure in future (to allow some
fields to contain lists and tables); the jury's still out on the details
of that for interchange, but XML probably still won't be a great
contender, since we'd ideally not have to change EVERYTHING about the
file format for a little thing like that. For now we'll probably go for
something like:

    email,faveFoods
    "alaric@a...","Cheese"+"Yoghurt"+"Pizza"

...and just add another transition into the parser's state machine for
the + symbol after a closing quote, leading back to the state that comes
after a comma, with appropriate actions on the data buffer.

> Paul Prescod

ABS

-- 
Alaric B. Snell
http://www.alaric-snell.com/  http://RFC.net/  http://www.warhead.org.uk/
   Any sufficiently advanced technology can be emulated in software
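The '+' transition described above might look roughly like the following Python sketch. It is only an illustration under stated assumptions, not the actual parser: the state names and the function parse_line are hypothetical, every field is returned as a list of values, and quote escaping is deliberately left out.

```python
# Hypothetical state names for a small CSV-style state machine.
FIELD_START, UNQUOTED, QUOTED, AFTER_QUOTE = range(4)


def parse_line(line):
    """Parse one record; each field comes back as a list of values."""
    state = FIELD_START
    buf = []      # characters of the value currently being read
    values = []   # values collected so far for the current field
    fields = []   # completed fields

    def end_value():
        values.append("".join(buf))
        buf.clear()

    def end_field():
        nonlocal values
        fields.append(values)
        values = []

    for ch in line:
        if state == FIELD_START:
            if ch == '"':
                state = QUOTED
            elif ch == ',':
                end_value()          # empty value
                end_field()
            else:
                buf.append(ch)
                state = UNQUOTED
        elif state == UNQUOTED:
            if ch == ',':
                end_value()
                end_field()
                state = FIELD_START
            else:
                buf.append(ch)
        elif state == QUOTED:
            if ch == '"':
                state = AFTER_QUOTE
            else:
                buf.append(ch)
        else:  # AFTER_QUOTE
            if ch == ',':
                end_value()
                end_field()
                state = FIELD_START
            elif ch == '+':
                # The one new transition: close this value, stay in the
                # same field, and go back to the state that follows a comma.
                end_value()
                state = FIELD_START
            else:
                raise ValueError("unexpected %r after closing quote" % ch)

    if state == QUOTED:
        raise ValueError("unterminated quoted value")
    end_value()
    end_field()
    return fields


record = parse_line('"alaric@a...","Cheese"+"Yoghurt"+"Pizza"')
```

Apart from the `+` branch, the machine is what an ordinary quoted-CSV reader would already have, which is the sense in which the change is a small one.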