Re: Some thoughts on 'direct access' to XML (long)
On Saturday 18 January 2003 10:54 am, Sean McGrath wrote:

> If you do that you have - perhaps without realizing it - made your XML
> significantly
> less useful. Your XML has become process specific. Change your object
> structure (because of requirements changes or bugs) and all your
> serializations instantly turn into legacy. Interop with other systems
> is no better that it would have been with Java serialized objects,
> Python pickles, marshalled CORBA objects etc.

I disagree here on one level. There is nothing really that different between XML, Java serialisation, Python pickles, and especially marshalled CORBA. In all cases, you have some defined data model.

XML's data model (like it or not!) is either 'bits of text wrapped in elements and attributes etc etc' or the whole PSVI thing, depending. Java serialisation's data model is based around a few basic types - chars, ints, longs, booleans, etc - plus a single constructed type, the object, which is nowt more than a load of named fields with values. It has a concept of pointer which XML lacks.

Java serialisation has nothing whatsoever to do with *methods* or *code*, however; it stands up without that. The only implementation widely in use happens to be written in Java and happens to parse the serialised stuff by constructing Java objects, mind, and the de facto 'schema' language for Java serialisation happens to be Java classes (although the serialisation stuff ignores all the methods and just looks at class names, inheritance, and the definitions of fields). You can put stuff in the Java methods to override serialisation behaviour, true, but that's just part of the particular serialisation algorithm the aforementioned Java implementation uses.

There's nothing stopping you writing a C library to read and write Java serialised object files.
You'd use it something like this:

    /* Hypothetical library: none of these functions exist yet. */
    FILE *fp = fopen ("test.ser", "rb");   /* binary stream, so "rb" */
    Java *j = parseJavaObject (fp);
    Java *j2 = getJavaObjectField (j, "firstName");
    /* _data is a character array, so fetch it as one rather than as an int */
    JavaCharArray *name = getJavaCharArrayField (j2, "_data");
    printf ("First name: %s\n", convertJavaCharArrayToCString (name));
    destroyJavaObject (j);
    fclose (fp);

I'm guessing off the top of my head that the serialised form of a Java String object has an internal character array called _data; this is defined in some spec somewhere but I don't have it to hand. Anyway - it's not really any different to talking PSVI, say.

CORBA marshalling is even more so; whereas Java serialisation is designed to be able to easily freeze the state of Java objects, and as such has a type system suspiciously related to Java's, CORBA was designed around a new data model from scratch, like ASN.1, that's intended to just be good for modelling information.

The difference between XML/CORBA/BER/PER/DER/etc on one hand, as language-independent data modelling systems, and Java serialisation / Python pickling / etc on the other is really only down to two things:

 1) The data model being designed to map seamlessly into a particular language's model
 2) Nobody spending the effort writing other implementations

Now, where I work, we were storing serialised Java in a database (for various reasons, mainly to do with getting around inflexibilities in SQL's type system). However, we wanted to offload certain operations to the database server as a stored procedure written in C. We were already using a modified serialisation mechanism to avoid space inefficiencies in Java's serialisation, but I believe it would have been just as easy using straight Java serialisation (although I'd have had to hunt down the serialisation specs first rather than just having our specs for our own format lying around). So we wrote a C interface to our serialisation format, and we are now performing processing on these Java objects from C, and we're happy!
We can't call the Java methods from C, but we can access all the object's fields...

> 7. Programming languages can and should move past SAX/DOM for
> accessing XML. For pure document processing, they both have
> their place but for Objects and Records (as the terms are used
> in mainstream programming), they are sorely lacking. I believe
> it is entirely possible to make the programmers life easy
> WITHOUT turning BOXED XML into basket of object-serialization
> technologies.

I'd agree with you there, though, but only because I don't think that particular serialisation techniques really impinge that much on the formats of the underlying bit streams.

Put it this way: Java serialisation, my own custom compact Java serialisation, and XML all have, to a significant extent, the following structure for a 'compound thing':

 - Some kind of thing-type identifier
 - Maybe a length count here
 - A list of things that are nested inside the current thing
 - If we didn't have a length count, an end-thing marker

Then there's a notion of one or more 'non-compound things', which are marked as such in their type identifier and/or by context, and have the same basic format but something else instead of the 'list of things'.

In my serialisation format, the thing-type identifier is a single byte with a specified list of values. There can never be more than 256 possible types, you might think, but type zero is 'user class' (followed by the class name) while all the others are things like backpointers to previously serialised objects, end markers, all the basic Java types (int, long, boolean, etc) plus a host of the java.util.* collection classes which I special-case the encoding of to make them more compact.

In XML, the thing-type is a QName; there's a slight caveat in that the nested things might be attributes or child elements, and non-compound things are PIs, comments, CDATAs, etc.
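
To make the 'compound thing' layout a bit more concrete, here is a minimal sketch in C of the length-count variant. It is not the format described in this post: the type codes, the one-byte length field, and the names Thing, read_thing and nth_child are all invented for illustration. What it tries to show is that a length count lets a reader step over a nested thing without decoding it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type codes, invented for this sketch only. */
enum { T_INT = 1, T_STRING = 2, T_OBJECT = 3 };

/* One decoded "thing": its type byte plus a view of its payload.
   For T_OBJECT the payload is itself a sequence of nested things. */
typedef struct {
    uint8_t        type;
    const uint8_t *payload;
    size_t         len;
} Thing;

/* Assumed wire layout: [type byte][length byte][payload of that length].
   Returns the total bytes the thing occupies (header + payload), or 0 if
   the buffer is too short. *out is only written on success. */
size_t read_thing(const uint8_t *buf, size_t avail, Thing *out)
{
    assert(buf != NULL && out != NULL);
    if (avail < 2)
        return 0;
    size_t len = buf[1];
    if (avail - 2 < len)
        return 0;
    out->type = buf[0];
    out->payload = buf + 2;
    out->len = len;
    return 2 + len;
}

/* Fetch the index-th (0-based) thing nested directly inside parent.
   Earlier siblings are skipped via their length counts, so nothing is
   built in memory. Returns 1 on success, 0 if there is no such child. */
int nth_child(const Thing *parent, size_t index, Thing *out)
{
    const uint8_t *p = parent->payload;
    size_t left = parent->len;
    for (;;) {
        size_t used = read_thing(p, left, out);
        if (used == 0)
            return 0;          /* ran off the end of the parent */
        if (index == 0)
            return 1;          /* this is the child we wanted */
        index--;
        p += used;             /* hop over the sibling in one step */
        left -= used;
    }
}
```

Given the ten bytes 3, 8, 1, 1, 42, 2, 3, 'A', 'B', 'S' (an object holding an int and a three-character string), read_thing yields the object and nth_child(&obj, 1, &child) lands on the string after a single hop over the int. With an end-thing marker instead of a length count, the reader would have to parse each sibling in full to find where the next one starts.
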
In Java serialisation, from vague memory, the thing-type ID is a single byte for basic types and back-pointers, or L followed by a class name for everything else, and there's a special marker that goes after the last field of an object to mark the end rather than a length code.

I used length codes in my serialisation thingy because the C code does some fast XPath-esque stuff where it pulls out a single field or two from an object without constructing the entire thing as a tree in memory.

If Java serialisation were a bit more compact and they had libraries for reading it from C, we might have used it as is. If it was still large but had C libraries I'd have been sorely tempted, and if it was compact but lacked C libraries I'd have just written the C libraries...

ABS

-- 
Oh, pilot of the storm who leaves no trace, Like thoughts inside a dream
Heed the path that led me to that place, Yellow desert screen