Seeking advice on handling large industry-standard XML data models [long]
I'd say "document models", but this is really a data model in XML guise. We have fairly large XML data-transfer document. By "large" I mean that the document itself is more or less described in around 25,000 lines of (inadequate) XML Schema. This is one of those models like I imagine some of the Oasis documents to be: a data interchange format targetted at a wide variety of processing domains within a vertical or horizontal market. It's really not a common data model; each processing domain may use the same data, but with somewhat different names, constructs, formats, and concepts. Plus there are many "uncommon" types of data that have not found their way into this standard (which will, of course, be added in using elements and attributes under non-standard namespaces). I'm sure for some of you this sounds mundane. Good! -- that's what I'm hoping for. Well, of course, each domain needs to populate documents based on this model. I can see several approaches: 1. Build a type-specific document object model I'm pretty sure I don't want to go this route. This approach may make sense for small, stable document models; but for this one in particular, I'm having a hard time seeing the payback. Reasons: a) it's large This is going to be one monster object model. Source generation from the Schema using Castor or JAXB would offer a good first approach, but neither tool is mature enough yet (Castor can't handle simple-type elements; JAXB chokes on circular schema includes). Even with code generation, however, there's still a lot of work left to be done. The code that is generated will be fine-grained Java beans, which is all well and good but not at a high-enough a level of abstraction to be useful for an API. There's also a lot of co-occurrence constraints in this particular document model that would have to be coded up. And due to a limitation of XML Schema (or a limitation in the XML mindset of the designers, depending on where you sit in the unordered content debate), many of the one-to-one cardinality constraints are defined as one-to-many, so the generated objects will need tweaking. So it's a generate once, tweak many proposition. b) it's somewhat undefined There's also the issue of all those custom attributes and elements that are going to be added at various processing stages. We of course don't know what they will be (except for the ones we create), but we do have to preserve them. This argues for a generic data structure to store such (meta)data, so you're going to have to have a bit of a DOM thrown in as well. c) it's still young New version of spec leads to new, improved class definitions. Oh, and object-to-object data migrations, all hand-coded (or maybe using reflection, if you have the talent and patience). What fun. d) monster potential Given enough implemention cost, managers will seek to reuse. I see a real danger here of this code developing into the Common Object Model. Hey, it talks to all these processes, right? So each process can just use this model directly, right? We can build all our applications on this data model! Unfortunately this makes a lot of superficial sense, which is what managers tend to go by (I was one, once). Given the disparity of the domains this addresses, I just don't think it would work out. Better to populate from app-specific data models to a common *interchange* format, IMNSHO. 
Granted, Java has a lot of XML libraries; language support is not the concern. It's the implementation of a large, document-type-specific class hierarchy whose value I question.

2. Use a generic object model

Say, fer instance, DOM 3 (which will be finished real soon now). Advantages:

a) it can handle any content

Which means it can handle all those custom document components under non-spec namespaces the same way it handles the spec'd components.

b) continuous validation

Although the DOM 3 WG tossed aside Abstract Schemas, they still intend to support continuous validation, last I checked. I think Xerces already implemented continuous validation support for JDOM, but I'm too lazy to check right now. Anyhow, the upshot is that, today or someday, all the validation against newly entered data will be 'free'. Which beats custom code, IMO. (A rough sketch of the revalidate-on-demand version follows the conclusion.)

But there are disadvantages:

c) everybody uses DOM, nobody likes DOM

I don't know why... must be the documentation. JDOM is certainly nicer for Java volken.

d) not so fast...

No matter how you slice it, DOM will be a bit slower in the validation department than generated or custom objects. At least that's what I think.

3. Web publishing

This is the term I use for the "take XML document, transform, post on server" approach. A lot of people on this list would be comfortable with this approach, but it's kind of radical in my little circle. The basic idea is to build HTML-based forms to handle input. Theoretically, creating HTML forms specific to each domain (think custom views) is just a matter of writing new XSLT scripts. Advantages:

a) modular and pipelined

Take a complex XML document, cut it down to size, and add abstractions where appropriate. Generate an HTML form using another transform. Handle input, run the reverse process, and voila! -- populated interchange data. (The second sketch after the conclusion shows the outbound half of such a pipeline.)

b) XML tools for managing XML

This means experts in XML (the tool authors) handle much of the in-memory management of the document data, including validation. Generic programming languages like Java are nice, but there's a tendency toward overkill and "creativity" on the part of developers. It's a little Siren song Java sings to you as you type... XML-specific tools tend to push you to be task-focused.

Disadvantages:

a) browser interface

I work with Mac developers. If it ain't Aqua, forgeddabowdit.

b) tool maturity

We need some interactive graphical component to the UI. Nothing fancy, at least nothing SVG can't handle (I don't think). But... binary input is via PDF. There are some converters out there, but I doubt they're going to handle anything complex, and we get some off-the-wall PDF sometimes. We'll also need to generate PDF for reports. XSL-FO may be sufficient, but I can't say for sure.

c) I don't know scratch about XSL-FO, XSP, JSP, ASP, etc., etc.

I'm not a Web developer by any stretch of the imagination, although I have done a proof-of-concept on this. Perhaps it's just too bleeding edge, eh?

Conclusion:

First of all, I'd like to thank those of you who've bothered to read this far. (I know I haven't bothered to reread it, so I expect you've seen a lot of typos.) None of these implementation choices is entirely exclusive of the others. If I were to put them on a scale from safe and traditional (1) to bleeding edge and radical (3), I'm at about a 2.4. I know some of you have worked with large XML interchange formats for many years, so your insight (or incite, if you're like me) is appreciated. I'm looking forward to hearing about alternatives and better assessments of risk.
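As promised, a rough sketch of what option 2 degrades to in practice: a namespace-aware DOM plus revalidate-on-demand, assuming the JAXP validation API (javax.xml.validation) rather than true DOM 3 continuous validation. The file names and the custom namespace are made up:

    import java.io.File;

    import javax.xml.XMLConstants;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.transform.dom.DOMSource;
    import javax.xml.validation.Schema;
    import javax.xml.validation.SchemaFactory;
    import javax.xml.validation.Validator;

    import org.w3c.dom.Document;
    import org.w3c.dom.Element;

    public class RevalidateDemo {
        public static void main(String[] args) throws Exception {
            // Compile the (large) schema once up front.
            SchemaFactory sf =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = sf.newSchema(new File("interchange.xsd"));

            // A namespace-aware DOM holds the spec'd content and the custom
            // elements under non-standard namespaces with equal indifference.
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            DocumentBuilder db = dbf.newDocumentBuilder();
            Document doc = db.parse(new File("instance.xml"));

            // Edit the document generically...
            Element note = doc.createElementNS("urn:example:custom", "c:note");
            doc.getDocumentElement().appendChild(note);

            // ...then revalidate the whole tree. Not continuous validation,
            // but the validation code itself is still 'free'.
            Validator v = schema.newValidator();
            v.validate(new DOMSource(doc)); // throws SAXException if invalid
        }
    }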
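And the outbound half of the option-3 pipeline, which really is just a transform per domain view. This uses the TrAX API (javax.xml.transform); the stylesheet and file names are hypothetical:

    import java.io.File;

    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.stream.StreamResult;
    import javax.xml.transform.stream.StreamSource;

    public class FormPipeline {
        public static void main(String[] args) throws Exception {
            TransformerFactory tf = TransformerFactory.newInstance();

            // Outbound: cut the interchange document down to a
            // domain-specific HTML form. Supporting a new domain view
            // means writing a new stylesheet, not new Java code.
            Transformer toForm =
                tf.newTransformer(new StreamSource(new File("domain-view.xsl")));
            toForm.transform(new StreamSource(new File("interchange.xml")),
                             new StreamResult(new File("form.html")));

            // Inbound would be the reverse: form submission -> small XML
            // fragment -> another transform -> merged back into the
            // interchange document.
        }
    }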