Re: Discovering document types - best practice?
At 12:13 PM 6/24/99 +0100, james@x... wrote:
>This may seem like a simple problem, but I can't find any references to how best to solve
>it.

It seems like it should be a simple problem, but there are lots of complexities along the way. XML has no reliable 'document type identification' mechanism because of the approach it takes to validation. (Internal subsets can change the rules on a per-document basis; namespaces and validation don't get along very well at present, MIME types aren't yet commonly used for XML, and application/xml doesn't tell you anything about what vocabulary is used, just that it's XML.)

>I need to process a series of xml documents, which can be in a number of different
>formats. I don't know in advance the type of the documents, only their URLs. What is the
>best way of analysing what type the document is (and how to process it)? Is there a "best
>practice" for this?

I wish there were... as time goes on, I do expect more XML documents to get MIME types identifying them specifically, but this hasn't happened yet. For some of the complexities involved in this process, see the discussion archives at http://www.imc.org/ietf-xml-mime/. When MIME types get straightened out, you could use HEAD requests to the server to get a MIME type back and base your processing on that rather than downloading entire documents in order to determine if they fit your requirements.

>For example, should I
>
>1) Try and read the document type declaration? If so, what function/property should I be
>using? I'm using MS XMLDOM (from IE 5).

I haven't gotten this close to Microsoft's XML processors in a while, having been burned a number of times, so I don't know the actual API. Even if you have access to the document type declaration, it may not be easy to process that information. Simple declarations that just identify a root element and external subset of the DTD generally provide you with a reliable identifier of document type.
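
To make the simple case concrete, here is a minimal sketch of reading the declaration and reporting what it claims. It uses Python's standard xml.dom.minidom as a stand-in, not the MS XMLDOM API asked about, and the `describe_doctype` helper, the `memo` root element, and the `memo.dtd` system identifier are all hypothetical names chosen for illustration. It only inspects the declaration itself; it does not fetch or validate against the external DTD.

```python
from xml.dom import minidom


def describe_doctype(xml_text):
    """Return what the DOCTYPE declaration claims about a document.

    Returns None when the document has no DOCTYPE declaration at all.
    (Hypothetical helper for illustration; not part of any library.)
    """
    doc = minidom.parseString(xml_text)
    doctype = doc.doctype
    if doctype is None:
        return None
    return {
        "root": doctype.name,            # declared root element name
        "public_id": doctype.publicId,   # None if no PUBLIC identifier
        "system_id": doctype.systemId,   # location of the external subset
        # An internal subset means the declaration alone may not tell
        # the whole story about the document's rules.
        "has_internal_subset": doctype.internalSubset is not None,
    }


# Hypothetical sample documents.
SIMPLE = '<!DOCTYPE memo SYSTEM "memo.dtd"><memo>hello</memo>'
MODIFIED = (
    '<!DOCTYPE memo SYSTEM "memo.dtd" [<!ENTITY co "Example Co.">]>'
    "<memo>hello</memo>"
)
NO_DOCTYPE = "<memo>hello</memo>"

if __name__ == "__main__":
    for sample in (SIMPLE, MODIFIED, NO_DOCTYPE):
        print(describe_doctype(sample))
```

A caller might treat the system identifier as a usable type key only when `has_internal_subset` is false, and fall back to some other check otherwise.
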
More complex declarations that include an internal subset may trip you up by assembling a modular DTD on the fly or overriding and extending declarations from the external DTD. This can get complex. In simple cases, it's not bad, but in cases with an internal subset, it can be difficult to work with.

>2) Try and look for a link to a XML schema?

If schemas were ready for prime-time... at least I haven't seen internal subset proposals for schemas.

>3) Just start walking the tree looking for particular nodes in a particular order?

This is the most accurate, but also the most costly. Especially if you're expecting to find a lot of documents you don't plan to actually use in your list of URLs, you can churn through processor cycles and discard the results.

I took a stab at describing document classes a few weeks ago, creating a fairly simple spec called XPDL, for XML Processing Description Language. Details are at http://purl.oclc.org/NET/xpdl. That might solve a lot of your problems, but only if people actually used it in their files.

I'd love to hear about other approaches people are taking to this problem...

Simon St.Laurent
XML: A Primer / Building XML Applications
Inside XML DTDs: Scientific and Technical (July)
Sharing Bandwidth / Cookies
http://www.simonstl.com

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@i...
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To (un)subscribe, mailto:majordomo@i... the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo@i... the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa@i...)