
RE: converting 1-20 GB xml to xsd, visualizing on webpage

  • From: "Michael Kay" <mike@s...>
  • To: "'Steven J. DeRose'" <sderose@a...>,"'Farkas, Illes'" <illes.farkas@g...>,<xml-dev@l...>
  • Date: Mon, 20 Oct 2008 20:04:21 +0100

Actually, quite a few tools do a reasonable job of constructing a schema from an XML document (e.g. Stylus Studio, XML Spy); I suspect the main thing stopping the original poster from using most of them is the size of the input. That's why I suggested the Saxon DTDGenerator, which is fully streamed.
 
Clearly there's more than one possible schema/DTD that could be produced, but most of these tools do quite a good job of recognizing common patterns. For example, if every occurrence of element E has children that match the pattern P*Q*R*S*T*, then the tool will usually be able to generate a content model for E that consists of the sequence PQRST with appropriate occurrence indicators. In the case of the Saxon DTDGenerator, if it finds one instance where the children are PQR and another where they are RQP, then it generates the content model (P|Q|R)*. It's all very heuristic and the results will never be perfect, but it's surprising how often they come reasonably close to the DTD that you would have written by hand.
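
To make the heuristic above concrete, here is a minimal Python sketch. It is not the Saxon DTDGenerator's actual algorithm; the function names (infer_content_model, _topo_order) are made up for illustration, and it assumes you have already collected, for one element type, the list of child element names seen in each instance. If all instances agree on a single ordering, it emits a sequence with occurrence indicators; otherwise it falls back to a repeated choice.

```python
from itertools import groupby


def _topo_order(names, before):
    """Return names in an order consistent with `before`, or None on a cycle."""
    remaining = list(names)
    order = []
    while remaining:
        for n in remaining:
            # n can be placed if no still-unplaced name must precede it
            if not any((m, n) in before for m in remaining):
                order.append(n)
                remaining.remove(n)
                break
        else:
            return None  # every remaining name is blocked: contradictory orderings
    return order


def infer_content_model(child_sequences):
    """Guess a DTD content model from the child-name lists of one element type."""
    names = []  # distinct child names, in first-seen order
    for seq in child_sequences:
        for n in seq:
            if n not in names:
                names.append(n)
    if not names:
        return "EMPTY"

    before = set()  # (a, b) means a was seen before b in some instance
    order = None
    consistent = True
    for seq in child_sequences:
        runs = [name for name, _ in groupby(seq)]  # collapse P,P,Q to P,Q
        if len(runs) != len(set(runs)):
            consistent = False  # a name recurs after another intervened
            break
        for i, a in enumerate(runs):
            for b in runs[i + 1:]:
                before.add((a, b))
    if consistent:
        order = _topo_order(names, before)
    if order is None:
        return "(" + " | ".join(names) + ")*"  # fallback: any order, any count

    parts = []
    for n in order:
        counts = [seq.count(n) for seq in child_sequences]
        lo, hi = min(counts), max(counts)
        if lo >= 1:
            suffix = "" if hi == 1 else "+"
        else:
            suffix = "?" if hi == 1 else "*"
        parts.append(n + suffix)
    return "(" + ", ".join(parts) + ")"


print(infer_content_model([["P", "Q", "Q", "S"], ["P", "R", "S", "T"]]))
print(infer_content_model([["P", "Q", "R"], ["R", "Q", "P"]]))
```

The first call prints (P, Q*, R?, S, T?) and the second prints (P | Q | R)*, matching the two cases described above. Mixed content (text between children) is ignored here for brevity.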
 
Michael Kay
http://www.saxonica.com/


From: Steven J. DeRose [mailto:sderose@a...]
Sent: 20 October 2008 18:56
To: Farkas, Illes; xml-dev@l...
Subject: Re: converting 1-20 GB xml to xsd, visualizing on webpage

There are infinitely many schemas that will match any given set of data, so there is no single schema to extract....

If you just want *some* schema under which the document(s) is(are) valid, you could just extract all the element types that occur, say with something like

     grep -o '<[_:a-zA-Z][-_.:a-zA-Z0-9]*' documentname.xml | sort -u

and then, with a few global changes in an editor, create a declaration for each one that allows unrestricted content. You'd need to do something similar for attributes, but then it should all validate.
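
The same idea can be sketched in Python without the editor step. This is only an illustration under some assumptions: the function names collect_names and permissive_dtd are hypothetical, the input is assumed not to use namespaces (ElementTree would otherwise report tags as {uri}local), and every attribute is simply declared as optional CDATA. The parse is incremental and discards each element once it is closed, so memory use should stay small even for a multi-gigabyte file.

```python
import io
import xml.etree.ElementTree as ET


def collect_names(source):
    """Stream through the XML and map each element name to its attribute names."""
    elements = {}
    stack = []  # currently open elements, outermost first
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            # Attributes are already available when the start event fires.
            elements.setdefault(elem.tag, set()).update(elem.attrib)
            stack.append(elem)
        else:
            stack.pop()
            elem.clear()  # drop text, attributes and children
            if stack:
                stack[-1].remove(elem)  # detach from parent so it can be freed
    return elements


def permissive_dtd(elements):
    """Declare every element as ANY and every attribute as optional CDATA."""
    lines = []
    for name in sorted(elements):
        lines.append(f"<!ELEMENT {name} ANY>")
        for attr in sorted(elements[name]):
            lines.append(f"<!ATTLIST {name} {attr} CDATA #IMPLIED>")
    return "\n".join(lines)


# Tiny in-memory sample; for a real file, pass its path to collect_names().
sample = b'<doc id="1"><title>T</title><p class="x">a<b>c</b></p><p/></doc>'
print(permissive_dtd(collect_names(io.BytesIO(sample))))
```

For a real document, something like collect_names("documentname.xml") would replace the in-memory sample.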

If you want a little more information so you can build a more detailed schema, my xmlstats utility (in Perl) at http://derose.net/steve/utilities/xmlstats has options to tell you what element types occur within what other ones, and from that you could derive a more restrictive schema. The most obvious one would be to declare each element to permit the OR of all the element types that ever occur in it. That misses useful restrictions, for example that TITLE must occur only once in each DIV and be its first child element, but it's better than just ANY for everything.
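
As a rough illustration of the "OR of everything that occurs inside" approach (this is not how xmlstats works internally; the function names collect_children and or_content_models are hypothetical), the Python sketch below streams through a document, records which element types appear as children of which, and emits one declaration per element. It assumes no namespaces, and it always includes #PCDATA in the choice so that any text content still validates; that is a simplification, not something the data proves.

```python
import io
import xml.etree.ElementTree as ET


def collect_children(source):
    """Stream through the XML and map each element name to its child names."""
    children = {}
    stack = []  # currently open elements, outermost first
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            children.setdefault(elem.tag, set())  # leaf elements get an entry too
            if stack:
                children[stack[-1].tag].add(elem.tag)
            stack.append(elem)
        else:
            stack.pop()
            elem.clear()
            if stack:
                stack[-1].remove(elem)  # keep memory flat on huge inputs
    return children


def or_content_models(children):
    """Emit <!ELEMENT> declarations that allow any observed child, in any order."""
    lines = []
    for parent in sorted(children):
        alternatives = ["#PCDATA"] + sorted(children[parent])
        model = "(" + " | ".join(alternatives) + ")"
        if len(alternatives) > 1:
            model += "*"  # DTD mixed content with children must be starred
        lines.append(f"<!ELEMENT {parent} {model}>")
    return "\n".join(lines)


sample = (b"<book><div><title>A</title><p>x</p></div>"
          b"<div><title>B</title><list><p>y</p></list></div></book>")
print(or_content_models(collect_children(io.BytesIO(sample))))
```

On the sample, DIV is declared as (#PCDATA | list | p | title)*, which shows the limitation mentioned above: nothing says TITLE comes first or occurs only once.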

There used to be some nice utilities for extracting a reasonable DTD from SGML documents; perhaps someone has one handy for XML?

Steve

At 7:24 AM +0200 10/20/08, Farkas, Illes wrote:
Dear List Members,

Do you happen to know of a Linux tool (or tools) that can extract the schema from a 1-20 GB XML file and visualize it for users, similarly to this page: http://psidev.sourceforge.net/mi/rel25/doc/ ?

Thanks in advance,
Illes Farkas, Ph.D.
http://angel.elte.hu/fij


-- 

Steve DeRose -- http://www.derose.net, email sderose@a...

