XML-DEV Mailing List Archive

Re: converting 1-20 GB xml to xsd, visualizing on webpage

  • From: Chin Chee-Kai <cheekai@s...>
  • To: Michael Kay <mike@s...>
  • Date: Tue, 21 Oct 2008 23:18:20 +0800

So you didn't truncate your explanation in the original post but actually meant it... then I shall delve further :)

Basically it all stems from which strategy you take: whether simplicity comes first or the right result comes first.

At one extreme, the simplest form is a schema that accepts everything, (.)*, which is pretty useless. At the other extreme is a schema that accepts only what that single instance contains, (PQR|RQP), which is probably useless most of the time as well. Other strategies fall in between, pleasing some users and driving others crazy. But as you said, working with only one instance is just not the best way to extrapolate or make assumptions about the "kinds" of instance that the instance's author wants in general. Still, if one instance is all that we have to work with, a user receiving the generated schema might need to hand-make another schema to prune out PRQ, RPQ, QPR & QRP, which is twice the work compared with assembling a schema for PQR and RQP constructively from the start. (As you might note, the number of orderings to prune grows factorially with more siblings.)
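To make the trade-off concrete, here is a small sketch that treats each content model as a regular expression over one-letter child names and lists which orderings of P, Q and R it accepts. The helper name `accepted_orderings` and the labels in `MODELS` are made up for illustration; only the three patterns come from the discussion.

```python
import re
from itertools import permutations

# Content models written as regular expressions, one letter per child element.
# The labels are illustrative; the patterns are the ones discussed above.
MODELS = {
    "accept anything": r".*",
    "exact instances": r"PQR|RQP",
    "generalised": r"(P|Q|R)*",
}


def accepted_orderings(pattern, children="PQR"):
    """Return the orderings of `children` that the content model accepts."""
    compiled = re.compile(pattern)
    return [
        "".join(order)
        for order in permutations(children)
        if compiled.fullmatch("".join(order))
    ]


for name, pattern in MODELS.items():
    print(f"{name:16} {pattern:10} {accepted_orderings(pattern)}")
```

Only the middle model limits itself to PQR and RQP; the other two let all six orderings through, so the four unwanted ones would have to be pruned by hand.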

On the streaming mechanism, I find it rather bold that when the DTD generator sees PQR, it assumes (P|Q|R)* right away and forgets about PQR (as it needs to conserve memory), hoping to find RQP, PRQ, RPQ, QPR & QRP later. In the instance under discussion, the DTD generator finds RQP and happily lives with the decision of (P|Q|R)* ever after. I can't say it is right or wrong, as it is a means of helping the user identify a potential pattern, which might just be the right pattern after all. Still, I find it rather bold...

regards,
Chin Chee-Kai


Michael Kay wrote:
In the case of the Saxon DTDGenerator, if it finds one instance where the children are PQR and another where they are RQP, then it generates the content model (P|Q|R)*.
Wouldn't (P|Q|R)* accept PQR and RQP, along with the not-necessarily-acceptable PRQ, RPQ, QPR and QRP?
I suppose you're just giving a quick description in the above? 

Granted, it is difficult to fathom the intent of the creator from just one instance; the most a heuristic can conclude without risking over-accepting potentially unwanted patterns would just be ((P|R)Q(P|R))*.
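As a quick check of how much tighter that pattern is, the following sketch enumerates every three-child sequence over P, Q and R and reports which ones each pattern accepts. The names `HEURISTIC`, `LOOSE` and `accepted` are illustrative only, and the patterns are read as ordinary regular expressions with one letter per child element.

```python
import re
from itertools import product

HEURISTIC = re.compile(r"((P|R)Q(P|R))*")  # the tighter pattern suggested above
LOOSE = re.compile(r"(P|Q|R)*")            # the pattern the generator produced


def accepted(model, length=3, alphabet="PQR"):
    """List every child sequence of the given length that `model` accepts."""
    return [
        "".join(seq)
        for seq in product(alphabet, repeat=length)
        if model.fullmatch("".join(seq))
    ]


print("heuristic:", accepted(HEURISTIC))
print("loose:    ", len(accepted(LOOSE)), "of", 3 ** 3, "sequences")
```

Under this reading the tighter pattern rejects PRQ, RPQ, QPR and QRP, though it still admits PQP and RQR (and repetitions of the three-child group), which were not in the instance either.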

 
Well, of course, the aim of a tool like this is to find the "best" pattern that matches all the instances available; and that's a completely open-ended task. If you only have a small number of instances (two in the example above) then guessing the "right" pattern is almost impossible, and on the whole I learnt from doing this that it's probably better to produce a pattern that is as simple as possible in preference to one that is the closest possible fit to the available instances. But of course there is no single right answer: you're working with incomplete information.
 
The Saxon tool works in streaming mode (which is important to this user) and that imposes additional constraints; it means that you can't remember all the instances that you have encountered. The strategy is to guess a content model from the first instance and then refine it as further instances are found, and because you haven't remembered details of all the instances, the only way you can refine it is to replace it by a pattern that subsumes the previous pattern. But as I said before, for most inputs the results are surprisingly close to the content model that a human author would have written.
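The guess-then-widen strategy can be sketched roughly as follows. This is not the Saxon DTDGenerator's code; the `ContentModel` and `Inferrer` classes are hypothetical, and the widening rule (fall back to a repeated choice as soon as a second, different child order appears) is an assumption chosen to reproduce the (P|Q|R)* example. Text content and attributes are ignored.

```python
import xml.sax
from collections import defaultdict


class ContentModel:
    """Running guess for one element type (hypothetical sketch).

    Only the current pattern is kept, never the instances themselves.
    """

    def __init__(self):
        self.sequence = None  # exact child order from the first instance
        self.choice = None    # set of child names once widened to (a|b|...)*

    def observe(self, children):
        children = tuple(children)
        if self.choice is not None:
            self.choice.update(children)       # already widened: add names
        elif self.sequence is None:
            self.sequence = children           # first instance: guess exact order
        elif children != self.sequence:
            # The old guess no longer fits; replace it with one that subsumes it.
            self.choice = set(self.sequence) | set(children)
            self.sequence = None

    def pattern(self):
        if self.choice is not None:
            return "(" + "|".join(sorted(self.choice)) + ")*"
        if not self.sequence:
            return "EMPTY"
        return "(" + ",".join(self.sequence) + ")"


class Inferrer(xml.sax.ContentHandler):
    """Feeds SAX events into one ContentModel per element name."""

    def __init__(self):
        super().__init__()
        self.models = defaultdict(ContentModel)
        self.stack = []  # child names collected for each currently open element

    def startElement(self, name, attrs):
        if self.stack:
            self.stack[-1].append(name)
        self.stack.append([])

    def endElement(self, name):
        self.models[name].observe(self.stack.pop())


handler = Inferrer()
xml.sax.parseString(b"<doc><a><P/><Q/><R/></a><a><R/><Q/><P/></a></doc>", handler)
print(handler.models["a"].pattern())
```

Because memory use depends on the nesting depth and the number of distinct element names rather than on document size, a sketch like this stays small even on very large inputs, at the price of never being able to tighten a pattern once it has been widened.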
 
Michael Kay
http://www.saxonica.com/

