
Re: It is okay for things to break in the future!

  • From: Rick Jelliffe <rjelliffe@allette.com.au>
  • To: xml-dev <xml-dev@lists.xml.org>
  • Date: Mon, 30 Jan 2023 15:54:51 +1100

I tend to agree with John.

We need to factor in the likely SDLC for our schemas and modeling.

In particular, if you are a multinational company or a company that will likely merge with other companies or buy in data, you can be pretty sure that the structure you derive from your document analysis will be revealed as incorrect/inappropriate/bogus as soon as it is faced with these new kinds of data.

Therefore you need to be much clearer about what is vocabulary (e.g. element names), what is semantically necessary structure (e.g. that a table cell occurs only in a row, and a row only in a table), and what are document constraints (e.g. that a section starts with one title, or that an address has one ZIP code at the end).

Failure to model these separately (e.g. with an open and loose base schema for the first two, and derived schemas or Schematrons for the last) causes extra work in later integration. Sometimes this is compounded by management: embarrassed that the new documents could not be shoe-horned into the standard schemas that so much modeling effort had been spent on, they blame the poor suckers who have to do the shoe-horning, rather than attributing the problem to a failure of awareness of the SDLC.
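To make the layering concrete, here is a minimal Python sketch rather than a real schema. The element names, the REQUIRED_PARENT table and both function names are assumptions invented for illustration; in practice the first layer would be a loose base schema and the second a derived schema or a Schematron.

```python
import xml.etree.ElementTree as ET

# Layer 1a: vocabulary (hypothetical element names, for illustration only).
VOCABULARY = {"doc", "section", "title", "para", "table", "row", "cell"}

# Layer 1b: semantically necessary structure, as child -> allowed parents.
REQUIRED_PARENT = {"cell": {"row"}, "row": {"table"}}


def check_base(root):
    """Open, loose checks: known names and necessary nesting only."""
    errors = []
    stack = [(root, None)]
    while stack:
        elem, parent = stack.pop()
        if elem.tag not in VOCABULARY:
            errors.append(f"unknown element <{elem.tag}>")
        allowed = REQUIRED_PARENT.get(elem.tag)
        if allowed is not None and (parent is None or parent.tag not in allowed):
            errors.append(f"<{elem.tag}> outside {sorted(allowed)}")
        stack.extend((child, elem) for child in elem)
    return errors


def check_house_rules(root):
    """Layer 2: document constraints, kept separate so they can be replaced."""
    warnings = []
    for section in root.iter("section"):
        children = list(section)
        titles = [c for c in children if c.tag == "title"]
        if len(titles) != 1 or children[0].tag != "title":
            warnings.append("section should start with exactly one title")
    return warnings


# A document from an acquired source: sound structure, different house rules.
merged = ET.fromstring(
    "<doc><section><para>No title here.</para>"
    "<table><row><cell>x</cell></row></table></section></doc>"
)
print(check_base(merged))         # base layer has no complaints
print(check_house_rules(merged))  # only the replaceable layer objects
```

The point of the split is that new kinds of documents only force a change to the second layer; the vocabulary and the necessary structure stay put.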

Regards
Rick

On Sat, Jan 28, 2023 at 12:12 PM John Cowan <johnwcowan@gmail.com> wrote:


On Sun, Sep 4, 2022 at 6:11 PM Roger L Costello <costello@m...> wrote:
 
Roger's Perspective: It is possible to know the current world. Developers can and should model the current world. The benefits of flagging data that violates the model outweigh the benefits of "coding for the future."

I Guess Everyone Else's Perspective: It is not possible to model the world. Even in incredibly simple ways. The costs of breaking the model when the world doesn’t agree with the model outweigh the benefits of flagging invalid data.

I wouldn't put it that way at all.  It's possible to model the world, and we do, all the time.  But we always do so on the basis of insufficient data.  At Lexis-Nexis, the 1-billion-document company, modelers would typically ask for a sample of documents (already very roughly XMLized) from which an XML Schema would be built.  It turned out that even a few hundred documents of a given type (too many to examine individually) were not enough to capture all possible structural features, never mind refinements like maximum length.  So we turned up the knob and started to ask for thousands or tens of thousands of documents and used some simple-minded software (which I wrote) to look at and count features and to determine which features were subordinate to which other features.
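As a rough illustration of this kind of feature counting (this is not the software described above, only a guess at its general shape), the Python sketch below tallies element names and parent/child pairs across a sample. The function names and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_features(documents):
    """Count element names and parent/child pairs across XML strings."""
    elements = Counter()
    nesting = Counter()
    for text in documents:
        root = ET.fromstring(text)
        elements[root.tag] += 1
        # root.iter() visits every element, so each child is counted once.
        for parent in root.iter():
            for child in parent:
                elements[child.tag] += 1
                nesting[(parent.tag, child.tag)] += 1
    return elements, nesting


def parents_of(nesting):
    """For each element name, list the parents it was seen under."""
    seen = {}
    for parent, child in nesting:
        seen.setdefault(child, set()).add(parent)
    return {child: sorted(parents) for child, parents in seen.items()}


# Hypothetical sample; a real run would read thousands of files.
sample = [
    "<case><court/><judge/></case>",
    "<case><court/><judge/><judge/></case>",
    "<case><court><judge/></court></case>",
]
elements, nesting = count_features(sample)
print(elements["judge"])             # 4
print(nesting[("court", "judge")])   # 1: rare, seen in a single document
print(parents_of(nesting)["judge"])  # ['case', 'court']
```

A pair with a very low count, like judge under court here, is exactly the kind of structural feature that a sample of a few documents could easily miss.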

Essentially the second perspective is a warning against overfitting <https://en.wikipedia.org/wiki/Overfitting>.  If we see that in our sample (and that's all we ever have, a sample) the longest given name ("first name", though it isn't always first) is 17 characters long, we probably don't want to introduce a constraint saying "maxlength(firstname) = 17".  As the WP article says, "The essence of overfitting is to have unknowingly extracted some of the residual variation (i.e., the noise) as if that variation represented underlying model structure."
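A small Python sketch of the same point, using made-up names: a limit taken from the longest value in the sample fits the sample perfectly and then rejects valid data that arrives later. The helper names and both lists are assumptions for illustration.

```python
def max_length(names):
    """Longest value observed in the sample."""
    return max(len(name) for name in names)


def violations(names, limit):
    """Names that a maxlength constraint of `limit` would reject."""
    return [name for name in names if len(name) > limit]


# Hypothetical sample, and data that only turns up after the schema ships.
sample = ["Ann", "Maximilian", "Christopher"]
later = ["Bob", "Wolfeschlegelstein"]

limit = max_length(sample)       # the overfitted constraint
print(violations(sample, limit)) # [] by construction
print(violations(later, limit))  # ['Wolfeschlegelstein']
```

The constraint can never fail on the data it was derived from, which is why the sample alone cannot tell you whether it is structure or noise.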

