Fwd: Data versioning strategy: address semantic, relationship, and syntactic changes?

From: "Greg Hunt" <greg@f...>
To: xml-dev@l...
Date: Mon, 10 Dec 2007 06:47:07 +1100


Roger,
I think that you need to look at some other things. Semantics, structure and syntax are at too low a level, because useful version management needs to be embedded in a business process or a set of business agreements. The real question is: how do we identify breaking and non-breaking changes? And then: how do we embed that identification in a change management process that minimises pain? For simple exchanges we can nail down the purposes that people put the data to fairly easily, and change management is straightforward. For more complex data exchanges involving multiple parties and multiple uses (I am thinking of some operational/statistical exchanges between Government agencies here), it is much, much harder. The agency concerns do not overlap neatly.

We might distinguish between strong and weak change management in this context, strong being highly specified change management and weak relying on human inspection and thought. Most of what follows addresses strong change management.

Change management requires constraints on users

For a versioning strategy to work there need to be constraints on the consumers (how do they extract element values, what aspects of the schema or message structure are they sensitive to). Once that is done, you can start to specify what a breaking or non-breaking change is. For example, I have seen code that simply walks through the DOM nodes of a document, extracting element values as it goes, making assumptions about the order and types of nodes. For that code there is no technical/structural change that is non-breaking, because even whitespace is significant to that consumer.
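
As a rough illustration of that kind of consumer, here is a minimal Python sketch. The order document, its element names and both helper functions are made up for the example. The positional reader silently returns the wrong value once the same data is pretty-printed, while the name-based reader does not care.

```python
from xml.dom import minidom

# Two serialisations of the same hypothetical order; only whitespace differs.
COMPACT = "<order><id>42</id><price>9.50</price></order>"
PRETTY = """<order>
  <id>42</id>
  <price>9.50</price>
</order>"""


def price_by_position(xml_text):
    """Brittle consumer: assumes the price is the second child node of the root."""
    root = minidom.parseString(xml_text).documentElement
    return root.childNodes[1].firstChild.data


def price_by_name(xml_text):
    """More tolerant consumer: finds the price element by name."""
    root = minidom.parseString(xml_text).documentElement
    return root.getElementsByTagName("price")[0].firstChild.data


if __name__ == "__main__":
    for label, doc in (("compact", COMPACT), ("pretty", PRETTY)):
        print(label, price_by_position(doc), price_by_name(doc))
```

In the pretty-printed form the first child of the root is a whitespace text node, so the second child is the id element and the positional reader reports the id as the price. Nothing fails loudly, which is the uncomfortable part.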

On Meaning: Certain meaning is harder than "good enough", "good enough" breaks in surprising ways

Technical breakage is one thing, semantic breakage is another. As an industry, and despite a lot of people going on about semantic markup and ontologies, we tend to underestimate just how fuzzy the terms that we use are. Too many ontologies deal only with simply and sharply defined nouns. Many business processes deal with things that are difficult to pin down precisely because they are fuzzy around the edges, whether through rapid change, limited knowledge, or because we simply do not care about the fine details of the definition. An example of the not caring is the common and unthinking use of postal delivery geography from ISO 3166 as a set of country codes - Bouvet Island, permanent human population zero, is a country; Scotland, with its own parliament and historical identity, is not? Given that, do we really know what country codes mean? There are other, more business-context-specific examples: where do order prices come from? What is a bread order? What exactly is delivery? Can you tell that I am not wild about your statement that semantics can be defined in a data dictionary? I am not sure that we can really pin down meaning in a complete way, but we can get to "good enough" without too much difficulty. The problem is that "good enough" must be checked and renegotiated whenever the world changes. For example, consider what happens when you try to use ISO 3166 for country of birth (I've seen it done). Any idea how many distinct political entities there have been called Lithuania in the last 100 years? Is East Germany the same as the Federal Republic of Germany? What do we do with decomposition, like Yugoslavia, where there is no simple mapping between the aggregate and the current set of entities?

Consumers use data for different purposes, defining the purposes is difficult beyond very small numbers of consumers of the data

A semantically non-breaking change for one class of consumer might present problems for another. Consider a statistical data flow with a number of elements in it that are not summed (e.g. a structure containing a count of heart attacks, a count of ambulance movements and a textual status report). On the face of it, in semantic terms, adding another statistical element for morbidity should not be a problem if the element can be ignored. However, someone out there will eventually try to count instances of morbidity statistics. If the semantics is like a set of Russian dolls, where do you stop?
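
To make the heart attack example concrete, here is a small Python sketch. The report layout, the element names and both consumers are invented for illustration. One consumer reads a single named element and is untouched by the new morbidity element; the other counts numeric elements and quietly changes its answer.

```python
import xml.etree.ElementTree as ET

# Hypothetical report before and after a morbidity element is added.
V1 = (
    "<report>"
    "<heartAttacks>12</heartAttacks>"
    "<ambulanceMovements>40</ambulanceMovements>"
    "<status>normal</status>"
    "</report>"
)
V2 = V1.replace("</report>", "<morbidity>3</morbidity></report>")


def heart_attacks(xml_text):
    """Consumer A: reads one named statistic and ignores everything else."""
    return int(ET.fromstring(xml_text).findtext("heartAttacks"))


def numeric_element_count(xml_text):
    """Consumer B: counts how many child elements carry a whole-number value."""
    root = ET.fromstring(xml_text)
    return sum(1 for child in root if (child.text or "").strip().isdigit())


if __name__ == "__main__":
    print(heart_attacks(V1), heart_attacks(V2))
    print(numeric_element_count(V1), numeric_element_count(V2))
```

The same change is non-breaking for consumer A and breaking for consumer B, and nothing in the schema tells you that consumer B exists.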

Some thoughts about semantic operations - an ontology of purposes?

If we are going to try to manage semantic change, we need to address the scope of the semantics beyond dictionary definitions. There are operations that are based in semantics. For example, are the structures that make up a document countable, summable or comparable inside and between instances of the document? Countable means that the number of instances of the "thing" has some meaning. Summable means that two instances of the element can be combined in some way. Comparable means that two instances of an element can be compared (comparison by name? comparison by structure and name?). Are two instances of an address structure comparable if they have different structure versions? That depends on the intent of the comparer. Addresses have a number of purposes and a change may only impact one purpose. Adding a postal delivery point ID to a physical address used for legal service is likely (only likely, not guaranteed) to have no effect at all on the service purpose. If these types of operations are defined then the impact of a change can be more clearly specified.
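
One way to picture what "defining these operations" could look like is the Python sketch below. Everything in it is an assumption made for the example: the purpose names, the declared operations, the change record and the two helper functions. It declares, per element and purpose, which operations are supported, and then asks which of a consumer's uses a given change touches.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operations:
    countable: bool
    summable: bool
    comparable: bool


# Hypothetical declarations: (element, purpose) -> supported operations.
DECLARED = {
    ("address", "legal-service"): Operations(True, False, True),
    ("address", "postal-delivery"): Operations(True, False, True),
}


@dataclass(frozen=True)
class Change:
    element: str
    description: str
    affects: frozenset  # purposes the change is believed to touch


def supports(element, purpose, operation):
    """True if the named operation is declared for this element and purpose."""
    ops = DECLARED.get((element, purpose))
    return bool(ops and getattr(ops, operation, False))


def impacted(change, consumer_uses):
    """Return the (element, purpose) pairs a consumer relies on that the change touches."""
    return sorted(
        (element, purpose)
        for element, purpose in consumer_uses
        if element == change.element and purpose in change.affects
    )


if __name__ == "__main__":
    change = Change("address", "add postal delivery point ID",
                    frozenset({"postal-delivery"}))
    print(impacted(change, {("address", "legal-service")}))
    print(impacted(change, {("address", "postal-delivery")}))
```

The "affects" set is still filled in by a person, so this only records the judgement that a change is likely to leave the legal service purpose alone. It does not prove it.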

Is an ontology of purposes possible?

For a strong versioning strategy/change management strategy to work, we need an ontology that is tied to the document structure so that we can minimise ambiguity. For this element, what guarantees can we make and what operations are supported? What operations will we support? If we merge a postal address and a physical address (because they are identical), are we allowed to count address elements or do we have to count the number of purposes that the elements are put to? Is this possible at all?

Versioning - the original question

It's not a versioning strategy that is needed. We can attach some kind of version identifier, do stuff to make the versions identifiable and to an extent backward compatible, but the problem is the change management strategy.

Can we identify change that has an impact? For some purposes we can, but in non-trivial cases we can never be really sure that we have captured the definition of a significant change.

Are the distinctions between the types of change significant? I suspect that in reality they are not. They will all bite in interesting ways. We can minimise the amount of breaking change through various techniques, but those techniques are like those applied to object models - if you get it right it works really well, if you mistake the direction of change you have a big problem. The XML tool sets that we have make responding to breaking change a bit easier, but they are not guaranteed to make it simple, and in any case it probably should not be transparent.

Greg

On 12/8/07, Costello, Roger L. <costello@m...> wrote:

Hi Folks,

Oftentimes when discussing a "versioning strategy" I focus on how to
design schemas in a fashion to lessen the impact of changes.  It occurs
to me that this addresses only one aspect of the data versioning
problem.  Below I have attempted to identify other issues to be
addressed in a data versioning strategy.  I am interested in hearing
your thoughts on this.

EVOLVING DATA

Suppose some data is regularly exchanged between machines:

Machine 1 --> data --> Machine 2
Machine 1 <-- data <-- Machine 2

Periodically the data changes due to requirement changes, additional
insights, or from innovation.

A change results in a new "version" of the data.

PROBLEM

What are the categories of changes that may occur?  What categories of
changes must be dealt with by a data versioning strategy?

CATEGORIES OF CHANGE

1. Semantic - the meaning of the data changes.

Example:

version 1 data: a "distance" value means the distance from the center
of town.

version 2 data: a distance value means the distance from the town line.

2. Relationship - the relationship between the data changes.

Example:

version 1 data: there is a co-constraint between the start-time and the
end-time.

version 2 data: there is a three-way co-constraint between start-time,
end-time, and mode-of-transportation.

3. Syntax - the structure of the data changes.

Example:

version 1 data: the employee data is listed first and the person's name
is given by his given-name and surname.

version 2 data: the department data is listed first and in the employee
data each person's name additionally contains a middle name.
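
The relationship example in category 2 can be sketched in Python as follows. The original message does not say what the two co-constraints actually are, so both rules here are assumptions made up for illustration: version 1 requires the end-time to be after the start-time, and version 2 additionally requires a minimum duration that depends on the mode of transportation.

```python
from datetime import datetime, timedelta

# Hypothetical minimum trip durations per mode; not taken from the original message.
MIN_DURATION = {
    "walk": timedelta(minutes=30),
    "car": timedelta(minutes=5),
}


def valid_v1(start, end):
    """Assumed version 1 rule: the end-time must be after the start-time."""
    return end > start


def valid_v2(start, end, mode):
    """Assumed version 2 rule: the v1 rule plus a mode-dependent minimum duration."""
    return (
        valid_v1(start, end)
        and mode in MIN_DURATION
        and end - start >= MIN_DURATION[mode]
    )


if __name__ == "__main__":
    start = datetime(2007, 12, 10, 9, 0)
    end = start + timedelta(minutes=10)
    print(valid_v1(start, end))
    print(valid_v2(start, end, "car"), valid_v2(start, end, "walk"))
```

Data that is valid under the version 1 rule can become invalid under version 2 without any change to its structure.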

SUPPORTING TECHNOLOGIES

Suppose the data being exchanged is formatted using the XML syntax.

Machine 1 --> XML --> Machine 2
Machine 1 <-- XML <-- Machine 2

What technologies support the above categories of change?

1. Semantic: A data dictionary may be used to define meaning.

2. Relationship: Schematron may be used to express relationships
between data.

3. Syntax: XML Schema, Relax NG, or DTD may be used to express the
structure of the data.

REQUIREMENTS ON A VERSIONING STRATEGY

A versioning strategy must take into consideration:

- changes in the semantics of the data
- changes in the relationships of the data
- changes in the syntax of the data

When data is in an XML format then a versioning strategy must
implement:

- versioning a data dictionary
- versioning a Schematron schema
- versioning an XML Schema, Relax NG schema, or DTD

QUESTIONS

a. Do you agree with the three categories of change?

b. Do these categories represent all types of change?

c. Do you agree that a versioning strategy must address semantic,
relationship, and syntactic changes?

/Roger


