
Re: How to avoid (minimize) errors due to copying, pasting,and

  • From: Rick Jelliffe <rjelliffe@allette.com.au>
  • To: Hans-Juergen Rennau <hrennau@yahoo.de>
  • Date: Thu, 24 May 2018 02:07:58 +1000

Your problem is how to enforce the desired invariants between a source and a target data set.

I have used Schematron for this on several large corpora (hundreds of thousands of documents). For example, one invariant might be that the source and target documents have the same number of headings.
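The heading-count invariant can also be sketched outside Schematron. The following Python is only an illustration, not the rules used on those corpora: the function names are hypothetical, and it assumes you can say which element names count as headings on each side (here `title` in the source and `h1`/`h2` in the target).

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def count_elements(xml_text, names):
    """Count elements whose local name is in `names`, at any depth."""
    root = ET.fromstring(xml_text)
    return sum(
        1
        for el in root.iter()
        if isinstance(el.tag, str) and local_name(el.tag) in names
    )


def same_heading_count(source_xml, target_xml, source_names, target_names):
    """Invariant: source and target contain the same number of headings."""
    return count_elements(source_xml, source_names) == count_elements(
        target_xml, target_names
    )


if __name__ == "__main__":
    # Hypothetical documents: 'title' marks a heading in the source,
    # 'h1'/'h2' mark headings in the target.
    source = "<doc><title>A</title><sect><title>B</title></sect></doc>"
    target = "<html><h1>A</h1><h2>B</h2></html>"
    print(same_heading_count(source, target, {"title"}, {"h1", "h2"}))
```

A count is deliberately crude: it says nothing about which heading went missing, only that one did.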

Or that every unique hexagram (six-character string) made of the last three visible characters of a paragraph plus the first three of the immediately following paragraph in the source should also be found in the target: this catches dropped or reordered paragraphs even if the transformation has restructured branches. These problems are the bane of ETL tools used for XML, and they are serious.
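A minimal sketch of the hexagram check, in Python rather than Schematron. It assumes the paragraphs have already been extracted as plain strings in document order, reads "visible" as "not whitespace", and reads "unique" as "distinct"; the function names are made up for the example.

```python
def visible(text):
    """Keep only visible characters (assumption: everything except whitespace)."""
    return "".join(ch for ch in text if not ch.isspace())


def boundary_hexagrams(paragraphs):
    """Distinct strings made of the last 3 visible characters of each paragraph
    plus the first 3 of the paragraph that follows it.

    Paragraphs with no visible characters are skipped; paragraphs with fewer
    than 3 visible characters contribute a shorter string.
    """
    cleaned = [v for v in (visible(p) for p in paragraphs) if v]
    return {a[-3:] + b[:3] for a, b in zip(cleaned, cleaned[1:])}


def missing_hexagrams(source_paragraphs, target_paragraphs):
    """Source boundary hexagrams that cannot be found in the target."""
    return boundary_hexagrams(source_paragraphs) - boundary_hexagrams(
        target_paragraphs
    )


if __name__ == "__main__":
    source = ["Alpha one.", "Beta two.", "Gamma three."]
    dropped = ["Alpha one.", "Gamma three."]  # middle paragraph lost
    print(sorted(missing_hexagrams(source, dropped)))
```

Because only paragraph boundaries are compared, the check does not care which branch of the tree a paragraph ends up in, only whether its neighbour is still its neighbour.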

The first thing is to classify each problem by severity (SEV1, SEV2, etc.) and, if known, likelihood. Then write tests: a high-severity problem needs good tests, but an unlikely, low-risk one may just need a canary in a cage, allowing false positives for the sake of simplicity.
Determine what failure rate is acceptable (e.g., 0 x SEV1, 0 x SEV2, 1 x SEV3 error per 100 documents).
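As a sketch of how such an acceptance rule might be checked mechanically: the limits below simply mirror the example figures, and the names `MAX_PER_100`, `rates_per_100` and `acceptable` are hypothetical. It assumes each test failure has been recorded as a severity label.

```python
from collections import Counter

# Hypothetical limits mirroring the example: allowed errors per 100 documents.
MAX_PER_100 = {"SEV1": 0.0, "SEV2": 0.0, "SEV3": 1.0}


def rates_per_100(findings, documents_checked):
    """Errors per 100 documents for each severity in MAX_PER_100.

    `findings` is an iterable of severity labels, one per detected error.
    """
    if documents_checked <= 0:
        raise ValueError("documents_checked must be positive")
    counts = Counter(findings)
    return {
        sev: 100.0 * counts.get(sev, 0) / documents_checked for sev in MAX_PER_100
    }


def acceptable(findings, documents_checked):
    """True if every severity stays within its allowed rate."""
    rates = rates_per_100(findings, documents_checked)
    return all(rates[sev] <= limit for sev, limit in MAX_PER_100.items())


if __name__ == "__main__":
    # Two SEV3 errors in 200 documents is 1 per 100: just within the limit.
    print(acceptable(["SEV3", "SEV3"], 200))
```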

If you have a large corpus, decide how many to look at: select randomly. If you have fewer than 10,000 documents in the source corpus, test them all. If you have more than 10,000 documents, randomly select 10,000 to process and check.  (Or use a stats calculator to see the kinds of sizes you need for 99.9% confidence etc.) 
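The selection rule, plus one possible stand-in for the stats calculator, might look like this in Python. The sample-size formula is an assumption about what you want to know: it gives the number of independently sampled documents needed to see at least one failure with the given confidence when failures occur at a given rate. Other questions (for example, estimating the rate itself) need different formulas. Function names are hypothetical.

```python
import math
import random


def select_for_checking(doc_ids, cap=10_000, seed=None):
    """Return every document if the corpus is within the cap, else a random sample."""
    doc_ids = list(doc_ids)
    if len(doc_ids) <= cap:
        return doc_ids
    # Sampling without replacement; pass a seed to make the selection repeatable.
    return random.Random(seed).sample(doc_ids, cap)


def sample_size_to_detect(failure_rate, confidence):
    """Documents needed to observe at least one failure with probability
    `confidence`, assuming independent draws and a true rate of `failure_rate`.

    Solves 1 - (1 - failure_rate) ** n >= confidence for n.
    """
    if not (0.0 < failure_rate < 1.0 and 0.0 < confidence < 1.0):
        raise ValueError("failure_rate and confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


if __name__ == "__main__":
    print(len(select_for_checking(range(250_000), seed=42)))
    print(sample_size_to_detect(failure_rate=0.001, confidence=0.999))
```

Under those assumptions, catching a 1-in-1,000 failure with 99.9% confidence takes about 6,900 documents, which sits comfortably inside the 10,000 rule of thumb.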

Another way is to round-trip the document back to your source format, then compare the result with the original. That makes the assertions easier, but the transform might be harder, and it depends on what information is lost on the way up or down.
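A rough sketch of the round-trip comparison in Python. The `forward` and `backward` arguments stand for your own source-to-target and target-to-source transforms, which are assumed to exist as functions from string to string; the toy transforms in the demo are there only to show information being lost.

```python
import difflib


def round_trip_diff(source_text, forward, backward):
    """Transform to the target format and back, then diff against the original.

    An empty list means the round trip reproduced the source line for line.
    """
    restored = backward(forward(source_text))
    return list(
        difflib.unified_diff(
            source_text.splitlines(),
            restored.splitlines(),
            fromfile="source",
            tofile="round-trip",
            lineterm="",
        )
    )


if __name__ == "__main__":
    # Toy transforms: upper-casing on the way up loses the original case,
    # so lower-casing on the way down cannot restore the capital letter.
    for line in round_trip_diff("Abc\ndef", str.upper, str.lower):
        print(line)
```

In practice you would normalise both sides first (whitespace, attribute order and so on), so that the diff reports only differences you care about.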


On Wed, 23 May 2018, 17:08 Hans-Juergen Rennau, <hrennau@yahoo.de> wrote:
Hi Roger, dividing the problem into creating and checking resources, and focusing on the second, I think the magic word is *structured information*. Unfortunately, the awareness of structured information and its potential usefulness is very low. Or let me be more precise: the awareness of chances to use structured information creatively, spontaneously, inventively, in response to your needs of quality assurance, rather than along the trodden and obvious paths.

To illustrate the thought: imagine a specification written in DocBook, and a CSV file compiling some data paths in the second column. The following XQuery (using an extension function offered by BaseX)

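(: Expected paths: second column of every CSV record (csv:parse is the BaseX extension) :)
let $pathExpected := unparsed-text('paths.csv') ! csv:parse(.)//record/entry[2]
(: Documented paths: first column of the DocBook table whose xml:id is 'paths' :)
let $pathFound := doc("rethinking13.xml")/descendant::*:table[@xml:id eq 'paths']//*:row/*:entry[1]/string()
(: Report every expected path that has no match in the table :)
return $pathExpected[not(. = $pathFound)]/string()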

gives me all paths found in the CSV but forgotten in the DocBook table. I do not think many people would have recognized this possibility, although there is a DocBook file and a CSV file. So part one of an attempt at an answer is: SEE the structured information which is there.

While part 2 is: ADD it, where it isn't.

The rest is XQuery, or any other language speaking structured information as found in resources, natively.

With kind regards,

On Thursday, 17 May 2018, 13:59:41 CEST, Costello, Roger L. <costello@mitre.org> wrote:

Hi Folks,

I am working on a project that has created a large, complex data specification. There are tables in the data specification, from which I created Schematron rules. The tables specify a bunch of codes. When I created the Schematron rules, I accidentally missed some of the codes. I discovered this omission only after considerable effort and expense.

I got to thinking about all the other places along the path to creating the data specification where data might have accidentally been dropped, altered, added, or put in the wrong place. I don't know, but I suspect the data specification was produced something like this: several subject matter experts jotted down some ideas on a piece of paper and handed it to another person who typed up their ideas. [Potential for errors at this step] The typed document then goes to a publication office which typesets and officially publishes the data specification. [Potential for errors at this step] Then, of course, people use the data specification in their own endeavors, which provides more opportunities where errors may be introduced.

It occurs to me that quite possibly lots of errors are due to simple human errors from copying, pasting, transcribing. How to avoid this?

