Re: A standard approach to glueing together reusable XML fragments in prose?

To: Bruce.Cox@U...
Subject: Re: A standard approach to glueing together reusable XML fragments in prose?
From: "Chiusano Joseph" <chiusano_joseph@b...>
Date: Fri, 22 Aug 2003 08:56:35 -0400
Cc: xml-dev@l...
Organization: Booz Allen Hamilton
References: <86A9A7F2AFF74941A3885A174686E48D0D9158F1@u...>


<Quote>
For us, then, it is unlikely that there ever would be a practical
application for reusable content on anything other than a fairly small
scale.
</Quote>

Yes - and I'll bet there would be a high value in reusable metadata -
e.g. schemas - for patent specifications.

Kind Regards,
Joe Chiusano
Booz | Allen | Hamilton

Bruce.Cox@U... wrote:
> 
> As a rule, there is little or no reusable content in patent specifications.
> Not surprising, since they are *supposed* to be unique.  There is reusable
> content in many of the publications produced by the USPTO that explain how
> to file a patent, etc., but there are only a few dozen of such documents as
> opposed to about 6.5 million published patent grants.  (Only about half of
> those are available as text, starting in the 1970's, and only those
> published since 1999-04-13 are available as SGML/XML.  If we convert the
> backfile to our current XML DTD, we expect to need no more than a few
> variations of the DTD to accommodate differences in publishing practice over
> the period 1790 to the present.)
> 
> Next year, we will begin developing means to process patent applications and
> correspondence with applicants in XML.  The current application backlog is
> about 500,000, and with a minimum of say, four or five messages, the number
> of transactions is fairly large.  Here, there is reusable content (some few
> hundred "form paragraphs") that examiners pick from cascading menus,
> depending on the nature of the correspondence.  (This is not random letter
> writing, but highly ritualized gesture based on statute, rules, and past
> litigation.)  Once the correspondence is sent, however, it is static, and
> never changes.  The same is true for published grants and published
> applications, that is, they are static.
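
A minimal sketch of the "pick form paragraphs, then freeze" pattern described above. The paragraph ids, the wording, and the `letter`/`para` element names are all invented for illustration and are not the USPTO's actual form paragraphs or DTD. The point shown is only that the reusable text is copied into the letter by value, so later edits to the library do not alter correspondence that has already been sent.

```python
import xml.etree.ElementTree as ET

# Hypothetical form-paragraph library; ids and wording are invented.
FORM_PARAGRAPHS = {
    "fp-001": "The application is incomplete for the reasons stated below.",
    "fp-002": "A reply is required within the period set in this letter.",
}


def build_letter(paragraph_ids, library):
    """Assemble a letter by copying the chosen form paragraphs into it."""
    letter = ET.Element("letter")
    for pid in paragraph_ids:
        para = ET.SubElement(letter, "para", {"source": pid})
        para.text = library[pid]  # copied by value, not linked to the library
    return ET.tostring(letter, encoding="unicode")


# The serialized letter is a snapshot taken at the time of sending.
sent = build_letter(["fp-001", "fp-002"], FORM_PARAGRAPHS)

# A later revision of the library leaves the sent letter untouched.
FORM_PARAGRAPHS["fp-001"] = "Revised wording."
```

The `source` attribute keeps a record of which form paragraph was used without making the letter depend on the library's current contents.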
> 
> As for searching, we use OpenText's BRS Search (does not support XML at
> present).
> 
> For us, then, it is unlikely that there ever would be a practical
> application for reusable content on anything other than a fairly small
> scale.
> 
> Bruce B. Cox
> SA4XMLT
> USPTO/OCIO/AETS
> 703-306-2606
> 
> -----Original Message-----
> From: dbexcom [mailto:lbradshaw@d...]
> Sent: Tuesday, August 19, 2003 11:47 AM
> To: mitch.amiano@s...; xml-dev@l...
> Subject: Re:  A standard approach to glueing together reusable XML
> fragments in prose?
> 
> I am concerned to hear this approach, and others here, discussed, without
> comment as to scaling issues regarding very large datastores (in XML
> documents or in relational dbms) that might be ten to several hundred
> terabytes in size.
> 
> Specifically, in the following respects:
> 1- sheer size problems such as disk access time, out of memory conditions,
> and processor time to parse very large XML documents (say, 1,000 documents
> of 1 terabyte each) or a very large number of XML documents of smaller size
> (say, 5,000,000 5MB docs).
> 2- maintenance issues driven by the smallest of interface changes or
> presentation changes, that result in hundreds of thousands if not millions of
> manual static schema modifications, rippling across either a very large
> number of smaller XML documents and their specific schemas or through as
> many as a thousand or so documents of 1 terabyte each in size. Even if such
> ripple effect maintenance can be automated, the processing time required to
> update, say,  5,000,000 XML doc files of 5MB each cannot be said to be real
> time, so perhaps weeks of processing time is required before the interface
> mods can be subject to just one full test.
> 3- consistency across versions, releases, XML standards and tool sets (MS,
> SQL Server, MySQL, Oracle, etc) considering that a very large scale project
> will take some time to mature (possibly years), and that a lack of backward
> compatibility could drive massive changes into the basic XML design
> structure and overall document architecture.
> 4- transmission time across interchanges - whether LAN, web or intranet
> based, the time to transmit and parse XQuery result sets is often very
> large, and for very large XML documents this processing time is unacceptably
> long. People want results in five to eleven seconds, not minutes, not hours.
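
The sizes quoted in points 1 and 2 can be put into rough numbers. This is a back-of-envelope sketch only: the throughput figure is an assumption chosen for illustration, not a measurement, and it models a single sequential pass with no parallelism.

```python
MB = 10**6   # decimal megabyte, in bytes
TB = 10**12  # decimal terabyte, in bytes


def total_bytes(doc_count, doc_size_bytes):
    """Total corpus size for doc_count documents of equal size."""
    return doc_count * doc_size_bytes


def hours_to_process(size_bytes, throughput_bytes_per_sec):
    """Hours needed for one sequential pass at a fixed throughput."""
    return size_bytes / throughput_bytes_per_sec / 3600


# The two corpus shapes mentioned in the post.
many_small = total_bytes(5_000_000, 5 * MB)  # 5,000,000 docs of 5 MB = 25 TB
few_large = total_bytes(1_000, 1 * TB)       # 1,000 docs of 1 TB = 1,000 TB

# Assumed (hypothetical) end-to-end parse-and-rewrite rate: 50 MB per second.
ASSUMED_THROUGHPUT = 50 * MB

small_hours = hours_to_process(many_small, ASSUMED_THROUGHPUT)
large_hours = hours_to_process(few_large, ASSUMED_THROUGHPUT)

print(f"{many_small / TB:.0f} TB, about {small_hours / 24:.1f} days per pass")
print(f"{few_large / TB:.0f} TB, about {large_hours / 24:.1f} days per pass")
```

Under that assumed rate, one full pass over the 25 TB corpus takes several days, which is consistent with the concern that a corpus-wide change cannot be tested quickly.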
> 
> I have specific experience in very large paper based, and relational
> database systems. From time to time, I see folks scale up systems that work
> fine, up to a point, past which they are forced to redesign from scratch.
> 
> While I agree that broadly generalized discussions are the most common form
> of technical exchange of information, having seen several of these pilot
> efforts crash and burn, I feel a moral obligation to suggest that some
> comment be made as to scaling issues, known propagation or ripple effects,
> and sheer size problems that come into play when viable "average"
> architectures are scaled beyond their design parameters.
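
One partial answer to the out-of-memory concern in point 1 is incremental parsing, where each record is handled and discarded as it is read rather than loading the whole document. The sketch below uses `iterparse` from Python's standard `xml.etree.ElementTree`; the `patents`/`patent` element names and the sample data are hypothetical. It addresses memory use only, not disk access time or total processing time.

```python
import io
import xml.etree.ElementTree as ET


def count_records(stream, record_tag):
    """Count record_tag elements without building the full tree in memory."""
    count = 0
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == record_tag:
            count += 1
            # Release this record's text and children once it has been handled.
            # The emptied element itself stays attached to its parent.
            elem.clear()
    return count


# Small in-memory stand-in for a very large file (hypothetical structure).
sample = io.BytesIO(
    b"<patents>" + b"<patent><title>t</title></patent>" * 1000 + b"</patents>"
)
record_count = count_records(sample, "patent")
print(record_count)
```

For a real file, `stream` would be a file object opened in binary mode, or a filename.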
> 
> In reference to this specific method, I submit that when dealing with a very
> large repository of prose, a very large number of "profile documents"
> is possible, and that the number of possible "profile documents"
> correlates to some index of the context and the subject matter and the usage
> purposes (inquiry / result pairs), a result that to my mind increases or
> scales up as the number of prose entities scales up. I will go further and
> say that, for instance, for all articles ever published in the scientific
> journal "Nature", or perhaps all items in the U.S. Library of Congress or
> all pending applications and issued patent files in the U.S. Patent Office,
> this number of possible "profile documents" becomes very large indeed.
> Though it may be possible to satisfy as much as a majority of inquiries
> with a small number of such structures, the rest of the inquiries,
> it seems to me, will require an ever increasing number of "profile
> documents" to satisfy so that satisfying the last 1 percent of such
> inquiries might require several thousands of such "profile documents", if
> not tens of thousands or hundreds of thousands.
> 
> So, I am interested to hear about practical applications using XML-only
> implementations (XQuery, XML, XSLT, XPath, etc) that deal with wide ranging
> subject matter, such as is found  in the scientific journal "Nature", or
> perhaps all items in the U.S. Library of Congress or all pending
> applications and issued patent files in the U.S. Patent Office, to a very
> broad audience, across scientific disciplines and cultures (and possibly
> languages), for a very large data repository of mixed content (prose,
> graphics, slides, photos, video, sound, other streaming data sources or
> media) measured in tens or hundreds of terabytes.
> 
> While XML is superb at document mark up, in my experience almost as good as
> TeX, it does not strike me as the best tool for the job when dealing with
> very large scale data repositories. Still, I have an open mind and perhaps
> someone here can enlighten me.
> 
> Thank you.
> 
> At 10:28 PM 8/18/2003 -0400, you wrote:
> >One of the difficulties in considering factoring out functionally
> >dependent entities from prose, is that the block of prose may itself
> >not be worth reusing. That is, the prose may be a one-shot document
> >whose original intent is simply to present information, not to act as a
> >reliable container for access by clients with a variety of intents.
> >One thing I've done is to try to identify those concepts which are best
> >understood, are most firmly established, and which serve as the focus
> >of the stakeholders' activities and communications.  Then design a
> >profile document for each of these high-level concepts, which provide
> >context for making pointers and for generating identifiers. The
> >profiles are designed to provide some elements which are rigidly
> >structured, and other elements which are prose with mixed content. In
> >one case at least, this allowed me (with a stylesheet) to resolve most
> >cross references internal to the document itself, minimizing calls to
> >scan external documents. Also, depending upon the nature of your data
> >and your validation techniques, you may be able to use the mixed
> >content prose as the source of the definitive information, rather than just
> >as glue.
> >It is certainly something a good CMS can help with, but I've also used
> >DSSSL and XSLT/XPath for doing just this sort of thing with reasonable
> >results. You might also want to check out DITA by Michael Priestley et al.
> >of IBM, which I think intends to facilitate topical reuse.
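
A rough sketch of the profile-document idea: rigidly structured elements sit alongside mixed-content prose, and cross references are resolved inside the same document without scanning external files. The original used a stylesheet (DSSSL or XSLT/XPath); Python is swapped in here only to keep the example self-contained, and the `profile`, `fact`, `prose`, and `xref` names are hypothetical, not taken from the actual profiles.

```python
import xml.etree.ElementTree as ET

# Hypothetical profile document: structured facts plus mixed-content prose.
PROFILE = """<profile id="Miami">
  <facts>
    <fact id="state">Florida</fact>
    <fact id="population">1,234,000</fact>
  </facts>
  <prose>The city of Miami, <xref idref="state"/> (pop. <xref idref="population"/>) is a sprawling city.</prose>
</profile>"""


def render_prose(profile_xml):
    """Resolve each xref against the facts held in the same document."""
    root = ET.fromstring(profile_xml)
    facts = {fact.get("id"): fact.text for fact in root.iter("fact")}
    prose = root.find("prose")
    parts = [prose.text or ""]
    for child in prose:
        if child.tag == "xref":
            parts.append(facts[child.get("idref")])
        parts.append(child.tail or "")  # prose that follows the reference
    return "".join(parts)


rendered = render_prose(PROFILE)
print(rendered)
```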
> >
> >Roger L. Costello wrote:
> >
> >>Hi Folks,
> >>I am working with some people who wish to migrate from an all-prose
> >>format to a prose-plus-reusable-XML-fragments format.
> >>They have some data in prose that is useable in many contexts.  They
> >>want to break out that reusable data  into XML fragments.  However,
> >>they want to continue to provide the prose style.
> >>For example, consider this prose data:
> >><para>The city of Miami, Florida (pop. 1,234,000) is a sprawling city
> >>with many attractions.  Miami Beach is a popular attraction.  The
> >>spring tide is ... The neap tide is ... </para>
> >>Examining this prose we can extract reusable info about the city of
> >>Miami:
> >><City id="Miami">
> >>     <state>Florida</state>
> >>     <population>1,234,000</population>
> >></City>
> >>We can also extract reusable info about tide data on Miami Beach:
> >><TideData id="MiamiBeachTides">
> >>     <springTide>...</springTide>
> >>     <neapTide>...</neapTide>
> >></TideData>
> >>The problem now is to create a framework which allows the prose to
> >>bring-together the independent, reusable XML components.
> >>Conceptually, what is desired is a "glue framework" like this:
> >><para>The <ref href="Miami.xml"/> is a sprawling city with many
> >>attractions.  Miami Beach is a popular attraction.  The tides are <ref
> >>href="MiamiBeachTides.xml"/></para>
> >>Thus, the prose is "glueing" together the XML fragments.
> >>Is this a problem that you have experience with?  What  "glue
> >>framework" have you used?  What strategy did you use to merge the XML
> >>fragments with the prose?  Is there a standard way of combining
> >>semi-structured data with structured data?
> >>/Roger
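
One possible reading of the "glue framework" asked about above, as a minimal sketch rather than a standard answer: each `ref` in the prose is replaced by the fragment its `href` names. An in-memory dictionary stands in for the files `Miami.xml` and `MiamiBeachTides.xml`, the `ref` elements are written as empty elements, and `resolve_refs` is a name made up for this example.

```python
import xml.etree.ElementTree as ET

# In-memory stand-ins for the fragment files named in the prose.
FRAGMENTS = {
    "Miami.xml": (
        '<City id="Miami"><state>Florida</state>'
        "<population>1,234,000</population></City>"
    ),
    "MiamiBeachTides.xml": (
        '<TideData id="MiamiBeachTides"><springTide>...</springTide>'
        "<neapTide>...</neapTide></TideData>"
    ),
}

PROSE = (
    '<para>The <ref href="Miami.xml"/> is a sprawling city with many '
    "attractions.  Miami Beach is a popular attraction.  The tides are "
    '<ref href="MiamiBeachTides.xml"/></para>'
)


def resolve_refs(prose_xml, fragments):
    """Return the prose tree with every ref replaced by its fragment."""
    root = ET.fromstring(prose_xml)
    # Snapshot the elements first so newly inserted fragments are not revisited.
    for parent in list(root.iter()):
        for index, child in enumerate(list(parent)):
            if child.tag != "ref":
                continue
            fragment = ET.fromstring(fragments[child.get("href")])
            fragment.tail = child.tail  # keep the prose that followed the ref
            parent[index] = fragment
    return root


merged = resolve_refs(PROSE, FRAGMENTS)
print(ET.tostring(merged, encoding="unicode"))
```

The merged tree still carries the fragments as structured elements. Turning `City` back into a phrase such as "city of Miami, Florida (pop. 1,234,000)" would be a separate rendering step, for example in a stylesheet.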
> >>
> >>-----------------------------------------------------------------
> >>The xml-dev list is sponsored by XML.org <http://www.xml.org>, an
> >>initiative of OASIS <http://www.oasis-open.org> The list archives are
> >>at http://lists.xml.org/archives/xml-dev/
> >>To subscribe or unsubscribe from this list use the subscription
> >>manager: <http://lists.xml.org/ob/adm.pl>

