Re: A standard approach to glueing together reusable XML frag
<Quote>
For us, then, it is unlikely that there ever would be a practical
application for reusable content on anything other than a fairly small
scale.
</Quote>

Yes - and I'll bet there would be a high value in reusable metadata -
e.g. schemas - for patent specifications.

Kind Regards,
Joe Chiusano
Booz | Allen | Hamilton

Bruce.Cox@U... wrote:
>
> As a rule, there is little or no reusable content in patent specifications.
> Not surprising, since they are *supposed* to be unique. There is reusable
> content in many of the publications produced by the USPTO that explain how
> to file a patent, etc., but there are only a few dozen of such documents as
> opposed to about 6.5 million published patent grants. (Only about half of
> those are available as text, starting in the 1970s, and only those
> published since 1999-04-13 are available as SGML/XML. If we convert the
> backfile to our current XML DTD, we expect to need no more than a few
> variations of the DTD to accommodate differences in publishing practice over
> the period 1790 to the present.)
>
> Next year, we will begin developing means to process patent applications and
> correspondence with applicants in XML. The current application backlog is
> about 500,000, and with a minimum of, say, four or five messages, the number
> of transactions is fairly large. Here, there is reusable content (some few
> hundred "form paragraphs") that examiners pick from cascading menus,
> depending on the nature of the correspondence. (This is not random letter
> writing, but highly ritualized gesture based on statute, rules, and past
> litigation.) Once the correspondence is sent, however, it is static, and
> never changes. The same is true for published grants and published
> applications, that is, they are static.
>
> As for searching, we use OpenText's BRS Search (does not support XML at
> present).
>
> For us, then, it is unlikely that there ever would be a practical
> application for reusable content on anything other than a fairly small
> scale.
>
> Bruce B. Cox
> SA4XMLT
> USPTO/OCIO/AETS
> 703-306-2606
>
> -----Original Message-----
> From: dbexcom [mailto:lbradshaw@d...]
> Sent: Tuesday, August 19, 2003 11:47 AM
> To: mitch.amiano@s...; xml-dev@l...
> Subject: Re: A standard approach to glueing together reusable XML
> fragments in prose?
>
> I am concerned to hear this approach, and others here, discussed without
> comment as to scaling issues regarding very large datastores (in XML
> documents or in relational dbms) that might be ten to several hundred
> terabytes in size.
>
> Specifically, in the following respects:
> 1- sheer size problems such as disk access time, out-of-memory conditions,
> and processor time to parse very large XML documents (say, 1,000 documents
> of 1 terabyte each) or a very large number of XML documents of smaller size
> (say, 5,000,000 5MB docs).
> 2- maintenance issues driven by the smallest of interface changes or
> presentation changes, that result in hundreds of thousands if not millions of
> manual static schema modifications, rippling across either a very large
> number of smaller XML documents and their specific schemas or through as
> many as a thousand or so documents of 1 terabyte each in size. Even if such
> ripple effect maintenance can be automated, the processing time required to
> update, say, 5,000,000 XML doc files of 5MB each cannot be said to be real
> time, so perhaps weeks of processing time is required before the interface
> mods can be subject to just one full test.
> 3- consistency across versions, releases, XML standards and tool sets (MS,
> SQL Server, MySQL, Oracle, etc.) considering that a very large scale project
> will take some time to mature (possibly years), and that a lack of backward
> compatibility could drive massive changes into the basic XML design
> structure and overall document architecture.
> 4- transmission time across interchanges - whether LAN, web or intranet
> based, the time to transmit and parse result sets to XQuery is often very
> large, and for very large XML documents this processing time is unacceptably
> long. People want results in five to eleven seconds, not minutes, not hours.
>
> I have specific experience in very large paper-based and relational
> database systems. From time to time, I see folks scale up systems that work
> fine, up to a point, past which they are forced to redesign from scratch.
>
> While I agree that broadly generalized discussions are the most common form
> of technical exchange of information, having seen several of these pilot
> efforts crash and burn, I feel a moral obligation to suggest that some
> comment be made as to scaling issues, known propagation or ripple effects,
> and sheer size problems that come into play when viable "average"
> architectures are scaled beyond their design parameters.
>
> In reference to this specific method, I submit that when dealing with a very
> large repository of prose, a very large number of "profile documents"
> is possible, and that the number of possible "profile documents"
> correlates to some index of the context and the subject matter and the usage
> purposes (inquiry / result pairs), a result that to my mind increases or
> scales up as the number of prose entities scales up. I will go further and
> say that, for instance, for all articles ever published in the scientific
> journal "Nature", or perhaps all items in the U.S. Library of Congress or
> all pending applications and issued patent files in the U.S.
> Patent Office, this number of possible "profile documents" becomes very
> large indeed. Though it may be possible to satisfy as much as a majority of
> inquiries with a small number of such structures, the rest of the inquiries,
> it seems to me, will require an ever increasing number of "profile
> documents" to satisfy, so that satisfying the last 1 percent of such
> inquiries might require several thousands of such "profile documents", if
> not tens of thousands or hundreds of thousands.
>
> So, I am interested to hear about practical applications using XML-only
> implementations (XQuery, XML, XSLT, XPath, etc.) that deal with wide ranging
> subject matter, such as is found in the scientific journal "Nature", or
> perhaps all items in the U.S. Library of Congress or all pending
> applications and issued patent files in the U.S. Patent Office, to a very
> broad audience, across scientific disciplines and cultures (and possibly
> languages), for a very large data repository of mixed content (prose,
> graphics, slides, photos, video, sound, other streaming data sources or
> media) measured in tens or hundreds of terabytes.
>
> While XML is superb at document mark up, in my experience almost as good as
> TeX, it does not strike me as the best tool for the job when dealing with
> very large scale data repositories. Still, I have an open mind and perhaps
> someone here can enlighten me.
>
> Thank you.
>
> At 10:28 PM 8/18/2003 -0400, you wrote:
> >One of the difficulties in considering factoring out functionally
> >dependent entities from prose, is that the block of prose may itself
> >not be worth reusing. That is, the prose may be a one-shot document
> >whose original intent is simply to present information, not to act as a
> >reliable container for access by clients with a variety of intents.
> >One thing I've done is to try to identify those concepts which are best
> >understood, are most firmly established, and which serve as the focus
> >of the stakeholders' activities and communications. Then design a
> >profile document for each of these high-level concepts, which provides
> >context for making pointers and for generating identifiers. The
> >profiles are designed to provide some elements which are rigidly
> >structured, and other elements which are prose with mixed content. In
> >one case at least, this allowed me (with a stylesheet) to resolve most
> >cross references internal to the document itself, minimizing calls to
> >scan external documents. Also, depending upon the nature of your data
> >and your validation techniques, you may be able to use the mixed
> >content prose as the source of the definitive information, rather than
> >just as glue.
> >It is certainly something a good CMS can help with, but I've also used
> >DSSSL and XSLT/XPath for doing just this sort of thing with reasonable
> >results. You might also want to check out DITA by Michael Priestley et al.
> >of IBM, which I think intends to facilitate topical reuse.
> >
> >Roger L. Costello wrote:
> >
> >>Hi Folks,
> >>I am working with some people who wish to migrate from an all-prose
> >>format to a prose-plus-reusable-XML-fragments format.
> >>They have some data in prose that is useable in many contexts. They
> >>want to break out that reusable data into XML fragments. However,
> >>they want to continue to provide the prose style.
> >>For example, consider this prose data:
> >><para>The city of Miami, Florida (pop. 1,234,000) is a sprawling city
> >>with many attractions. Miami Beach is a popular attraction. The
> >>spring tide is ... The neap tide is ...
> >></para>
> >>Examining this prose we can extract reusable info about the city of
> >>Miami:
> >><City id="Miami">
> >>  <state>Florida</state>
> >>  <population>1,234,000</population>
> >></City>
> >>We can also extract reusable info about tide data on Miami Beach:
> >><TideData id="MiamiBeachTides">
> >>  <springTide>...</springTide>
> >>  <neapTide>...</neapTide>
> >></TideData>
> >>The problem now is to create a framework which allows the prose to
> >>bring together the independent, reusable XML components.
> >>Conceptually, what is desired is a "glue framework" like this:
> >><para>The <ref href="Miami.xml"/> is a sprawling city with many
> >>attractions. Miami Beach is a popular attraction. The tides are <ref
> >>href="MiamiBeachTides.xml"/></para>
> >>Thus, the prose is "glueing" together the XML fragments.
> >>Is this a problem that you have experience with? What "glue
> >>framework" have you used? What strategy did you use to merge the XML
> >>fragments with the prose? Is there a standard way of combining
> >>semi-structured data with structured data?
> >>/Roger
> >>
> >>-----------------------------------------------------------------
> >>The xml-dev list is sponsored by XML.org <http://www.xml.org>, an
> >>initiative of OASIS <http://www.oasis-open.org> The list archives are
> >>at http://lists.xml.org/archives/xml-dev/
> >>To subscribe or unsubscribe from this list use the subscription
> >>manager: <http://lists.xml.org/ob/adm.pl>
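
Roger's "glue framework" question at the bottom of the thread can be made concrete with a small sketch. It is only an illustration, not a method anyone in the thread proposed: it assumes the prose uses well-formed, self-closing <ref href="..."/> elements, that the fragments are held in an in-memory dictionary rather than fetched from Miami.xml and MiamiBeachTides.xml on disk, and that each fragment type has a hand-written rendering function. The names FRAGMENTS, RENDERERS, render_city, render_tides and resolve_refs are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment store keyed by href. In a real system these would
# be the separate files named in the href attributes.
FRAGMENTS = {
    "Miami.xml": (
        '<City id="Miami"><state>Florida</state>'
        "<population>1,234,000</population></City>"
    ),
    "MiamiBeachTides.xml": (
        '<TideData id="MiamiBeachTides"><springTide>...</springTide>'
        "<neapTide>...</neapTide></TideData>"
    ),
}


def render_city(city):
    """Turn a <City> fragment into a prose phrase (wording is an assumption)."""
    return "city of {}, {} (pop. {})".format(
        city.get("id"), city.findtext("state"), city.findtext("population")
    )


def render_tides(tides):
    """Turn a <TideData> fragment into a prose phrase (wording is an assumption)."""
    return "spring {}, neap {}".format(
        tides.findtext("springTide"), tides.findtext("neapTide")
    )


# One renderer per fragment root element name.
RENDERERS = {"City": render_city, "TideData": render_tides}


def resolve_refs(para_xml, fragments=FRAGMENTS, renderers=RENDERERS):
    """Return the paragraph as plain text with each direct-child <ref>
    replaced by the rendered form of the fragment it points to."""
    para = ET.fromstring(para_xml)
    pieces = [para.text or ""]
    for child in para:
        if child.tag == "ref":
            fragment = ET.fromstring(fragments[child.get("href")])
            pieces.append(renderers[fragment.tag](fragment))
        else:
            # Any other inline element: keep its text content unchanged.
            pieces.append("".join(child.itertext()))
        # Text that follows the child element inside the paragraph.
        pieces.append(child.tail or "")
    return "".join(pieces)


if __name__ == "__main__":
    glue = (
        '<para>The <ref href="Miami.xml"/> is a sprawling city with many '
        "attractions. Miami Beach is a popular attraction. The tides are "
        '<ref href="MiamiBeachTides.xml"/></para>'
    )
    print(resolve_refs(glue))
```

The sketch only handles <ref> elements that are direct children of the paragraph, and it does nothing about the scaling concerns raised earlier in the thread; a stylesheet-based approach such as the XSLT/XPath one Mitch mentions would be an alternative way to do the same merge.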