Re: A standard approach to glueing together reusable XML fragme
<Quote>
<;)>that's why we have 64 bit processors</;)>
</Quote>

Still very much in infancy, but catching on [1].

Kind Regards,
Joe Chiusano
Booz | Allen | Hamilton

[1] http://www.nwfusion.com/news/2003/071464bit.html

Rick Marshall wrote:
>
> On Wed, 2003-08-20 at 09:08, Chiusano Joseph wrote:
> > <Quote1>
> > processor time to parse very large XML documents (say, 1,000 documents
> > of 1 terabyte each)
> > </Quote1>
> >
> > If one's XML documents are 1 terabyte large, then they better rethink
> > their system architecture and design, and chop their documents up into
> > smaller pieces. A 32-bit processor can itself address only up to 4GB of
> > memory.
>
> <;)>that's why we have 64 bit processors</;)>
>
> >
> > <Quote2>
> > maintenance issues driven by the smallest of interface changes or
> > presentation changes, that result in hundred of thousands if not
> > millions of manual static schema modifications, rippling across either
> > a very large number of smaller XML documents and their specific schemas
> > or through as many as a thousand or so documents
> > </Quote2>
> >
> > One should never have to perform "hundred of thousands if not millions
> > of manual static schema modifications" - an XML registry and/or a robust
> > content management system should enable updates to be made in one
> > central location and propagated to all of the pertinent places (which
> > reference the central location by pointers). This also addresses your #3
> > point.
> >
> > <Quote3>
> > transmission time across interchanges - whether lan, web or intranet
> > based, the time to transmit and parse result sets to XQuery are often
> > very large, and for very large XML documents this processing time is
> > unacceptably long.
> > </Quote3>
> >
> > A very valid and well-known issue - and one of the reasons that some
> > brainstorming over binary XML is going on these days.
> >
> > Kind Regards,
> > Joe Chiusano
> > Booz | Allen | Hamilton
> >
> > dbexcom wrote:
> > >
> > > I am concerned to hear this approach, and others here, discussed, without
> > > comment as to scaling issues regarding very large datastores (in XML
> > > documents or in relational dbms) that might be ten to several hundred
> > > terabytes in size.
> > >
> > > Specifically, in the following respects:
> > > 1- sheer size problems such as disk access time, out of memory conditions,
> > > and processor time to parse very large XML documents (say, 1,000 documents
> > > of 1 terabyte each) or a very large number of XML documents of smaller size
> > > (say, 5,000,000 5MB docs).
> > > 2- maintenance issues driven by the smallest of interface changes or
> > > presentation changes, that result in hundred of thousands if not millions
> > > of manual static schema modifications, rippling across either a very large
> > > number of smaller XML documents and their specific schemas or through as
> > > many as a thousand or so documents of 1 terabyte each in size. Even if such
> > > ripple effect maintenance can be automated, the processing time required to
> > > update, say, 5,000,000 XML doc files of 5MB each cannot be said to be real
> > > time, so perhaps weeks of processing time is required before the interface
> > > mods can be subject to just one full test.
> > > 3- consistency across versions, releases, XML standards and tool sets (MS,
> > > SQL Server, MySQL, Oracle, etc) considering that a very large scale project
> > > will take some time to mature (possibly years), and that a lack of backward
> > > compatibility could drive massive changes into the basic XML design
> > > structure and overall document architecture.
> > > 4- transmission time across interchanges - whether lan, web or intranet
> > > based, the time to transmit and parse result sets to XQuery are often very
> > > large, and for very large XML documents this processing time is
> > > unacceptably long.
> > > People want results in five to eleven seconds, not
> > > minutes, not hours.
> > >
> > > I have specific experience in very large paper based, and relational
> > > database systems. From time to time, I see folks scale up systems that work
> > > fine, up to a point, past which they are forced to redesign from scratch.
> > >
> > > While I agree that broadly generalized discussions are the most common form
> > > of technical exchange of information, having seen several of these pilot
> > > efforts crash and burn, I feel a moral obligation to suggest that some
> > > comment be made as to scaling issues, known propagation or ripple effects,
> > > and sheer size problems that come into play when viable "average"
> > > architectures are scaled beyond their design parameters.
> > >
> > > In reference to this specific method, I submit that when dealing with a
> > > very large repository of prose, that a very large number of "profile
> > > documents" is possible, and that the number of possible "profile documents"
> > > correlates to some index of the context and the subject matter and the
> > > usage purposes (inquiry / result pairs), a result that to my mind increases
> > > or scales up as the number of prose entities scales up. I will go further
> > > and say that, for instance, for all articles ever published in the
> > > scientific journal "Nature", or perhaps all items in the U.S. Library of
> > > Congress or all pending applications and issued patent files in the U.S.
> > > Patent Office, this number of possible "profile documents" becomes very
> > > large indeed.
> > > Though it may be possible to satisfy as much as a majority of
> > > inquiries with a small number of such structures, the rest of the
> > > inquiries, it seems to me, will require an ever increasing number of
> > > "profile documents" to satisfy so that satisfying the last 1 percent of
> > > such inquiries might require several thousands of such "profile documents",
> > > if not tens of thousands or hundreds of thousands.
> > >
> > > So, I am interested to hear about practical applications using XML only
> > > implementation (XQuery, XML, XSLT, XPath, etc) that deal with wide ranging
> > > subject matter, such as is found in the scientific journal "Nature", or
> > > perhaps all items in the U.S. Library of Congress or all pending
> > > applications and issued patent files in the U.S. Patent Office, to a very
> > > broad audience, across scientific disciplines and cultures (and possibly
> > > languages), for a very large data repository of mixed content (prose,
> > > graphics, slides, photos, video, sound, other streaming data sources or
> > > media) measured in tens or hundreds of terabytes.
> > >
> > > While XML is superb at document mark up, in my experience almost as good as
> > > TeX, it does not strike me as the best tool for the job when dealing with
> > > very large scale data repositories. Still, I have an open mind and perhaps
> > > someone here can enlighten me.
> > >
> > > Thank you.
> > >
> > > At 10:28 PM 8/18/2003 -0400, you wrote:
> > > >One of the difficulties in considering factoring out functionally
> > > >dependent entities from prose, is that the block of prose may itself not
> > > >be worth reusing. That is, the prose may be a one-shot document whose
> > > >original intent is simply to present information, not to act as a reliable
> > > >container for access by clients with a variety of intents.
> > > >One thing I've done is to try to identify those concepts which are best
> > > >understood, are most firmly established, and which serve as the focus of
> > > >the stakeholders' activities and communications. Then design a profile
> > > >document for each of these high-level concepts, which provide context for
> > > >making pointers and for generating identifiers. The profiles are designed
> > > >to provide some elements which are rigidly structured, and other elements
> > > >which are prose with mixed content. In one case at least, this allowed me
> > > >(with a stylesheet) to resolve most cross references internal to the
> > > >document itself, minimizing calls to scan external documents. Also,
> > > >depending upon the nature of your data and your validation techniques, you
> > > >may be able to use the mixed content prose as the source of the definitive
> > > >information, rather than just as glue.
> > > >It is certainly something a good CMS can help with, but I've also used
> > > >DSSSL and XSLT/XPath for doing just this sort of thing with reasonable
> > > >results. You might also want to check out DITA by Michael Priestley et al.
> > > >of IBM, which I think intends to facilitate topical reuse.
> > > >
> > > >Roger L. Costello wrote:
> > > >
> > > >>Hi Folks,
> > > >>I am working with some people who wish to migrate from an
> > > >>all-prose format to a prose-plus-reusable-XML-fragments
> > > >>format.
> > > >>They have some data in prose that is useable in many contexts. They
> > > >>want to break out that reusable data into XML fragments. However,
> > > >>they want to continue to provide the prose style.
> > > >>For example, consider this prose data:
> > > >><para>The city of Miami, Florida (pop. 1,234,000) is a sprawling city
> > > >>with many attractions. Miami Beach is a popular attraction. The
> > > >>spring tide is ... The neap tide is ...
> > > >></para>
> > > >>Examining this prose we can extract reusable info about the city of
> > > >>Miami:
> > > >><City id="Miami">
> > > >>  <state>Florida</state>
> > > >>  <population>1,234,000</population>
> > > >></City>
> > > >>We can also extract reusable info about tide data on Miami Beach:
> > > >><TideData id="MiamiBeachTides">
> > > >>  <springTide>...</springTide>
> > > >>  <neapTide>...</neapTide>
> > > >></TideData>
> > > >>The problem now is to create a framework which allows the prose
> > > >>to bring-together the independent, reusable XML components.
> > > >>Conceptually, what is desired is a "glue framework" like this:
> > > >><para>The <ref href="Miami.xml"> is a sprawling city with
> > > >>many attractions. Miami Beach is a popular attraction. The
> > > >>tides are <ref href="MiamiBeachTides.xml"><para>
> > > >>Thus, the prose is "glueing" together the XML fragments.
> > > >>Is this a problem that you have experience with? What "glue
> > > >>framework" have you used? What strategy did you use to merge
> > > >>the XML fragments with the prose? Is there a standard way
> > > >>of combining semi-structured data with structured data?
> > > >>/Roger
> > > >>
> > > >>-----------------------------------------------------------------
> > > >>The xml-dev list is sponsored by XML.org <http://www.xml.org>, an
> > > >>initiative of OASIS <http://www.oasis-open.org>
> > > >>The list archives are at http://lists.xml.org/archives/xml-dev/
> > > >>To subscribe or unsubscribe from this list use the subscription
> > > >>manager: <http://lists.xml.org/ob/adm.pl>

begin:vcard
n:Chiusano;Joseph
tel;work:(703) 902-6923
x-mozilla-html:FALSE
url:www.bah.com
org:Booz | Allen | Hamilton;IT Digital Strategies Team
adr:;;8283 Greensboro Drive;McLean;VA;22012;
version:2.1
email;internet:chiusano_joseph@b...
title:Senior Consultant
fn:Joseph M. Chiusano
end:vcard
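
The "glue framework" Roger Costello asks about in the thread can be illustrated with a small sketch. This is only one possible reading of his conceptual example, not a method proposed by anyone in the thread. It assumes the `<ref>` elements are written as empty elements (`<ref href="..."/>`) so that the prose is well-formed, and it uses an in-memory dictionary, `FRAGMENTS`, as a hypothetical stand-in for the files `Miami.xml` and `MiamiBeachTides.xml`. The function name `resolve_refs` is likewise invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-ins for the files Miami.xml and MiamiBeachTides.xml.
FRAGMENTS = {
    "Miami.xml": (
        '<City id="Miami"><state>Florida</state>'
        "<population>1,234,000</population></City>"
    ),
    "MiamiBeachTides.xml": (
        '<TideData id="MiamiBeachTides"><springTide>...</springTide>'
        "<neapTide>...</neapTide></TideData>"
    ),
}

# The conceptual prose from the thread, with <ref> made self-closing
# (an assumption) so that it parses as well-formed XML.
PROSE = (
    '<para>The <ref href="Miami.xml"/> is a sprawling city with many '
    "attractions. Miami Beach is a popular attraction. The tides are "
    '<ref href="MiamiBeachTides.xml"/></para>'
)


def resolve_refs(prose_xml, fragments):
    """Return the prose tree with each <ref href="..."/> replaced by its fragment."""
    root = ET.fromstring(prose_xml)
    # Snapshot the elements first, because children are swapped while looping.
    for parent in list(root.iter()):
        for index, child in enumerate(list(parent)):
            if child.tag != "ref":
                continue
            fragment = ET.fromstring(fragments[child.get("href")])
            fragment.tail = child.tail  # keep the prose that follows the ref
            parent[index] = fragment    # one-for-one swap, so indexes stay valid
    return root


if __name__ == "__main__":
    merged = resolve_refs(PROSE, FRAGMENTS)
    print(ET.tostring(merged, encoding="unicode"))
```

The sketch pulls each fragment into the prose at the point of reference and leaves the surrounding mixed content untouched. It does not resolve references nested inside the fragments themselves, and it loads everything into memory, so it says nothing about the scaling concerns raised elsewhere in the thread. W3C XInclude defines a standard element for this kind of inclusion and may be a better fit than a custom `<ref>` element.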