
Re: An alternative formulation of the document-centric/data-ce


data centric vs document centric
At 11:01 AM 6/3/2004 +0100, Sean McGrath wrote:
Document-centric XML:
        XML in which corpora conforming to schema X exhibit power-law distributions of the element types in X.

Data-centric XML:
        XML in which corpora conforming to schema X exhibit uniform distributions of the element types in X.

Not perfect but useful nonetheless I think. Mixed content is missing for a start.

Anyway, please take a look at the graphs at:
        http://seanmcgrath.blogspot.com/2004_05_23_seanmcgrath_archive.html#108576202776583412

I'd be very interested in seeing other people's graphs of the tag-share of their XML corpora.
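
For anyone who wants to try this on their own corpus, here is a minimal Python sketch of one way to compute tag-share. It is only an illustration and not necessarily how the graphs above were produced; the function name `tag_share` and the two-document corpus are hypothetical, and it assumes each document fits in memory as a string.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def tag_share(xml_documents):
    """Return (tag, share) pairs, sorted by descending share, across all documents."""
    counts = Counter()
    for text in xml_documents:
        root = ET.fromstring(text)
        # root.iter() walks the root element and every descendant element.
        counts.update(elem.tag for elem in root.iter())
    total = sum(counts.values())
    return [(tag, n / total) for tag, n in counts.most_common()]


# Hypothetical two-document corpus, for illustration only.
corpus = [
    "<doc><title>A</title><p>one</p><p>two <em>three</em></p><p>four</p></doc>",
    "<doc><title>B</title><p>five</p><p>six</p></doc>",
]

for tag, share in tag_share(corpus):
    print(f"{tag:8s} {share:.2f}")
```

Plotting the shares against their rank would then show whether the curve is steeply skewed (document-centric, by the definition above) or roughly flat (data-centric).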

This reminds me of a classic paper by Darrell Raymond and Frank Tompa called "Hypertext and the Oxford English Dictionary" from the Communications of the ACM in 1988 or so. At Waterloo (Tim Bray was also part of this work at the time) they had a research program on how to handle large text data/hypertexts like the OED, in preparation for creating electronic versions, and they did a lot of very clever analyses of the dictionary, which had just been turned into SGML via conversion from the typesetting tapes. The paper includes several charts showing the distribution of (a) entry length, (b) number of tags per entry, (c) number of cross-references, and so on, and either explicitly or implicitly they show tag-share in the dictionary to have the kind of distribution that Sean has in his analyses.

Rick Jelliffe has some software that does the same sort of thing; I saw it demonstrated at the GCA XML conferences over the last year or so.

But I don't buy into this data-centric vs doc-centric view of the world. It is obviously a continuum (called the "Document Type Spectrum" in the Document Engineering book I'm writing with Tim McGrath [just about done, MIT Press early 2005]). On one end are purely narrative things and on the other end are purely transactional ones: Moby Dick to invoices. In the middle are hybrid types like catalogs and reference books that have lots of structured content mixed in with narrative content.

I always use Moby Dick as the endpoint when I talk about this because its opening line is "call me XML" or something like that. :-)

-bob glushko



