Re: Something altogether different?
Steve DeRose wrote:
>> To use markup well for this, it seems like you have to know something about its semantics -- which is hard, but maybe avoidable.
>> Maybe we can apply a similar Hidden Markov Model for documents and markup analysis?
>> On the other hand, what about a simpler approach to analyzing and using markup: what if Google were to do nothing more than to allow you to search for your words/phrases *only* within particular element types?

At the risk of being repetitive, I'll point again to Cohen's research at AT&T on WHIRL. (He's now at Carnegie Mellon.) Cohen's work takes the approach of representing a document as a set of terms and computing textual similarity.

"the term-weight representation of a document can be a surprisingly effective model of its semantic content; in particular, documents with intuitively similar semantic content often have similar representations."

Claude Bullard wrote:
>> Abstract topics regardless of the kind of expression used (eg, HTML vs X3D or SVG) should have the same vector values.

Cohen used Salton's model with fragments of text represented by document vectors. From Cohen's '99 WHIRL paper:

"One advantage of this "vector space" representation is that the similarity of two documents can be easily computed."

The excerpt below is from the 1999 paper. In a more recent paper published in the ACM Transactions on Information Systems, he wrote:

http://www-2.cs.cmu.edu/~wcohen/postscript/tois-whirl.pdf

"Inferences made by WHIRL are also surprisingly accurate, equaling the accuracy of hand-coded normalization routines on one benchmark problem, and outperforming exact matching with a plausible global domain on a second."

Excerpts from the 1999 paper:
--------------------------------
In this paper we describe WHIRL (for Word-based Heterogeneous Information Representation Language), a new type of information system that synergistically combines logic-based and text-based representation methods.
With respect to text, WHIRL adopts a key tool of modern text-based information systems: the term-weight representation for text, in which a document is represented as a set of terms, each associated with a numeric weight indicating its relative importance. (This is sometimes called a "bag of words" representation, since terms usually correspond to words). Term-based representations can be easily created and stored, and with suitable indices, many operations can be carried out very efficiently. Another advantage of this representation is that with a good weighting scheme, the term-weight representation of a document can be a surprisingly effective model of its semantic content; in particular, documents with intuitively similar semantic content often have similar representations. ...

In WHIRL, this notion of similarity has been closely integrated with logical deduction. WHIRL is a conventional logic (a subset of non-recursive Datalog) that has been extended by introducing an atomic type for textual entities, and an atomic operation for computing textual similarity. The presence of the "soft" similarity predicate necessitates a "soft" semantics; inferences in WHIRL are associated with numeric scores, and presented to the user in decreasing order by score, much like the documents returned by a ranked-retrieval IR system. ...

We will show that WHIRL strictly generalizes both IR ranked retrieval and logical deduction; that non-trivial queries concerning large databases can be answered efficiently; that WHIRL can be used to integrate data from distinct, distributed, heterogeneous, information sources, such as those found on the Web; that WHIRL can be used effectively for inductive classification of text; and finally that WHIRL can be used to extract data from structured documents, and to semi-automatically generate "wrappers" (extraction programs) for structured documents. ...
The general idea behind the vector representation is that the magnitude of the component vt is related to the "importance" of the term t in the document represented by ... One advantage of this "vector space" representation is that the similarity of two documents can be easily computed.
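
To make the vector-space idea concrete, here is a minimal Python sketch (mine, not code from Cohen's papers). It builds term-weight vectors for a few text fragments, scores pairs by cosine similarity, and ranks candidates in decreasing order of score. The weighting used, log(1 + TF) * log(N / DF), is one common TF-IDF variant and is an assumption here; the function names (`tokenize`, `term_weight_vectors`, `cosine`, `rank`) and the whitespace tokenizer are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive tokenizer (an assumption): lowercase and split on whitespace."""
    return text.lower().split()


def term_weight_vectors(docs):
    """Return one unit-length {term: weight} dict per document.

    Weight = log(1 + TF) * log(N / DF), an assumed TF-IDF variant.
    A term that occurs in every document gets weight 0.
    """
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vec = {t: math.log(1 + c) * math.log(n / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors (a dot product)."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def rank(query_index, vectors):
    """Rank all other documents by similarity to one document, best first."""
    q = vectors[query_index]
    scored = [(cosine(q, v), i) for i, v in enumerate(vectors) if i != query_index]
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    docs = [
        "xml markup search",
        "xml markup query",
        "garden tomato soup",
    ]
    vecs = term_weight_vectors(docs)
    for score, i in rank(0, vecs):
        print(f"{score:.3f}  {docs[i]}")
```

Fragments that share no terms score exactly zero, and fragments that share rare terms score higher than those that share only common ones, which is the sense in which similar content tends to yield similar vectors.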