RE: More on Vector Models
David, you seem determined to play Bacchus, come to demand a seat at the table of the gods by disruption. It's an immature strategy.

Yes, VSM is a form of document indexing and classification. It uses term frequency to create similarity metrics, typically the cosine of the angle between term vectors, which normalizes for vector length. There are a LOT of papers, freely available, that you can find simply by entering "vector space model" into that ever-loving simple box that does such a good job in cases where SQL falls on its bum. No structure == no SQL.

So one last time, as clearly as I can:

1. The problem of weakly structured (think RSS) or unstructured (think notepad files) data is classification.

2. The problem of XML is that it requires a priori classification, which may result in weak structuring or high costs.

3. The problem of the publish/subscribe model is that it invokes problems one and two automatically if a human does not intervene. Notification-based systems rely on triggers because humans know where to put those. Humans are expensive and make mistakes.

Wyman is right: analyze the query. This is the classic pattern identification problem. Regardless of the database system you use, query analysis is required to enable matching. In an unstructured or weakly structured world, the information of interest is in the text nodes. It is like having a message system that contains only two fields: call and response.

Vector space models and others like them use term frequency to establish similarity metrics. These metrics can be used to cluster documents with similar content even in the face of polysemy and synonymy. These are relatively old techniques and do require preprocessing, but HTML was an even older technique, as was markup, before the database community recognized them. Again, the problem of XML is a priori classification.
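
The term-frequency cosine measure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not a reference implementation: the function names (`term_frequencies`, `cosine_similarity`, `best_match`) and the sample texts are made up for the example, the tokenizer is deliberately crude, and it uses raw term counts with no weighting or other preprocessing that a real system would likely add.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Return a sparse term-frequency vector (term -> count) for a text.

    Assumption: terms are runs of ASCII letters, lowercased.
    """
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is similar to nothing
    return dot / (norm_a * norm_b)


def best_match(query, documents):
    """Return the document whose term vector is closest to the query's."""
    q = term_frequencies(query)
    return max(documents, key=lambda d: cosine_similarity(q, term_frequencies(d)))


if __name__ == "__main__":
    # Hypothetical text nodes pulled from weakly structured messages.
    docs = [
        "xml schema validation for feeds",
        "cooking pasta with tomato sauce",
    ]
    print(best_match("rss feeds and xml", docs))
```

Because the cosine divides by both vector lengths, a long message and a short query can still score as similar when they share the same terms in similar proportions, which is the normalization mentioned earlier.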
Just as HTML was a leap backwards to make forward progress, the publish/subscribe methods, particularly where based on weakly tagged message formats such as RSS, require another look to the past to bring forward the worst/best of the IR technologies, because the formats and models create exactly the same problems.

The database gurus of fifteen years ago did not believe markup was a solution for database integration issues. The markup gurus of today don't believe that geometry is a solution for pattern analysis. The past is not always informative if the environment has changed; on the other hand, a proven technique in a new environment can work better. HTML and XML are the proof that, for the most part, the SGMLers were right and the database experts were wrong.

A day in the library is worth a month in the lab.

len

From: David Lyon [mailto:david.lyon@c...]

ok, well I'm lost. Vectors are a simple mathematical paradigm. How do they apply to xml? or is it just a new type of marketing speak?