
RE: More on Vector Models


David, you seem determined to play Bacchus, come to 
demand a seat at the table of the gods by disruption.  
It's an immature strategy.  

Yes, VSM is a form of document indexing and classification. 
It uses term frequency to create similarity metrics, typically 
the cosine of the angle between term vectors, which normalizes 
the distance.  There are a LOT of freely available papers you can 
find simply by entering "vector space model" into 
that ever-loving simple box that does such a good job for 
cases where SQL falls on its bum.  No structure == no SQL.
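
A minimal sketch of that cosine metric, for anyone who wants to see 
the arithmetic.  The tokenizer, the function names and the sample 
texts are my own assumptions for illustration; a real system would 
add stop lists, stemming and term weighting.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Map a text to a sparse term-frequency vector (term -> count)."""
    # Assumption: lowercase alphanumeric runs are good enough as terms.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical sample texts.
doc1 = term_vector("XML schema validation for XML documents")
doc2 = term_vector("validation of XML documents against a schema")
doc3 = term_vector("cooking pasta with tomato sauce")

print(cosine(doc1, doc2))  # shared terms, so greater than zero
print(cosine(doc1, doc3))  # no shared terms, so zero
```

Because the dot product is divided by both vector lengths, a text 
and the same text repeated twice score as identical.  That is the 
normalization.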

So one last time as clear as I can:

1.  The problem of weakly structured (think RSS) or 
unstructured (think notepad files) data is classification. 

2.  The problem of XML is that it requires a priori classification 
that may result in weak structuring or high costs.

3.  The problem of the publish/subscribe model is that it 
invokes problems one and two automatically if a human does 
not intervene.  Notification based systems rely on triggers 
because humans know where to put those.  Humans are expensive 
and make mistakes.  Wyman is right: analyze the query. 

This is the classic pattern identification problem.  Regardless 
of the database system you use, query analysis is required to 
enable matching.  In an unstructured or weakly structured world, 
the information of interest is in the text nodes.  It is like 
having a message system that only contains two fields: call and response.
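
As a rough sketch of what analyzing the query can mean when all you 
have is text nodes: turn the subscriber's query and each message 
into term-frequency vectors and keep the messages whose cosine 
score clears a threshold.  The item titles, the sample texts, the 
`match` function and the 0.2 threshold are all assumptions made up 
for this illustration, not anyone's product.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Map a text to a sparse term-frequency vector (term -> count)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def match(query, items, threshold=0.2):
    """Return (score, title) pairs at or above threshold, best first."""
    q = term_vector(query)
    scored = [(cosine(q, term_vector(text)), title) for title, text in items]
    return sorted((s for s in scored if s[0] >= threshold), reverse=True)


# Hypothetical weakly structured messages: a title and a text node.
items = [
    ("item-1", "New XML schema validation tools released"),
    ("item-2", "Local team wins the football cup"),
    ("item-3", "Validation of XML feeds with a schema"),
]

for score, title in match("xml schema validation", items):
    print(f"{title}: {score:.3f}")
```

No trigger had to be placed by a human; the subscription is just 
another short document compared against the stream.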

Vector Space Models and others like them use term frequency to establish 
similarity metrics.  These metrics can be used to cluster documents 
with similar content even in the face of polysemy and synonymy.  These 
are relatively old techniques and do require preprocessing, but HTML 
was an even older technique, as was markup, before they were recognized 
by the database community.  
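
To make the clustering claim concrete, here is a small sketch of a 
single-pass "leader" clustering over term-frequency vectors.  It is 
one simple way to do it, picked for brevity and not the method of 
any particular paper; the `cluster` function, the 0.3 threshold and 
the sample texts are assumptions.  It uses raw term counts only, so 
the preprocessing and weighting steps are left out.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Map a text to a sparse term-frequency vector (term -> count)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster(texts, threshold=0.3):
    """Group text indices; each text joins the first close-enough leader."""
    clusters = []  # list of (leader_vector, member_indices)
    for index, text in enumerate(texts):
        vec = term_vector(text)
        for leader, members in clusters:
            if cosine(vec, leader) >= threshold:
                members.append(index)
                break
        else:
            # No existing leader is similar enough: start a new cluster.
            clusters.append((vec, [index]))
    return [members for _, members in clusters]


# Hypothetical unstructured texts.
texts = [
    "xml schema validation",
    "football cup final",
    "schema validation for xml feeds",
    "football cup results",
]

print(cluster(texts))
```

The result depends on the order of the texts and on the threshold, 
which is the usual price of a single pass.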

Again, the problem of XML is a priori classification.  Just as HTML was 
a leap backwards to make forward progress, the publish/subscribe 
methods, particularly where based on weakly tagged message formats such 
as RSS, require another look to the past to bring forward the worst/best 
of the IR technologies because the formats and models create exactly the 
same problems.  The database gurus of fifteen years ago did not believe 
markup was a solution for database integration issues.  The markup gurus 
of today don't believe that geometry is a solution for pattern analysis.  
The past is not always informative if the environment has changed; on 
the other hand, a proven technique in a new environment can work better.  
HTML and XML are the proof that for the most part, the SGMLers were 
right and the database experts were wrong.

A day in the library is worth a month in the lab.

len




From: David Lyon [mailto:david.lyon@c...]

ok, well I'm lost. Vectors are a simple mathematical
paradigm. How do they apply to xml? or is it just
a new type of marketing speak?
