
RE: Something altogether different?

  • To: "'Steven J. DeRose'" <sderose@a...>, 'XML Developers List' <xml-dev@l...>
  • Subject: RE: Something altogether different?
  • From: "Bullard, Claude L (Len)" <len.bullard@i...>
  • Date: Mon, 25 Apr 2005 14:14:29 -0500

A cool million if you can prove the Riemann hypothesis, and the eternal displeasure of
some cryptogeeks, but you definitely get your own personal footnote in the math books.
 
Yes, the system can be reliable in the face of noise if it is affordable, and that is
a matter of cost vs. requirement.  Speed is money; how fast can you afford to go?  I don't
dispute that what Bosworth is talking about will work.  Weaken the measurements and you
can fit the moon inside a bag... theoretically.
 
I would suspect that the markup-search-inside approach yields results the way searching
values inside rows does.  It's grouped a priori.  Also, doesn't the fact of a GI
in the context of other GIs reduce ambiguity (ummm... sure)?  Cosmic d'oh.
 
BTW:  Kuropka goes beyond Salton by not assuming independent terms.
 
http://www.kuropka.net/Dateien/TVSM.pdf
 
I see what you mean about co-relevance and the document level.
 
What caught my attention is the generality of a URI namespace as a topical name, and that
some names occur closer together or at higher frequencies than other names, and so cluster
among the vectors.  If topics are vector spaces, and topics are grouped, a tensor product
can be used to group the vectors.  All qubits are vectors and tensors can be used to group
these.  Abstract topics, regardless of the kind of expression used (e.g., HTML vs. X3D or
SVG), should have the same vector values.  The vector product is another kind of address,
as we discussed in the HyTime era.
 
Useful?  I can't tell.  It's intuitively appealing.  If a schema circumscribes the topicality of a
document, it is a tensor product of qubits.  My math is too deficient to get past the intuition.
 
Awfully glad you are back, Steve.
 
len

From: Steven J. DeRose [mailto:sderose@a...]
At 11:58 -0500 2005-04-22, Bullard, Claude L (Len) wrote:
1)  Can processes be reliable given noisy data

Of course -- just have to read Claude Shannon. Though here's one interesting bit about noise I just ran into: At http://www.maths.ex.ac.uk/~mwatkins/zeta/surprising.htm

>Indirectly, as a result of studying nonlinear dynamics Marek Wolf discovered two instances of apparent fractality within the distribution of prime numbers ([W2-3]). These discoveries were realised experimentally using powerful computers. Wolf's resulting interest in the distribution of the primes led him to experimentally discover  the presence of 1/f [pink] noise when the  primes are treated as a 'signal' in the sense of information theory ([W4]). This is also a self-similar (scale invariant, or fractal) property of the distribution of primes.

Connecting noise/information theory to the Riemann hypothesis -- now there's "Something altogether different", especially disruptive because trapdoor encryption methods depend on our not knowing how to find prime factors fast enough.... Oops....

...
I suggest a review of the works of Salton et al on
the vector space model, and the new refinements of
Dominick Kuropka et al on topic-based vector space
models.  Consider these in terms of namespaces as
provided by XML, and the implications given aggregate
...

I once spent a while working on the idea of incorporating markup into Salton-like metrics. The problem I ran into was that Salton's stuff is working solely at document-level, so even the fact that two words are merely at opposite ends of the document (versus being adjacent, for example) doesn't enter in. So markup giving you finer distinctions of co-relevance wouldn't help. First you have to find how to apply Salton-ish methods to finer-grained objects, which is not trivial. There are a couple papers on that, but last I looked, nothing very effective. To use markup well for this, it seems like you have to know something about its semantics -- which is hard, but maybe avoidable.
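To make the document-level limitation concrete, here is a minimal sketch of a bag-of-words cosine similarity in the spirit of the vector space model. It uses raw term counts rather than any particular weighting scheme, and the function names and sample sentences are invented for illustration. The two sample documents contain the same words, once with the query terms adjacent and once with them at opposite ends, and they score identically.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term counts: word order and position are discarded."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sample data: same words, different positions.
query = tf_vector("markup search")
adjacent = tf_vector("markup search helps retrieval of structured text")
far_apart = tf_vector("markup helps retrieval of structured text search")

score_adjacent = cosine(query, adjacent)
score_far_apart = cosine(query, far_apart)

# Both scores are the same, because the model never sees word positions.
print(score_adjacent, score_far_apart)
```

Any finer-grained use of markup would have to change what counts as a "document" here (for example, one vector per element), which is exactly the non-trivial step.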

In AI, a similar issue of accuracy vs. speed/simplicity/scalability in the face of noise and ambiguity was solved in the late 80's. Turns out, it's hard to assign the right part of speech to words. Almost everything is ambiguous (like "dog" can be a verb). Linguists had shown that there are cases where you *cannot* determine which way a word is functioning without knowing the whole semantics -- the reliability issue again. But getting the whole semantics is a lot of work, especially if you haven't figured out the part of speech yet. Ken Church and I showed in 1987 that you could get *better* reliability with purely statistical methods that ignored semantic questions. Yes, we got the "proof" cases wrong -- but we did better overall, and the method was practical (about O(ln N) instead of O(N**3), for any geeks among us). Now part-of-speech is nearly always done that way.
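As a rough illustration of purely statistical tagging, the sketch below decodes a toy Hidden Markov Model with the Viterbi algorithm. This is one common formulation and is assumed here for illustration; it is not claimed to be the 1987 method itself. The tag set, every probability, and the `FLOOR` smoothing constant are made-up values, not estimates from any corpus.

```python
import math

# All values below are hypothetical, chosen only to make the example work.
TAGS = ["DET", "NOUN", "VERB"]
START = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.3, "VERB": 0.2},
}
EMIT = {
    "DET": {"the": 0.9},
    "NOUN": {"dog": 0.5, "barks": 0.1},
    "VERB": {"dog": 0.1, "barks": 0.5},  # "dog" can be a verb, just less often
}
FLOOR = 1e-6  # probability assigned to word/tag pairs not listed above


def viterbi(words):
    """Return the most probable tag sequence for `words` under the toy model."""
    if not words:
        return []
    # Each column maps tag -> (log-prob of best path ending in tag, previous tag).
    best = [{
        t: (math.log(START[t]) + math.log(EMIT[t].get(words[0], FLOOR)), None)
        for t in TAGS
    }]
    for word in words[1:]:
        column = {}
        for tag in TAGS:
            prev, score = max(
                ((p, best[-1][p][0] + math.log(TRANS[p][tag])) for p in TAGS),
                key=lambda pair: pair[1],
            )
            column[tag] = (score + math.log(EMIT[tag].get(word, FLOOR)), prev)
        best.append(column)
    # Follow the back-pointers from the best final tag to the first word.
    tag = max(TAGS, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))


tagged = viterbi(["the", "dog", "barks"])
print(tagged)
```

No semantics are consulted anywhere: "dog" is resolved as a noun only because that tag sequence is the most probable one given its neighbors.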

Maybe we can apply a similar Hidden Markov Model for documents and markup analysis? If I had a grant I'd have time to write out a solution, but unfortunately it won't fit in the margin of this email. :)

On the other hand, what about a simpler approach to analyzing and using markup: what if Google were to do nothing more than to allow you to search for your words/phrases *only* within particular element types? No knowledge of what the elements mean, maybe even no knowledge of what schema or namespace. Just use exactly the same code they use to support "site:" and other prefixes. All of a sudden you can do some amazing things with XML data, and you get some help with HTML, too. Yes, it's badly broken and inadequate in a bunch of ways -- very much like URIs, which are equally broken but have served admirably anyway. It would also motivate use of markup and markup standardization big-time.
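A minimal sketch of element-scoped search, using Python's standard `xml.etree.ElementTree`. It says nothing about how any search engine works: the function `search_in_element` and the sample document are hypothetical, and matching is a plain case-insensitive substring test. Namespaces are deliberately ignored, in keeping with the "no knowledge of schema or namespace" idea.

```python
import xml.etree.ElementTree as ET


def search_in_element(xml_text, element_name, phrase):
    """Return the text of every `element_name` element whose content contains `phrase`."""
    root = ET.fromstring(xml_text)
    needle = phrase.lower()
    hits = []
    for elem in root.iter():
        # Compare local names only, dropping any "{namespace}" prefix.
        local = elem.tag.rsplit("}", 1)[-1]
        if local != element_name:
            continue
        # Gather all descendant text and normalize whitespace.
        text = " ".join("".join(elem.itertext()).split())
        if needle in text.lower():
            hits.append(text)
    return hits


# Hypothetical sample document.
SAMPLE = """<book>
  <title>Prime numbers and noise</title>
  <chapter>
    <title>Vector space models</title>
    <para>Prime numbers are mentioned here only in passing.</para>
  </chapter>
</book>"""

title_hits = search_in_element(SAMPLE, "title", "prime numbers")
para_hits = search_in_element(SAMPLE, "para", "prime numbers")
print(title_hits)
print(para_hits)
```

The same phrase gives different hits depending on the element type asked for, which is the whole of the feature: no semantics, just a filter on the element name.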

Now *there's* something completely different. Not because it's hard or brilliant -- but because, like TimBL's original Web idea, it would simply ignore the really tough problem of solving semantics and the cases we know won't work right, and just get on with it. Which for some purposes (not missile-targeting, please!) is fine.

Steve


-- 
Luthien Consulting: Real solutions to hard information management problems
   Specializing in XML, schema design, XSLT, and project design/review/repair
Steven J. DeRose, Ph.D., sderose@a...
