
Re: XML Search Engine

  • From: Tim Bray <tbray@t...>
  • To: "Michael Kay" <M.H.Kay@e...>, <xml-dev@i...>
  • Date: Thu, 05 Nov 1998 09:52:14 -0800

At 12:27 PM 11/5/98 -0000, Michael Kay wrote:
>Switching threads, I am a little surprised by Tim's remarks on word
>proximity versus character proximity. Confining our attention to European
>languages (as most search engines do), word proximity searching is a common
>feature of the high-end search engines, whereas character proximity is
>hardly found outside basic desktop tools like grep. 

What I said was:
1. I have not seen any research which demonstrates that word proximity
   achieves better results than character proximity based on any
   well-known IR metric.
2. Doing word proximity at all is a *very* hard problem in the languages
   used by a large majority of the world's population.
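To make the distinction in the two points above concrete, here is a minimal Python sketch, not taken from any real engine. The function names, the substring matching, and the regex-based tokenizer are all illustrative assumptions. Character proximity compares match offsets in the raw text; word proximity compares token indexes, so it is only as good as the tokenizer in front of it.

```python
import re

def char_proximity(text, a, b, max_chars):
    """True if some match of a and some match of b start within
    max_chars characters of each other (plain substring matching)."""
    text = text.lower()
    pos_a = [m.start() for m in re.finditer(re.escape(a.lower()), text)]
    pos_b = [m.start() for m in re.finditer(re.escape(b.lower()), text)]
    return any(abs(i - j) <= max_chars for i in pos_a for j in pos_b)

def word_proximity(text, a, b, max_words):
    """True if tokens a and b occur within max_words tokens of each other.
    Assumption: a naive tokenizer that splits on non-word characters."""
    tokens = re.findall(r"\w+", text.lower())
    pos_a = [i for i, t in enumerate(tokens) if t == a.lower()]
    pos_b = [i for i, t in enumerate(tokens) if t == b.lower()]
    return any(abs(i - j) <= max_words for i in pos_a for j in pos_b)

# Space-delimited text: both measures find the pair.
english = "The search engine ranks each document by proximity"
print(word_proximity(english, "search", "document", 4))
print(char_proximity(english, "search", "document", 30))

# Text without spaces between the words: the naive tokenizer sees a
# single token, so word proximity cannot match the parts at all.
compound = "Donaudampfschiff"
print(word_proximity(compound, "donau", "schiff", 3))
print(char_proximity(compound, "donau", "schiff", 20))
```

The same failure applies to any script written without word separators: without a language-specific segmenter, the word-based measure has nothing to count.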

>Apart from anything
>else, once you've done the word normalisation (normalising different
>linguistic forms or spellings of the same word), character proximity is
>meaningless. In the older boolean engines word proximity is used rather
>mechanistically, in the newer engines it is used more subtly as part of a
>statistical or linguistic approach to relevance ranking

If you go poking around either in the SIGIR world (that would be the 
Association for Computing Machinery's Special Interest Group on 
Information Retrieval) or in the actual commercial retrieval engine
world, you find a distressing lack of technology progress.  Yes, with
modern engines, precision & recall are measurably better than they
were in 1978.  But 10 times as good?  Hah!  Twice as good?  Maybe,
for certain restricted application domains.  Given all this, I'm
less than impressed about the subtle techniques of modern engines.
On top of which, most of the techniques used in the "advanced" engines
are basically Anglocentric and fall apart once you get outside the
English-speaking world.

> but either way it
>is an established feature of the scene, and it is not there on whim: the
>search algorithms used are based on extensive research and benchmarking of
>relevance and recall scores.

Yeah, well, it's *not* an established feature of the scene in Asia.  Maybe
it's just an irrational prejudice, but I'm not all that interested in
computing techniques that are not usable by a large majority of the
world's population.  And once again, I challenge the assertion that,
for all these clever heuristics, real-world retrieval software is
really much better than it was 20 years ago.

>An interesting comparison of web search engines is at
>http://www.netstrider.com/search/features.html ; this asserts that all the
>well-known web search engines other than Lycos use word proximity matching.

And we know what wonderful results they produce (that's in English; for
real joy, go try a tricky query in German - even European languages sometimes
leave out the spaces between the words - and see what happens).  -Tim

PS: Given my grouchy tone, I should say that I'm dazzled at the
inventiveness, deep thought, and creativity that have been invested
in the IR field in recent decades.  The fact the results are so
underwhelming is evidence of how hard the problems are... the real
lesson is that we should marvel at the language-processing apparatus
we carry around between our ears. -T

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@i...
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To (un)subscribe, mailto:majordomo@i... the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo@i... the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa@i...)

