
Re: searching for search

  • From: Walter Underwood <wunder@i...>
  • To: "Mark D. Anderson" <mda@d...>, <xml-dev@i...>
  • Date: Thu, 27 May 1999 10:57:57 -0700

Disclaimer: I'm a Staff Engineer at Infoseek Corp. I work
on Ultraseek Server, our search product (which can search XML).

At 01:00 PM 5/23/99 -0700, Mark D. Anderson wrote:
>Regarding the recent "Indexing XML Document Collections" thread...
>
>I'm interested in these questions:
>
>- in general, why would I pick one of these over another
>(i.e. boolean query vs. structured query; scalability in size
>or requests; pluggable format drivers for source data;
>stemming and concept support; etc.)

What is your search problem? If professional writers are
searching in a repository they understand, you might give
them pretty complex search. If Joe AOL is searching your
public site, you have to give good results for one-word
queries and give good results on the first page.

>- in general, what are the features that push a technology
>into another level of complexity and why (i.e. what is so
>hard here?)

Making something "excellent" instead of "pretty good" usually
means that you have to actually deal with all the picky cases
instead of pretending they don't exist. For example, our
spider has special code for Lotus Domino, special code to
recognize directory listings generated by various web servers,
special code to handle spaces in filenames on MS FTP servers,
and so on. Our spider has a *lot* more code than our search engine.

>- specifically, what are the characteristics of each of
>these in performance/reliability/features (personal experience
>from non-vendors and public benchmarks are of course preferred,
>but vendor claims might be of interest too)

I'll mostly defer to customer evals and our product web site, 
but for scalability you can try out www.infoseek.de, which has 
about 10 million documents and does about 1 million queries/day. 
The search back end is stock Ultraseek Server, and the front end 
is a custom pagebuilder.

>- can i safely ignore the non open source ones without giving
>up capabilities

Not really, at least according to our customers. In some areas,
open source tools are competitive; in others, they aren't. Search
is one of the latter. We routinely beat free tools in customer evals.

Personally, I use a free editor (Emacs), but a commercial bug-tracking
system (Globetrack). You've got to make your own evaluations, of
course.

>- if all i wanted to do was boolean search on field values with
>no stemming/concept support, then regardless of how i did the
>indexing, what is wrong with using standard b-trees and/or just
>putting the index data in a sql db?

Relevancy ranking would be nice. Going through thirty pages of
hits really bites.

And stemming does help. Phrase search helps a lot. Counting
inter-site links helps with very short queries. Anti-spam
algorithms help. Field weights help. Find Similar (query by
example) is useful. Indexing Microsoft Word, PDF, PostScript,
and XML is handy. And so on.
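
To make the ranking and field-weight points concrete, here is a minimal
Python sketch. It is only an illustration, not how Ultraseek Server (or
any particular engine) scores documents: the function names, the sample
documents, and the weights are all hypothetical, and the scoring is a
plain term-frequency times inverse-document-frequency sum. A pure
boolean index would return both matching documents unordered; here a
hit in the title outranks repeated hits in the body.

```python
import math

# Hypothetical field weights: a title hit counts more than a body hit.
FIELD_WEIGHTS = {"title": 3.0, "body": 1.0}


def tokenize(text):
    """Lowercase and split on whitespace (no stemming, no phrases)."""
    return text.lower().split()


def rank(docs, query):
    """Return matching doc ids, best score first.

    docs maps doc id -> {field name: text}. Documents that contain
    none of the query terms are left out, as in a boolean OR search.
    """
    terms = set(tokenize(query))
    n = len(docs)

    # Document frequency: number of documents containing each term.
    df = {}
    for term in terms:
        df[term] = sum(
            1
            for fields in docs.values()
            if any(term in tokenize(text) for text in fields.values())
        )

    scored = []
    for doc_id, fields in docs.items():
        score = 0.0
        for name, text in fields.items():
            tokens = tokenize(text)
            weight = FIELD_WEIGHTS.get(name, 1.0)
            for term in terms:
                if df[term] == 0:
                    continue  # term appears nowhere in the collection
                idf = math.log(1 + n / df[term])
                score += weight * tokens.count(term) * idf
        if score > 0:
            scored.append((score, doc_id))

    return [doc_id for _, doc_id in sorted(scored, reverse=True)]


# Made-up sample collection for illustration.
docs = {
    "a": {"title": "XML search", "body": "indexing documents"},
    "b": {"title": "release notes", "body": "search search fixes"},
    "c": {"title": "cooking", "body": "recipes"},
}

print(rank(docs, "search"))
```

Storing the postings in b-trees or a SQL table would work fine for the
lookup step; the part you would still have to write is the scoring and
sorting on top of it.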

Finally, please add this commercial product to your list:

Ultraseek Server: http://software.infoseek.com/

wunder


--
Walter R. Underwood
wunder@i...
wunder@b... (home)
http://software.infoseek.com/cce/ (my product)
http://www.best.com/~wunder/
1-408-543-6946

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@i...
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To (un)subscribe, mailto:majordomo@i... the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo@i... the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa@i...)


