Re: searching for search
Disclaimer: I'm a Staff Engineer at Infoseek Corp. I work on Ultraseek Server, our search product (which can search XML).

At 01:00 PM 5/23/99 -0700, Mark D. Anderson wrote:
>Regarding the recent "Indexing XML Document Collections" thread...
>
>I'm interested in these questions:
>
>- in general, why would I pick one of these over another
>(i.e. boolean query vs. structured query; scalability in size
>or requests; pluggable format drivers for source data;
>stemming and concept support; etc.)

What is your search problem? If professional writers are searching in a repository they understand, you might give them pretty complex search. If Joe AOL is searching your public site, you have to give good results for one-word queries and give good results on the first page.

>- in general, what are the features that push a technology
>into another level of complexity and why (i.e. what is so
>hard here?)

Making something "excellent" instead of "pretty good" usually means that you have to actually deal with all the picky cases instead of pretending they don't exist. For example, our spider has special code for Lotus Domino, special code to recognize directory listings generated by various web servers, special code to handle spaces in filenames on MS FTP servers, and so on. Our spider has a *lot* more code than our search engine.

>- specifically, what are the characteristics of each of
>these in performance/reliability/features (personal experience
>from non-vendors and public benchmarks are of course preferred,
>but vendor claims might be of interest too)

I'll mostly defer to customer evals and our product web site, but for scalability you can try out www.infoseek.de, which has about 10 million documents and does about 1 million queries/day. The search back end is stock Ultraseek Server, and the front end is a custom pagebuilder.

>- can i safely ignore the non open source ones without giving
>up capabilities

Not really, at least according to our customers.
In some areas, open source tools are competitive; in others, they aren't. Search is the latter. We routinely beat free tools in customer evals. Personally, I use a free editor (Emacs), but a commercial bug-tracking system (Globetrack). You've got to make your own evaluations, of course.

>- if all i wanted to do was boolean search on field values with
>no stemming/concept support, then regardless of how i did the
>indexing, what is wrong with using standard b-trees and/or just
>putting the index data in a sql db?

Relevancy ranking would be nice. Going through thirty pages of hits really bites. And stemming does help. Phrase search helps a lot. Counting inter-site links helps with very short queries. Anti-spam algorithms help. Field weights help. Find Similar (query by example) is useful. Indexing Microsoft Word, PDF, PostScript, and XML is handy. And so on.

Finally, please add this commercial product to your list:

Ultraseek Server: http://software.infoseek.com/

wunder
--
Walter R. Underwood
wunder@i...
wunder@b... (home)
http://software.infoseek.com/cce/ (my product)
http://www.best.com/~wunder/
1-408-543-6946

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@i...
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To (un)subscribe, mailto:majordomo@i... the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo@i... the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa@i...)