Re: searching for search
On Sun, May 23, 1999 at 11:31:43PM +0200, Edward C. Zimmermann wrote:
> > - in general, what are the features that push a technology
> > into another level of complexity and why (i.e. what is so
> > hard here?)
> Designing fulltext engines is not difficult :-)

Making them updateable, small and fast is. Full structure indexing, in particular, is easy in principle but quite awkward to make efficient and small in a dynamic update environment. Phrase querying is another feature that has to be done carefully to simultaneously offer speed while avoiding space blowouts.

It is quite common for engines to generate indexes that are two to ten times the size of the data. Furthermore, this is generally considered acceptable (SIM rarely goes over half the size of the data -- we haven't implemented full structure or phrase yet, but have done some research to establish the cost; structure indexes should be minimal and efficient phrase indexes should roughly double our current index size).

Here is a list of features that have the potential to push complexity (the list is neither comprehensive, nor in any particular order):

  * Size minimisation
  * Large collections (e.g. exceeding 2 or 4 GB can pose unique problems that are non-trivial to solve)
  * Performance
  * Interactive updates
  * Full-structure querying
  * Phrase querying
  * Transactions
  * Incremental backups (important for large collections)
  * Multi-database queries
  * Multi-database ranked/sorted queries
  * Multi-server queries
  * Multi-server ranked/sorted queries
  * Multi-server multi-vendor queries
  * Multi-server multi-vendor ranked queries

(I list the various multi-database options separately because each of them introduces new and quite different issues, though some of the issues may only arise in the context of Z39.50, with which we deal.)
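To make the phrase-querying trade-off above concrete, here is a minimal sketch of one common textbook approach, a positional inverted index. It is an illustration only and not a description of how SIM or any other engine mentioned here works; the function names and the sample documents are hypothetical. Storing one entry per word occurrence, rather than one per (term, document) pair, is what lets adjacency be checked, and it is also where the extra index space comes from.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {doc_id: [positions]}, one position per occurrence."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def phrase_query(index, phrase):
    """Return sorted doc ids in which the terms of `phrase` occur consecutively."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    # Only documents that contain every term can match.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = []
    for doc_id in candidates:
        later = [set(index[t][doc_id]) for t in terms[1:]]
        # A match starts at position p if the k-th following term sits at p + k + 1.
        if any(all(p + k + 1 in s for k, s in enumerate(later))
               for p in index[terms[0]][doc_id]):
            hits.append(doc_id)
    return sorted(hits)


def posting_counts(index):
    """Return (document-level postings, positional postings) as a rough size proxy."""
    doc_level = sum(len(postings) for postings in index.values())
    positional = sum(len(p) for postings in index.values() for p in postings.values())
    return doc_level, positional


# Hypothetical sample data, for illustration only.
docs = {
    1: "full text engines are easy to design",
    2: "text engines that are full of text are hard to update",
    3: "design engines full text",
}
index = build_positional_index(docs)
```

Calling `phrase_query(index, "full text")` on this sample returns documents 1 and 3 but not 2, even though document 2 contains both words. `posting_counts(index)` gives a crude feel for the space cost: the positional count grows with the total number of word occurrences, while the document-level count grows only with distinct terms per document.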
> > - specifically, what are the characteristics of each of
> > these in performance/reliability/features (personal experience
> > from non-vendors and public benchmarks are of course preferred,
> > but vendor claims might be of interest too)
> >
> > - can i safely ignore the non open source ones without giving
> > up capabilities
> What do you mean (I seem to be on a roll at not understanding
> questions these days)?

Possibly the triple negative (_ignore_, _non_, _without_) contributed in this case. :-)

> > - if all i wanted to do was boolean search on field values with
> > no stemming/concept support, then regardless of how i did the
> > indexing, what is wrong with using standard b-trees and/or just
> > putting the index data in a sql db?
> To make the answer short: depends upon what you want to do.

A slightly longer answer is, if you have 100GB of data that you want to index in an SQL database then you'd better grab a terabyte of hard disk and be prepared to wait a LONG time for your queries to come back to you.

Cheers,

Marcelo

--
http://www.simdb.com/~marcelo/

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@i...
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To (un)subscribe, mailto:majordomo@i... the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo@i... the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa@i...)