searching for search
Regarding the recent "Indexing XML Document Collections" thread...

I've been doing some breadth-first search for indexing/query technology, and here is a summary of what i've learned. I'm posting this because I'm interested in the area but don't have the time to investigate all these, and it seems like there are some real experts on this list.

I'm interested in these questions:

- in general, why would I pick one of these over another (i.e. boolean query vs. structured query; scalability in size or requests; pluggable format drivers for source data; stemming and concept support; etc.)

- in general, what are the features that push a technology into another level of complexity and why (i.e. what is so hard here?)

- specifically, what are the characteristics of each of these in performance/reliability/features (personal experience from non-vendors and public benchmarks are of course preferred, but vendor claims might be of interest too)

- can i safely ignore the non open source ones without giving up capabilities

- if all i wanted to do was boolean search on field values with no stemming/concept support, then regardless of how i did the indexing, what is wrong with using standard b-trees and/or just putting the index data in a sql db?

indexing/query technologies
---------------------------

what: sgrep
url: http://www.cs.helsinki.fi/~jjaakkol/sgrep.html
license: GPL
comment: does structured document grep, with an indexing phase.

what: Xtract
url: http://www.cs.york.ac.uk/fp/Xtract/
license: GPL
comment: another xml grep; more XQL-like. no indexing.

what: swish (Simple Web Indexing System for Humans)
url: http://www.directive.com/swish.htm
license: sort of free
comment: see swish-e

what: swish-e (swish-enhanced)
url: http://sunsite.berkeley.edu/SWISH-E/
license: GPL
comment: focused specifically on web site indexing.

what: MG (managing gigabytes)
url: http://www.mds.rmit.edu.au/mg/intro/about_mg.html
license: GPL
comment: based on book: http://www.cs.mu.oz.au/mg.
  commercial version is SIM: http://www.mds.rmit.edu.au

what: wais and freeWAIS and freewais-sf/SFgate
url: http://www.faqs.org/faqs/wais-faq/freeWAIS-sf/index.html
comment: now supplanted by Isearch/Isite.

what: Isearch
url: http://www.etymon.com/Isearch
license: non-copyleft free.
comment: Isearch is behind dmoz/newhoo (http://www.news.com/News/Item/0,4,28964,00.html?st.cn.News.today.ne)

what: dig or "ht://dig"
url: http://www.htdig.org/
license: GPL

what: glimpse
url: http://glimpse.cs.arizona.edu/
license: non-commercial use, open source.

commercial:
  Readware http://www.readware.com/products.htm
  Excalibur RetrievalWare http://www.excalib.com/
  verity http://www.verity.com
  oracle intermedia http://www.oracle.com
  fulcrum http://www.fulcrum.com (now pcdocs)
  OpenText http://www.opentext.com/ (soon to be pcdocs?)
  SIM: http://www.mds.rmit.edu.au

no cost, but object code only:
  excite for web servers http://www.excite.com/navigate/
  PLS http://www.pls.com/ acquired by AOL.
  GMD-IPSI XQL http://xml.darmstadt.gmd.de/xql/.
  thunderstone http://www.thunderstone.com/. webinator is no cost, object code only.

"XML Servers" (which can mean anything)
  bluestone http://www.bluestone.com/
  odi excelon http://www.odi.com/
  softwareag tamino http://www.softwareag.com/tamino/default.htm
  poet cms http://www.poet.com/
  oracle ifs, dbweb, etc. http://www.oracle.com

query/search languages and standards
-------------------------------------

Z39.50-1995 http://lcweb.loc.gov/z3950/agency
  aka ISO 23950 ; formerly ISO 10162 and ISO 10163.
  basically the U.S. started branching the original ISO standard, and now they lead the ISO standard.
  WAIS was based on the first version Z39.50-1988.
  see also http://www.faqs.org/rfcs/rfc1729.html
  for history see http://mirrored.ukoln.ac.uk/lis-journals/dlib/dlib/dlib/april97/04lynch.html and http://slis6000.slis.uwo.ca/~jxerri/index.html

GILS (government information locator service) http://www.gils.net/locator.html
  for technology, just aggregates other projects (uses Isearch, htdig, etc.).
  at a standards level, it subsets Z39.50 and articulates some 150 specific attributes/elements for semantics, in the "GILS Profile" http://www.gils.net/prof_v2.html
  [there, i've now saved you from reading a horrific amount of verbiage.]

STARTS http://www-db.stanford.edu/~gravano/starts.html
  a standardization effort like GILS. subsets Z39.50.
  complementary (sort of) to publication/metadata/robots.txt standards like dublin/rdf.

SDQL (structured document query language)
  DSSSL thing.
  http://www.jclark.com/dsssl/sgml95/sdql.html, http://www.jclark.com/dsssl/IS/dsssl85.htm

SOIF (Summary Object Interchange Format)
  first made up by Harvest in 1994.

CIP (Common Indexing Protocol)
  output of the moribund ietf FIND working group

XQL and XML-QL and a gazillion more http://www.w3.org/TandS/QL/QL98/pp.html

OQL http://www.odmg.org/standard/odmgbookextract.htm#Chapter 4

Search UI
---------

what: WWWWAIS
url: http://riceinfo.rice.edu/sw/swish/patches/
comment: web interface to WAIS and SWISH search engines

what: webglimpse
url: http://donkey.cs.arizona.edu/webglimpse/
comment: web interface

what: HURL (Hypertext Usenet Reader & Linker)
url: http://impressive.net/software/hurl
license: will be free software.
comment: uses glimpse underneath

Gathering/Spidering
-------------------

what: harvest
url: http://www.tardis.ed.ac.uk/harvest/
comment: just does the spidering; the index is with glimpse
notes::
  verity etc. could be used instead of glimpse.
  does provide a "Broker" cgi around the indexer.
  maps SGML to "SOIF".
::

Papers/Reading on IR
--------------------

ACM SIGIR http://www.acm.org/sigir/
news:comp.infosystems.search

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@i...
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To (un)subscribe, mailto:majordomo@i... the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo@i... the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa@i...)
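
Coming back to the last question near the top of this post (boolean search on field values, no stemming/concept support, index data kept in a sql db): below is a minimal sketch of what that might look like, not a recommendation of any particular design. It assumes Python with its bundled sqlite3 module; the table name "postings", the helper names build_index and term, and the three sample documents are all made up for illustration. SQLite keeps the index as a b-tree, so this is roughly the "standard b-trees" option as well.

```python
import sqlite3

# Hypothetical sample data: doc_id -> {field: [values]}.
DOCS = {
    1: {"author": ["knuth"], "keyword": ["sorting", "searching"]},
    2: {"author": ["witten"], "keyword": ["compression", "indexing"]},
    3: {"author": ["knuth"], "keyword": ["typesetting"]},
}


def build_index(docs):
    """Load (field, value, doc_id) rows into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE postings (field TEXT, value TEXT, doc_id INTEGER)"
    )
    # SQLite stores this index as a b-tree keyed on (field, value, doc_id).
    conn.execute(
        "CREATE INDEX postings_fv ON postings (field, value, doc_id)"
    )
    rows = [
        (field, value.lower(), doc_id)
        for doc_id, fields in docs.items()
        for field, values in fields.items()
        for value in values
    ]
    conn.executemany("INSERT INTO postings VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def term(conn, field, value):
    """Return the set of doc ids whose field holds exactly this value."""
    cur = conn.execute(
        "SELECT doc_id FROM postings WHERE field = ? AND value = ?",
        (field, value.lower()),
    )
    return {row[0] for row in cur}


conn = build_index(DOCS)

# Boolean operators are plain set operations on the per-term results:
# AND is intersection (&), OR is union (|), AND NOT is difference (-).
knuth_not_typesetting = term(conn, "author", "knuth") - term(
    conn, "keyword", "typesetting"
)
witten_or_searching = term(conn, "author", "witten") | term(
    conn, "keyword", "searching"
)

print(sorted(knuth_not_typesetting))
print(sorted(witten_or_searching))
```

The sketch only does exact matches on whole field values. It has no ranking, no stemming, no phrase or proximity search, and no notion of document structure, so it says nothing about how the engines listed above handle those.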