searching for search

From: "Mark D. Anderson" <mda@d...>
To: <xml-dev@i...>
Date: Sun, 23 May 1999 13:00:42 -0700

Regarding the recent "Indexing XML Document Collections" thread...

I've been doing a breadth-first survey of indexing/query
technology, and here is a summary of what I've learned.
I'm posting this because I'm interested in the area but don't
have the time to investigate all of these, and it seems like
there are some real experts on this list.

I'm interested in these questions:

- in general, why would I pick one of these over another?
(e.g. boolean query vs. structured query; scalability in size
or requests; pluggable format drivers for source data;
stemming and concept support; etc.)

- in general, what are the features that push a technology
into another level of complexity and why (i.e. what is so
hard here?)

- specifically, what are the characteristics of each of
these in performance/reliability/features (personal experience
from non-vendors and public benchmarks are of course preferred,
but vendor claims might be of interest too)

- can i safely ignore the non-open-source ones without giving
up capabilities?

- if all i wanted to do was boolean search on field values with
no stemming/concept support, then regardless of how i did the
indexing, what is wrong with using standard b-trees and/or just
putting the index data in a sql db?
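
To make that last question concrete, here is a minimal sketch of what "index data in a sql db" could look like: one postings table of (field, term, doc_id) rows, with an ordinary index that SQLite stores as a B-tree. Everything here is an assumption for illustration (the table layout, the function names, whitespace tokenizing, the sample documents); none of it is taken from the tools listed below.

```python
import sqlite3

# Hypothetical sample data: doc_id -> {field: text}.
SAMPLE_DOCS = {
    "a": {"title": "XML indexing", "author": "smith"},
    "b": {"title": "XML query languages", "author": "jones"},
    "c": {"title": "SGML query", "author": "smith"},
}

def build_index(docs):
    """Load one (field, term, doc_id) row per distinct term into SQLite."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE postings (field TEXT, term TEXT, doc_id TEXT)")
    # SQLite implements this index as a B-tree keyed on (field, term, doc_id).
    conn.execute("CREATE INDEX postings_idx ON postings (field, term, doc_id)")
    rows = set()
    for doc_id, fields in docs.items():
        for field, text in fields.items():
            # Naive tokenizer: lowercase and split on whitespace, no stemming.
            for term in text.lower().split():
                rows.add((field, term, doc_id))
    conn.executemany("INSERT INTO postings VALUES (?, ?, ?)", sorted(rows))
    return conn

def lookup(conn, field, term):
    """Return the set of doc_ids whose field contains the exact term."""
    cur = conn.execute(
        "SELECT doc_id FROM postings WHERE field = ? AND term = ?",
        (field, term.lower()),
    )
    return {row[0] for row in cur}

def search_and(conn, *pairs):
    """Docs matching every (field, term) pair."""
    sets = [lookup(conn, f, t) for f, t in pairs]
    return set.intersection(*sets) if sets else set()

def search_or(conn, *pairs):
    """Docs matching at least one (field, term) pair."""
    return set().union(*(lookup(conn, f, t) for f, t in pairs))

def search_and_not(conn, include, exclude):
    """Docs matching the include pair but not the exclude pair."""
    return lookup(conn, *include) - lookup(conn, *exclude)
```

With the sample data, search_and(conn, ("title", "xml"), ("author", "smith")) picks out document "a". What a sketch like this leaves out is probably where the dedicated engines earn their keep: ranking, phrase and proximity matching, stemming, and compact storage of long postings lists.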

indexing/query technologies
---------------------------
what: sgrep
url: http://www.cs.helsinki.fi/~jjaakkol/sgrep.html
license: GPL
comment: does structured document grep, with an indexing phase.

what: Xtract
url: http://www.cs.york.ac.uk/fp/Xtract/
license: GPL
comment: another xml grep; more XQL-like. no indexing.

what: swish (Simple Web Indexing System for Humans)
url: http://www.directive.com/swish.htm
license: sort of free
comment: see swish-e

what: swish-e (swish-enhanced)
url: http://sunsite.berkeley.edu/SWISH-E/
license: GPL
comment: focused specifically on web site indexing.

what: MG (managing gigabytes)
url: http://www.mds.rmit.edu.au/mg/intro/about_mg.html
license: GPL
comment: based on book: http://www.cs.mu.oz.au/mg. commercial version is SIM: http://www.mds.rmit.edu.au

what: wais and freeWAIS and freewais-sf/SFgate
url: http://www.faqs.org/faqs/wais-faq/freeWAIS-sf/index.html
comment: now supplanted by Isearch/Isite.

what: Isearch
url: http://www.etymon.com/Isearch
license: non-copyleft free.
comment: Isearch is behind dmoz/newhoo (http://www.news.com/News/Item/0,4,28964,00.html?st.cn.News.today.ne)

what: dig or "ht://dig"
url: http://www.htdig.org/
license: GPL

what: glimpse
url: http://glimpse.cs.arizona.edu/
license: non-commercial use, open source.

commercial:
Readware http://www.readware.com/products.htm
Excalibur RetrievalWare http://www.excalib.com/
verity http://www.verity.com
oracle intermedia http://www.oracle.com
fulcrum http://www.fulcrum.com (now pcdocs)
OpenText http://www.opentext.com/ (soon to be pcdocs?)
SIM: http://www.mds.rmit.edu.au

no cost, but object code only:
excite for web servers http://www.excite.com/navigate/
PLS http://www.pls.com/ (acquired by AOL)
GMD-IPSI XQL http://xml.darmstadt.gmd.de/xql/
thunderstone http://www.thunderstone.com/ (webinator is no cost, object code only)

"XML Servers" (which can mean anything)
bluestone http://www.bluestone.com/
odi excelon http://www.odi.com/
softwareag tamino http://www.softwareag.com/tamino/default.htm
poet cms http://www.poet.com/
oracle ifs, dbweb, etc. http://www.oracle.com

query/search languages and standards
-------------------------------------

Z39.50-1995 http://lcweb.loc.gov/z3950/agency
aka ISO 23950; formerly ISO 10162 and ISO 10163.
basically the U.S. started branching the original ISO standard, and now they lead the ISO standard.
WAIS was based on the first version Z39.50-1988.
see also http://www.faqs.org/rfcs/rfc1729.html
for history see http://mirrored.ukoln.ac.uk/lis-journals/dlib/dlib/dlib/april97/04lynch.html and
http://slis6000.slis.uwo.ca/~jxerri/index.html

GILS (government information locator service) http://www.gils.net/locator.html
for technology, just aggregates other projects (uses Isearch, htdig, etc.).
at a standards level, it subsets Z39.50 and articulates some 150 specific attributes/elements for semantics,
in the "GILS Profile" http://www.gils.net/prof_v2.html
[there, i've now saved you from reading a horrific amount of verbiage.]

STARTS http://www-db.stanford.edu/~gravano/starts.html
a standardization effort like GILS. subsets Z39.50.
complementary (sort of) to publication/metadata/robots.txt standards like dublin/rdf.

SDQL (structured document query language)
DSSSL thing. http://www.jclark.com/dsssl/sgml95/sdql.html, http://www.jclark.com/dsssl/IS/dsssl85.htm

SOIF (Summary Object Interchange Format)
first made up by Harvest in 1994.
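
For a feel of what a SOIF record looks like, here is a small sketch that writes one. The layout is as I recall it from the Harvest documentation: a "@TYPE { url" header line, then one "Name{byte-count}:<TAB>value" line per attribute, then a closing brace. Treat that layout as an assumption to check against the Harvest manual; the function name to_soif and the sample attribute are made up for illustration.

```python
def to_soif(url, attrs, template="FILE"):
    """Serialize a dict of attributes as one SOIF-style record (layout assumed)."""
    lines = ["@%s { %s" % (template, url)]
    for name, value in attrs.items():
        # The number in braces is the size of the value in bytes, which lets
        # a reader consume values containing newlines without escaping.
        size = len(value.encode("utf-8"))
        lines.append("%s{%d}:\t%s" % (name, size, value))
    lines.append("}")
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    print(to_soif("http://example.org/a.sgml", {"Title": "searching for search"}))
```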

CIP (Common Indexing Protocol)
output of the moribund ietf FIND working group

XQL and XML-QL and a gazillion more http://www.w3.org/TandS/QL/QL98/pp.html

OQL http://www.odmg.org/standard/odmgbookextract.htm#Chapter 4

Search UI
---------
what: WWWWAIS
url: http://riceinfo.rice.edu/sw/swish/patches/
comment: web interface to WAIS and SWISH search engines

what: webglimpse
url: http://donkey.cs.arizona.edu/webglimpse/
comment: web interface

what: HURL (Hypertext Usenet Reader & Linker)
url: http://impressive.net/software/hurl
license: will be free software.
comment: uses glimpse underneath

Gathering/Spidering
-------------------
what: harvest
url: http://www.tardis.ed.ac.uk/harvest/
comment: just does the spidering; the index is with glimpse
notes::
verity etc. could be used instead of glimpse.
does provide a "Broker" cgi around the indexer.
maps SGML to "SOIF".
::

Papers/Reading on IR
--------------------
ACM SIGIR http://www.acm.org/sigir/

news:comp.infosystems.search

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@i...
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To (un)subscribe, mailto:majordomo@i... the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo@i... the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa@i...)
