Re: Assisted Search of XML document collections
On Sat, 22 May 1999, Edward C. Zimmermann wrote:

> > On Sat, 22 May 1999, Edward C. Zimmermann wrote:
> > And I (Arved) had written:
> > That is the stage we are at. I have this gut feeling that we need to
> > define what it means to have a search engine operate on let's say 100,000
> > documents marked up using XML, and what are the situations where it might
> > make more sense to search a file which describes that collection.
> 100K documents is not a problem. Even on consumer PC hardware a modestly performant
> fulltext engine can handle typical queries on such a small collection in fractions
> of a second. The problem is more (beyond quantity) that information resources
> (XML, HTML or whatever) are not always static but dynamic. That's, above all, one
> of the fundamental flaws in the brute-force spider/crawl approaches followed by
> the major "Internet Engines" (beyond the impact on bandwidth, the half-life of
> data, and all the other significant shortcomings).
>

I don't think I'd quite agree that 100K documents is not a problem. Full-text
searches using maybe Boolean expressions, yes, that's fast. But querying based
on knowledge of the XML structure, i.e. something like the Perl XML::XQL
syntax, I'm sorry, I just don't see that kind of query as shrugging off 100K
documents.

Part of what I'm trying to do is define when an indexing scheme might be
appropriate. I'm leaning towards static or slowly varying collections. One's
definition of either would depend on factors such as how long it takes to
index, whether already-existing documents also change, whether the indexing
structure allows new documents to be added incrementally, and so on.

I'm not so sure that indexing, at least as I envisage it, is going to handle
millions of changing documents on the Web, for example.

> >
> > Your best contribution would be to describe a business problem and tell us
> > how you like to solve it.
> Different problems, different methods, different tools.
>
> Let's turn the tables, since I'm the confused soul, can you explain a business
> problem and tell us how you might plan to "solve it"....
>

Sure. We put in a tender to supply document management to the local provincial
natural resources people, specifically the survey and mapping types. We looked
at perhaps 500K to 1M documents, of which (if I recall aright) perhaps 75% were
very amenable to being scanned in and zone-OCR'ed, and had enough structure to
make them very suitable XML candidates. Maybe 5 DTDs could have described that
75%.

You understand that I'm describing a tender a few years old, and that XML
wasn't on *anybody's* mind at the time. I'm thinking of it now, though, as a
situation where XML would be really appropriate for allowing the kinds of
searches these guys wanted to do. Plus they wanted to eventually make much of
this info available via the Web, or run off paper copies; again, XML markup
seems just right, with conversion into other formats as required.

OK, as to searching and indexing. Of the "searchable" documents I describe, all
were static - they were *records*. Probably the number of similar documents
added in a given year would be 3-5% of the existing archive. So an index would
be a very manageable thing, and would rarely change.

So, you understand, my viewpoint is record-centric. That's why I'm asking for
input.

Arved

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@i...
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To (un)subscribe, mailto:majordomo@i... the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo@i... the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa@i...)
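
To make the record-centric indexing idea in the message above a little more concrete, here is a minimal sketch in Python of an index keyed on element path plus word, to which new records can be added without touching existing entries. It is not the scheme anyone in the thread proposed or built; the class name `PathIndex`, the element names (`survey`, `parish`, `year`) and the document ids are all hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


class PathIndex:
    """Toy inverted index keyed by (element path, word). Hypothetical sketch."""

    def __init__(self):
        # (path, word) -> set of ids of documents containing that word there
        self.postings = defaultdict(set)

    def add(self, doc_id, xml_text):
        """Index one document; entries for earlier documents are untouched."""
        root = ET.fromstring(xml_text)
        self._walk(root, "/" + root.tag, doc_id)

    def _walk(self, elem, path, doc_id):
        # Only the text directly inside the element is indexed; attributes
        # and tail text are ignored to keep the sketch short.
        for word in (elem.text or "").lower().split():
            self.postings[(path, word)].add(doc_id)
        for child in elem:
            self._walk(child, path + "/" + child.tag, doc_id)

    def search(self, path, word):
        """Return the sorted ids of documents with `word` at `path`."""
        return sorted(self.postings.get((path, word.lower()), set()))


# Hypothetical survey records; element names and ids are made up.
index = PathIndex()
index.add("plan-001", "<survey><parish>Kings</parish><year>1952</year></survey>")
index.add("plan-002", "<survey><parish>Queens</parish><year>1952</year></survey>")

print(index.search("/survey/year", "1952"))     # ['plan-001', 'plan-002']
print(index.search("/survey/parish", "kings"))  # ['plan-001']
```

A structural query here is a dictionary lookup rather than a scan of every document, and adding the few percent of new records per year only appends to the postings. Handling documents that change or disappear would need extra bookkeeping that this sketch does not attempt.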