Re: indexing and querying XML (not XQuery)
At 09:25 -0400 2005-08-23, Alan Gutierrez wrote:
>* Robert Koberg <rob@k...> [2005-08-23 09:06]:
>> Hi,
>>
>> Someone on the Lucene user's list posted a link to this paper:
>>
>> http://www.idealliance.org/papers/xmle02/dx_xmle02/papers/03-02-08/03-02-08.html
>>
>> that talks about indexing and searching XML documents. I have been doing
>...
> Reading through the article, the thing that strikes me is
> that full text search of an XML document depends so much on
> the structure of the document. If that document can be divided
> into chapters, messages, articles, pages, etc, then it's best to
> create a full-text index with application specific documents.

I'm not quite sure what you mean by "depends so much on the structure of the document". Certainly if you want to do searching that makes use of the markup, that depends on the markup. But it seems like you may be thinking something more like: search is so tied to the details of a particular schema that it may be impractical to make a generic search engine. If so, I disagree. There have been search implementations that do a good job with generic XML.

I'm also puzzled by what you mean by "application specific documents", and the part about "dividing" documents up. There are many information management solutions that sadly force you to "chunk" your information at a single level. For example, a client of mine asked me to sit in on meetings with another consulting firm that they had hired to index a lot of their XML information, which was organized hierarchically (as some reasonable % of XML data is, after all). They had just run into the snag that the system they were using (which I'll leave unnamed) could not really operate that way (marketing literature notwithstanding). They were forced to pick one single level (chapter, paragraph, section, or some such), and *only* at that level could you:

    * checkin/checkout
    * search for co-occurrence or proximity of terms ("and", "near")
    ...etc...
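To make the single-level limitation concrete, here is a minimal sketch of the alternative: an index that records the full chain of ancestor elements for every word, so that co-occurrence can be tested at whatever level the searcher picks at query time. This is only an illustration under stated assumptions, not how any particular product works. The sample document, the element names, and the helpers `build_index` and `cooccur` are all hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample document; the element names are illustrative only.
SAMPLE = """
<chapter>
  <section>
    <para>indexing XML is hard</para>
    <para>querying needs structure</para>
  </section>
  <section>
    <para>chunking loses context</para>
  </section>
</chapter>
"""


def build_index(root):
    """Map each lowercased word to the set of ancestor chains it occurs under.

    A chain is a tuple of (tag, node_id) pairs running from the root down to
    the element whose own text directly contains the word.
    """
    index = defaultdict(set)
    next_id = [0]  # simple counter so that each element gets a unique id

    def walk(elem, chain):
        next_id[0] += 1
        chain = chain + ((elem.tag, next_id[0]),)
        # Text directly inside elem: its leading text plus its children's tails.
        pieces = [elem.text or ""] + [child.tail or "" for child in elem]
        for word in " ".join(pieces).lower().split():
            index[word].add(chain)
        for child in elem:
            walk(child, chain)

    walk(root, ())
    return index


def cooccur(index, a, b, level):
    """Return True if words a and b occur inside the same `level` element."""
    def containers(word):
        return {node
                for chain in index.get(word, ())
                for node in chain
                if node[0] == level}
    return bool(containers(a) & containers(b))


INDEX = build_index(ET.fromstring(SAMPLE.strip()))
print(cooccur(INDEX, "indexing", "querying", "para"))     # False
print(cooccur(INDEX, "indexing", "querying", "section"))  # True
print(cooccur(INDEX, "indexing", "chunking", "chapter"))  # True
```

The point of the sketch is that the level ("para", "section", "chapter") is a query parameter rather than a decision baked in when the data was loaded. It makes no attempt at the optimizations a large collection would need.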
If two things ended up in separate "chunks", as far as searching knew they were in separate unrelated documents. If you wanted to be able to sometimes search for terms co-occurring in the same paragraph, and other times in the same section, forget it. Also, the cost to reconstruct a whole document from its "chunks" was high -- though you had to do that every time you wanted a whole document to export, or print, or validate, or....

The other consulting firm was over a barrel because they had written complicated "chunking" code to break the XML into the required chunks, and schema revisions meant they had to rewrite all of that. They really were trying hard (I was nice to them -- they were clearly sweating a lot and realized the problem they had stuck themselves with); the indexing tool they chose hamstrung them in a lot of ways that were very hard to see at the start, but very painful once seen.

I mention this not because that system is unusual but because it *isn't*. There are *many* indexing systems with just this kind of behavior: They deal with exactly *one* level of structure.

The situation is really even worse. Think through *all* of the schemas you're dealing with. Are there footnotes, revision markup, effectivity, hyperlinks...? Most schemas pose at least a few really nasty problems for "chunk-style" indexing.

>
> So, perhaps, the scalable solution is a full-text engine that
> is fed XML documents, and a full-text indexing schema.
>
> The existing schema languages like to atomize documents, while a
> full-text indexing schema might group their elements into
> concepts, like paths, links, articles, and clues for ranking
> articles based on conditions specified in XPath.

This is an interesting notion. Do you mean that existing *XML* schema languages like to atomize, or that existing *indexer* "schemas" do? It sounds like you're saying that XML schemas do, which seems to me incorrect in the sense of "atomize" that matters here.
XML schemas give you not only "atoms," but a huge variety of complex "molecules" and other structures. Many indexers, OTOH, *really* atomize: to the extent they only deal with one kind of structure, despite the diversity of reality.

As it is, most indexers *do* have an "indexing schema," though they don't call it that, and it's hard-wired/unchangeable. It's commonly fairly pathetic:

    document ::= chunk+

It seems to me the problem isn't at the XML schema end. If our data was structured the way many indexers *want* it to be, we could trivially write XML schemas for that and trivially transform our documents into it. But if you really do that, there isn't much structural information left in your documents, and therefore the indexers can't use it to advantage. We did put all that markup in there for a reason, didn't we? I hope....

Indexing systems that took the actual XML schema seriously might do all you need. Are there things an ideal "indexing schema" would include that are not in the XML schema already? If so, that's a *very* interesting topic to pursue, I think. And if so, which of those things really *should* be in the XML to start with? Are they *really* only useful for indexing? I rather doubt it.

I contend that, like formatting information, indexing information should be derivable *from* the XML markup. If an "indexing schema" isn't simply derivable by rule from the existing markup, then the information isn't in the input, right? Or at least, isn't explicit, which is what counts for processing.

>
> I've wanted to explore the use of Lucene in my document object
> model, so I'd like to hear more about this.

There are many indexing solutions out there, many of them quite good for what they do. I looked at Lucene a long time ago and it seemed pretty nice overall, though I've lost track of the details by now. If I remember right, it did have to break things down pretty finely, though it could do some kinds of searches across the chunks.
That approach tends to lead to problems where searches get very complicated. For example, if you want to find X anywhere within elements of type T, you may have to do a big OR to account for all the things that might be in between:

    X in T or
    X in EMPH in T or
    X in P in T or
    X in EMPH in P in T or
    X in fox in socks in T....

Otherwise you simply miss all the cases you didn't mention. Or, there might be a single user search command that does that easily for the user, but expands to the gory "or" inside and gets real slow.

Just be really careful in evaluating whatever engines you look at. It's not extremely hard to build a completely structure-aware indexer (though optimizing one for really huge document collections is harder). But they're still not common, and indexers that weren't built specifically for XML from the beginning often have many surprises awaiting the unwary.

Best wishes,

Steve

PS: The "chunking" or single-level issue was a hot topic in hypertext and information retrieval articles in the late 80's and early 90's, and much of what was written then is, perhaps surprisingly, still timely today. If inclined, check out the Proceedings of the yearly ACM "Hypertext" conferences. Also some of this was discussed during the W3C "QL98" conference in Cambridge (that kicked off the W3C work on querying), available at http://www.w3.org/TandS/QL/QL98/. Of course the first thing you'll want to read from there is my paper... ;)

    http://www.w3.org/TandS/QL/QL98/pp/linkhier.html

And as always, the Cover Pages have a wealth of good info, for example at http://xml.coverpages.org/xmlQuery.html

--
Luthien Consulting: Real solutions to hard information management problems
Specializing in XML, schema design, XSLT, and project design/review/repair
Steven J. DeRose, Ph.D., sderose@a...