Re: SAX, DOM, and Search Engines (was Re: xml parser)
At 05:32 PM 11/4/98 -0500, david@m... wrote:
>Tim Bray writes:
> > I disagree. Few parsers track byte offsets or other locational info in
> > the file, and I think you need that to do basic things like proximity
> > and phrase search.
>
>I disagree. While byte offsets might be useful for other purposes,
>they would be inappropriate for proximity and phrase searches -- for
>those, you need to track the relative positions of words, not their
>absolute positions. Consider the following example:
>
>  <p>WORD1 &x; WORD2</p>
>
>Is WORD1 close to WORD2?

Clearly, the proximity tests have to work in terms of proximity in the cooked, not raw, text. Lark carefully tracks offsets in terms of the entity stack so you can do this. But that's so obvious I don't think it's your point.

Secondly, for proximity, you're worried about counting characters, not bytes, but for addressing back into the entity, you're worried about byte, not character, offsets. So it's even harder than it looks. Unless of course you're using UTF-16 and staying in the BMP - which might be a REAL good idea in an IR-oriented system anyhow.

> It's only five bytes away (assuming an 8-bit
>encoding), but might be separated by 20,000 words, depending on what
>&x; expands to. SAX and the DOM do give you enough information to
>determine the relative positions of words.

[warning: simple argument with long embedded digression]

I don't think so. How about languages, such as those spoken by the majority of the world's inhabitants, that do not separate words with spaces? (Identifying word breaks in running Japanese or Chinese text is essentially a strong-AI problem. You can get decent results by running a dictionary and searching at each character break for a match, with morphological heuristics, but it turns out that in those languages there is sufficient encoding redundancy that you get pretty good results (at a cost of some space wastage) just treating most characters as words - and lurking in that fact there's a PhD in linguistics for someone - but I digress, I spent a long time in those particular mines).

But spotting "words" may not matter. In fact, I am not aware of any research that shows word proximity to be a better information retrieval heuristic than character proximity. And it's much easier to nail down what you mean by "character" than "word", and thus get deterministic cross-language behavior.

>Byte offsets would be helpful for displaying context around a match,
>but there would be no 100% reliable way to format that context without
>starting from the top of the document

unless you used the whizzy new soon-to-arrive W3C fragment packager, right? Actually, if you have an index that can understand the structure well enough to support xpointer-flavor querying, the engine is going to know all the context info, so this should actually work pretty well (but only if you know the byte/character offsets). And the right way to display results in context depends on whether you're sampling, or visiting a match.

OK, you've been warned... if you get me going on the problems of searching in tagged internationalized text, bring a windbreaker - you'll need it.

-Tim

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@i...
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To (un)subscribe, mailto:majordomo@i... the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo@i... the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa@i...)
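The two distinctions in the message above (raw versus cooked text, and character versus byte offsets) can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration: the replacement text of the entity `x`, the sample string, and the helper names `raw_gap`, `cooked_word_gap` and `byte_offset` are all hypothetical, and it uses Python's standard `xml.etree.ElementTree` parser, not Lark, SAX or the DOM.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the replacement text of &x; is made up for
# illustration; the message does not say what &x; expands to.
RAW = '<!DOCTYPE p [<!ENTITY x "alpha beta gamma">]><p>WORD1 &x; WORD2</p>'


def raw_gap(raw: str, first: str, second: str) -> int:
    """Characters between the end of `first` and the start of `second`
    in the unparsed (raw) source."""
    return raw.index(second) - (raw.index(first) + len(first))


def cooked_word_gap(raw: str, first: str, second: str) -> int:
    """Whitespace-delimited words between `first` and `second`
    after the parser has expanded entities (the cooked text)."""
    cooked = ET.fromstring(raw).text  # internal entity &x; is expanded here
    words = cooked.split()
    return words.index(second) - words.index(first) - 1


def byte_offset(text: str, char_index: int, encoding: str) -> int:
    """Byte offset of the character at `char_index` once `text` is encoded."""
    return len(text[:char_index].encode(encoding))


# Assumed sample text; every character in it lies in the BMP.
SAMPLE = "日本語 XML 検索"
CHAR_POS = SAMPLE.index("検")  # offset counted in characters

print(raw_gap(RAW, "WORD1", "WORD2"))          # gap in the raw source
print(cooked_word_gap(RAW, "WORD1", "WORD2"))  # gap in the cooked text
print(CHAR_POS,
      byte_offset(SAMPLE, CHAR_POS, "utf-8"),
      byte_offset(SAMPLE, CHAR_POS, "utf-16-le"))
```

In the raw source the two words are five characters apart, matching the "five bytes" in the quoted text for an 8-bit encoding, while the cooked gap depends entirely on what the entity expands to. For the sample string, the UTF-8 byte offset differs from the character offset, whereas the UTF-16 byte offset is exactly twice the character offset as long as the text stays in the BMP.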