Re: Granularity
On 06/01/12 15:46, Cox, Bruce wrote:
> When developing our Reference Document Management Service, we asked the editors.

The problem we faced when developing the original search interface for the CELT documents (celt.ucc.ie) was identifying whom to satisfy. Now (20 years on), we know more about the user population and their requirements, but it was assumed at the start that returning adequate context would be essential. The original interface is no longer available (I have it on a SunOS 4.1.3 system disk that won't boot, and somewhere on a chain of 50 QIC tapes, so I will retrieve it one day :-)

The documents are transcriptions of early manuscripts, varying from continuous narrative (very long "paragraphs") to annals where an entry may be a TEI <p> element containing two words. Because we were using an SGML search tool (PAT), and the documents are well-marked with numbering systems and milestones, retrieving a fully-formed reference for each hit was, if not trivial, at least straightforward, so we could peg each hit as occurring in entry x at date y in para z and upwards through the chain of folios, pages, sections, etc.

But that still left us with the problem of what context and how much context to display. A large amount of the text was very heavily marked with critical and analytical apparatus, with character data occurring (in extreme cases) up to 11 levels deep -- more if it was in a document embedded inside another, such as a letter quoted in its entirety.

We used a crude dividing line between mixed content and element content: regardless of how deep the hit occurred, identify the closest ancestor which occurred in element content; if there was at least one sibling of the same type which contained character data (no matter how deep), then go no further; otherwise take the parent and try again.
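The original code is not available, so what follows is only a rough reconstruction of that ancestor rule, in Python with the standard xml.etree.ElementTree module rather than PAT and CGI. The function names are made up for illustration. Without a DTD to consult, "element content" is guessed at by checking that an element has children and no non-whitespace text of its own, which is an assumption and not how an SGML tool would decide it.

```python
import xml.etree.ElementTree as ET


def has_element_content(elem):
    """Heuristic (no DTD): child elements only, no text directly inside."""
    if len(elem) == 0:
        return False
    if elem.text and elem.text.strip():
        return False
    return not any(child.tail and child.tail.strip() for child in elem)


def contains_text(elem):
    """True if there is character data anywhere below elem, however deep."""
    return any(chunk.strip() for chunk in elem.itertext())


def context_container(root, hit):
    """Climb from the element holding the hit to the one used for context."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    node = hit
    while True:
        parent = parents.get(node)
        if parent is None:
            return node  # ran out of ancestors: settle for the root
        if has_element_content(parent):
            # node "occurs in element content"; stop if a sibling of the
            # same type carries text, otherwise take the parent and retry.
            same_type = [s for s in parent if s is not node and s.tag == node.tag]
            if any(contains_text(s) for s in same_type):
                return node
        node = parent


doc = ET.fromstring(
    "<div><p>Annal entry <name>Colum <hi>Cille</hi></name> died.</p>"
    "<p>Second entry.</p></div>"
)
print(context_container(doc, doc.find(".//hi")).tag)  # p
```

Here the hit sits two levels down in mixed content, and the climb stops at the enclosing <p> because its parent <div> holds only elements and there is a second <p> with text in it.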
For display, the target content was stripped of markup and the first hit within it measured for its distance in characters from the start of its element-content ancestor container and the distance either side to the nearest sentence boundary (if such a thing was discernible). Ellipses were used to truncate fore and aft if necessary, so that no context of more than (I think) 50 words would appear -- but in measuring this, we *did* trespass across parent boundaries when the hit was very close to the start or end of its element-content ancestor container, because the preceding or following element was regarded as important for the context. Extra conditions were applied when the hit was in an embedded document as mentioned above, so that it could be seen to be such; and for the occasional very small single-paragraph document (usually manuscript fragments).

This seemed to work, and allowed scholars (the primary audience) to find the words they were looking for and easily discard those hits which weren't relevant for their purpose. It was also pretty slow, being coded in early CGI script form. PAT provided sub-second retrieval, but the subsequent poking around really chewed up the time.

It failed miserably when it became clear that a large number of accesses were coming from Irish Americans (and others) searching for their family names, not realising that they would not occur in a recognisable form in 8th century Latin or Irish (and sometimes turning up words whose spelling was liable to be misconstrued if taken out of context :-)

It was abandoned when we realised that the actual goal of the scholars was to identify the documents they wanted, and then download them for local use or just read them in their entirety in their browser. A lot of users did like seeing exactly where a hit occurred: it gave them confidence that the system was doing something meaningful and sensible; but the net result was always finding the right documents and reading or downloading them.
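For anyone wanting to try the display step described above (strip the markup, find the sentence around the hit, truncate with ellipses), here is a minimal Python sketch. It is not the original CGI code: the function name, the regular expressions, and the even split of the word budget either side of the hit are all assumptions, and the 50-word default is only the figure half-remembered above. It does not attempt the trespass across parent boundaries or the special handling of embedded documents.

```python
import re

# Crude sentence boundary: ., ! or ? followed by whitespace or end of text.
# Abbreviations ("St. Columba") will fool it.
SENTENCE_END = re.compile(r"[.!?](?=\s|$)")
WORD = re.compile(r"\S+")


def snippet(text, start, end, max_words=50):
    """Return the sentence around text[start:end], cut to max_words words."""
    # Nearest sentence boundary on either side of the hit.
    left = max((m.end() for m in SENTENCE_END.finditer(text, 0, start)), default=0)
    m = SENTENCE_END.search(text, end)
    right = m.end() if m else len(text)

    # Words of that sentence, and which of them overlap the hit.
    words = list(WORD.finditer(text, left, right))
    inside = [i for i, w in enumerate(words) if w.end() > start and w.start() < end]
    if not inside:
        return ""
    first, last = inside[0], inside[-1]

    # Spend the remaining word budget before and after the hit; whatever
    # cannot be used before it is handed to the words after it.
    budget = max(max_words - (last - first + 1), 0)
    lo = max(first - budget // 2, 0)
    hi = min(last + 1 + budget - (first - lo), len(words))

    out = text[words[lo].start():words[hi - 1].end()]
    return ("... " if lo > 0 else "") + out + (" ..." if hi < len(words) else "")


text = "First sentence here. The abbot Colum Cille died in Iona that year. Another one."
pos = text.index("Cille")
print(snippet(text, pos, pos + 5))
print(snippet(text, pos, pos + 5, max_words=3))
```

The first call returns the whole sentence containing the hit; the second shows the truncation, with ellipses fore and aft.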
We could have saved ourselves a lot of time by using grep on the stripped text :-) but, hey, it was a learning curve.

Moral: identify the use cases first :-)

///Peter