RE: broken text surrounding an entity I want to drop?
Thanks Mike and Joris for your comments.

How much text? If I run a text-only script over all the files I end up with something of the order of 20Mb. Manual fixes are not an attractive idea (at least not yet).

On balance, it seems to me that the frequency of

--- Text<a></a> More text ---

is relatively low (maybe 5-10%) compared with

--- Text mo<a></a> re text ---

so (accepting that a manual pass through is going to be necessary at some point) I would rather attempt to automate the treatment of the commonest case.

We are still at a proof of concept stage, and broken words in every other sentence don't look good! If we can reduce that to a few words here and there we'll be much happier.

Thanks
Trevor

> -----Original Message-----
> From: "Michael Kay" <mike@xxxxxxxxxxxx>
> Sent: Tue, 13 Sep 2005 08:58:09
> Subject: RE: broken text surrounding an entity I want to drop?
>
> It helps to get the terminology right (it means people are more likely to understand your question). You're using the terms "entity" and "tag" when you mean "element".
>
> You're dealing with dirty data, and data cleansing is always a rather pragmatic affair. I don't think there's enough information in your source to decide whether, in a case like
>
> There is too much white<A></A> space in this document
>
> the author intended "whitespace" to be one word or two. The only way you're going to be able to automate the data recovery is with the help of a dictionary lookup, and even that will leave some ambiguities like the one above.
>
> How long is the text? My instinct would be to fix it by hand.
>
> Michael Kay
> http://www.saxonica.com/
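
For illustration only, here is a minimal sketch of the dictionary-lookup idea discussed above, written in Python rather than XSLT so that it is easy to try. Everything in it is an assumption rather than something taken from the thread: the `WORDS` set stands in for a real word list, `rejoin` is a hypothetical helper name, and the rule "join only when the joined form is a known word and the two halves are not both known words" is just one possible heuristic. It treats the markup as plain text, so it would most likely run as a pre-processing pass before any XSLT.

```python
import re

# Hypothetical stand-in for a real dictionary; a real run would load a full word list.
WORDS = {
    "text", "more", "there", "is", "too", "much",
    "white", "space", "whitespace", "in", "this", "document",
}

# A word fragment, an empty <a></a> element (any case, optional attributes),
# some whitespace, then another word fragment.
BROKEN = re.compile(r"([A-Za-z]+)<a\b[^>]*>\s*</a>(\s+)([A-Za-z]+)", re.IGNORECASE)


def rejoin(text, words=WORDS):
    """Drop empty <a></a> elements and rejoin words that were split around them."""

    def fix(match):
        left, gap, right = match.group(1), match.group(2), match.group(3)
        joined = left + right
        both_known = left.lower() in words and right.lower() in words
        # Join only when the joined form is a known word and the halves are not
        # both words in their own right (the "white space" ambiguity stays as two words).
        if joined.lower() in words and not both_known:
            return joined
        return left + gap + right

    return BROKEN.sub(fix, text)


print(rejoin("Text mo<a></a> re text"))
print(rejoin("Text<a></a> More text"))
print(rejoin("There is too much white<A></A> space in this document"))
```

Under this heuristic the common case ("mo" + "re") is rejoined, the rarer case ("Text" + "More") keeps its space, and the ambiguous "white space" case is left as two words for the later manual pass. Empty elements that are not followed by whitespace and a word are not touched by this sketch.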