
RE: broken text surrounding an entity I want to drop?

Subject: RE: broken text surrounding an entity I want to drop?
From: "Trevor Nicholls" <trevor@xxxxxxxxxxxxxxxxxx>
Date: Wed, 14 Sep 2005 18:47:52 +1200
Thanks Mike and Joris for your comments.

How much text? If I run a text-only script over all the files I end up with
something of the order of 20 MB. Manual fixes are not an attractive idea (at
least not yet).

On balance, it seems to me that the frequency of

---
Text<a></a>
More text
---

is relatively low (maybe 5-10%) compared with

---
Text mo<a></a>
re text
---

so (accepting that a manual pass is going to be necessary at some point) I
would rather attempt to automate the treatment of the commonest case. We are
still at a proof-of-concept stage, and broken words in every other sentence
don't look good! If we can reduce that to a few words here and there we'll
be much happier.
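
As a rough sketch of automating the commonest case (not something taken from
the thread itself), the Python below deletes each empty anchor element together
with the line break that follows it, so the two fragments are joined. It
assumes the element appears literally as an empty `<a ...></a>` pair at the end
of a line. The names `EMPTY_ANCHOR_BREAK` and `join_broken_words` are made up
for illustration, and a regular expression over raw markup is a shortcut, not a
substitute for proper XML processing.

```python
import re

# An empty <a></a> element (any case, optional attributes) followed by
# optional trailing spaces/tabs and one line break.
EMPTY_ANCHOR_BREAK = re.compile(r"<a\b[^>]*>\s*</a>[ \t]*\r?\n", re.IGNORECASE)


def join_broken_words(text: str) -> str:
    """Drop each empty anchor and the line break after it, joining the halves."""
    return EMPTY_ANCHOR_BREAK.sub("", text)


if __name__ == "__main__":
    print(join_broken_words("Text mo<a></a>\nre text\n"))
```

On the second sample above this yields "Text more text". On the first sample it
wrongly yields "TextMore text", which is the 5-10% of cases that would still
need a manual pass.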

Thanks
Trevor

-----Original Message-----
From: "Michael Kay" <mike@xxxxxxxxxxxx>
Sent: Tue, 13 Sep 2005 08:58:09
Subject: RE: broken text surrounding an entity I want to drop?

It helps to get the terminology right (it means people are more likely to
understand your question). You're using the terms "entity" and "tag" when
you mean "element".

You're dealing with dirty data, and data cleansing is always a rather
pragmatic affair. I don't think there's enough information in your source to
decide whether, in a case like

There is too much white<A></A>
space in this document

the author intended "whitespace" to be one word or two.

The only way you're going to be able to automate the data recovery is with
the help of a dictionary lookup, and even that will leave some ambiguities
like the one above.
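
One way such a dictionary lookup might be sketched, purely as an illustration:
join the two fragments when only the joined form is a known word, separate them
with a space when only the two parts are known words, and flag everything else
for a human. The function name `repair`, the `{{CHECK:...}}` marker and the toy
word set are all hypothetical; a real run would need a proper word list, and
short fragments such as "re" may well appear in one.

```python
import re

# A word fragment, an empty <a></a> element, a line break, then another fragment.
ANCHOR_BREAK = re.compile(
    r"(\w+)<a\b[^>]*>\s*</a>[ \t]*\r?\n(\w+)", re.IGNORECASE
)


def repair(text: str, dictionary: set) -> str:
    """Rejoin or separate fragments around an empty anchor using a word list."""

    def fix(match):
        left, right = match.group(1), match.group(2)
        joined_ok = (left + right).lower() in dictionary
        parts_ok = left.lower() in dictionary and right.lower() in dictionary
        if joined_ok and not parts_ok:
            return left + right            # e.g. "mo" + "re" -> "more"
        if parts_ok and not joined_ok:
            return left + " " + right      # two separate words
        # Both readings (or neither) are plausible: leave a marker for review.
        return "{{CHECK:" + left + "|" + right + "}}"

    return ANCHOR_BREAK.sub(fix, text)


if __name__ == "__main__":
    words = {"text", "more", "white", "space", "whitespace"}  # toy word list
    print(repair("Text mo<a></a>\nre text", words))
    print(repair("too much white<A></A>\nspace in this document", words))
```

With the toy word list, the "white" / "space" example cannot be decided either
way and comes out as a `{{CHECK:white|space}}` marker, which is the residual
ambiguity described above.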

How long is the text? My instinct would be to fix it by hand.

Michael Kay
http://www.saxonica.com/
