RE: broken text surrounding an entity I want to drop?

Subject: RE: broken text surrounding an entity I want to drop?
From: "Michael Kay" <mike@xxxxxxxxxxxx>
Date: Tue, 13 Sep 2005 08:58:09 +0100

It helps to get the terminology right (it means people are more likely to
understand your question). You're using the terms "entity" and "tag" when
you mean "element".

You're dealing with dirty data, and data cleansing is always a rather
pragmatic affair. I don't think there's enough information in your source to
decide whether, in a case like

There is too much white<A></A>
space in this document

the author intended "whitespace" to be one word or two.

The only way you're going to be able to automate the data recovery is with
the help of a dictionary lookup, and even that will leave some ambiguities
like the one above.
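
To make the ambiguity concrete, here is a minimal XSLT 1.0 sketch of the purely mechanical rule, with no dictionary lookup. It is not a recommendation, only an illustration under some assumptions that are mine, not the original poster's: an A element counts as droppable only when it has no child nodes at all, and the break to repair is a single newline directly after the dropped A, where the text directly before the A ends in a non-whitespace character.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copy everything not matched more specifically. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Drop A elements that have no child nodes (assumption: these are
       the only ones that should go). -->
  <xsl:template match="A[not(node())]"/>

  <!-- Text node whose nearest preceding sibling is an empty A. -->
  <xsl:template
      match="text()[preceding-sibling::node()[1][self::A][not(node())]]">
    <!-- The node just before that A, if it is a text node. -->
    <xsl:variable name="before"
        select="preceding-sibling::node()[2][self::text()]"/>
    <!-- Last character of that preceding text (empty if there is none). -->
    <xsl:variable name="last"
        select="substring($before, string-length($before))"/>
    <xsl:choose>
      <!-- Preceding text ends mid-word and this text starts with a
           newline: drop the newline so the two halves join. -->
      <xsl:when test="$before and normalize-space($last) != ''
                      and starts-with(., '&#10;')">
        <xsl:value-of select="substring(., 2)"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="."/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>
```

Applied to the Body example from the original post, this joins "To delete a n" and "ode:" into "To delete a node:". Applied to the example above, it always produces "whitespace", and it would equally turn "too much<A></A>" followed by "space" on the next line into "muchspace". Deciding between those cases is exactly the part that needs a dictionary or a human.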

How long is the text? My instinct would be to fix it by hand.

Michael Kay
http://www.saxonica.com/

> -----Original Message-----
> From: Trevor Nicholls [mailto:trevor@xxxxxxxxxxxxxxxxxx]
> Sent: 13 September 2005 03:51
> To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
> Subject:  broken text surrounding an entity I want to drop?
>
> Hi
>
> My source XML file contains a myriad of <A id=something></A>
> entities which
> for the most part I wish to drop. I am using an identity
> template plus the
> following to do this:
>
> -----
> <!-- drop A tags which have no content -->
> <xsl:template match="A">
> <xsl:if test="* or text() or string(.)">
> <xsl:copy>
> <xsl:apply-templates select="@*|node()"/>
> </xsl:copy>
> </xsl:if>
> </xsl:template>
> -----
>
> Unfortunately and unsurprisingly this is too naive; it drops
> the <A> tags OK
> but leaves me with broken text. If I look at the input source
> in a text
> editor I can see that these tags are placed (arbitrarily so
> far as I can
> tell):
> a. between newlines
> b. after a newline and before text
> c. after text and before a newline
> d. between text strings
>
> In cases a and b my output preserves the newlines and my
> later transforms
> which normalize whitespace are fine.
> In cases c and d the tag may be in the middle of a word or at
> either end of
> a word. When at either end it is again not a problem as the output XML
> contains a newline which is normalized acceptably. My
> difficulty comes with
> this kind of input:
>
> -----
> <Body>
> <A ID="something"></A>
> To delete a n<A ID="something"></A>
> ode:</Body>
> -----
>
> Note that there are several thousand node types which can
> potentially hold
> this kind of text content, so writing a "Body" template to
> manage it isn't
> really feasible.
> Ideally this input should become
> -----
> <Body>To delete a node:</Body>
> -----
>
> But of course the transform I'm using isn't doing this, I'm getting
> -----
> <Body>To delete a n
> ode:</Body>
> -----
>
> My ham-fisted attempts to come up with templates which
> (a) *reliably* identify this situation, and
> (b) *don't* lead to my dropping huge screeds of wanted XML
> are failing miserably. This doesn't seem like a terribly unusual
> requirement, but I can't find an answer in the FAQ or my
> current set of
> books. I've also read the (otherwise helpful) "controlling whitespace"
> articles by Bob DuCharme on xml.com.
>
> Could somebody please point me towards the right technique to
> use here?
>
> Thanks
> Trevor
