RE: Ignoring HTML - A Solution

Subject: RE: Ignoring HTML - A Solution
From: Jay Burgess <lists@xxxxxxxxxxx>
Date: Tue, 21 Jun 2005 08:01:38 -0700
I thought I'd post a solution to my request last week to remove "HTML tags" from
a block of XML.  There may be a better way to do this, but this seems to work in
my case. Thanks for everyone's input.

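<!-- Recursively strips escaped tags (&lt;...&gt;) from $text and outputs the rest.
     Assumes every '>' in $text closes a tag opened by an earlier '<'. -->
<xsl:template name="strip-HTML">
    <xsl:param name="text"/>
    <xsl:choose>
        <xsl:when test="contains($text, '&gt;')">
            <xsl:choose>
                <!-- Output the plain text that precedes the next tag. -->
                <xsl:when test="contains($text, '&lt;')">
                    <xsl:value-of select="substring-before($text, '&lt;')"/>
                </xsl:when>
                <!-- A '>' with no '<' left: keep the text before it. -->
                <xsl:otherwise>
                    <xsl:value-of select="substring-before($text, '&gt;')"/>
                </xsl:otherwise>
            </xsl:choose>
            <!-- Recurse on whatever follows the first '>'. -->
            <xsl:call-template name="strip-HTML">
                <xsl:with-param name="text" select="substring-after($text, '&gt;')"/>
            </xsl:call-template>
        </xsl:when>
        <xsl:otherwise>
            <!-- No '>' left, so no more tags: output the remainder as-is. -->
            <xsl:value-of select="$text"/>
        </xsl:otherwise>
    </xsl:choose>
</xsl:template>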
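
For anyone whose text may contain a ">" before the first tag, here is a slightly more defensive variant. It is only a sketch, not what was posted above. It finds the next "<" first and then the ">" that follows it, so a stray ">" is kept rather than treated as the end of a tag. The template name strip-tags and the match on title are assumptions for illustration (title is borrowed from the sample in the original question).

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:output method="text"/>

    <!-- Hypothetical caller: strip escaped markup from each title element. -->
    <xsl:template match="title">
        <xsl:call-template name="strip-tags">
            <xsl:with-param name="text" select="string(.)"/>
        </xsl:call-template>
    </xsl:template>

    <!-- Hypothetical variant: drops each '<' ... '>' span, keeps everything else. -->
    <xsl:template name="strip-tags">
        <xsl:param name="text"/>
        <!-- Everything after the first '<' (empty if there is none). -->
        <xsl:variable name="after-open" select="substring-after($text, '&lt;')"/>
        <xsl:choose>
            <xsl:when test="contains($text, '&lt;') and contains($after-open, '&gt;')">
                <!-- Keep the text before the tag, then continue after its '>'. -->
                <xsl:value-of select="substring-before($text, '&lt;')"/>
                <xsl:call-template name="strip-tags">
                    <xsl:with-param name="text" select="substring-after($after-open, '&gt;')"/>
                </xsl:call-template>
            </xsl:when>
            <xsl:otherwise>
                <!-- No complete '<' ... '>' pair left: output the rest unchanged. -->
                <xsl:value-of select="$text"/>
            </xsl:otherwise>
        </xsl:choose>
    </xsl:template>

</xsl:stylesheet>
```

Run against <title>&lt;b&gt;Text&lt;/b&gt; That I Want</title>, this should output "Text That I Want". It still treats any "<" that is later followed by a ">" as a tag, so genuine less-than signs in the text can be swallowed.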

Jay

| Jay Burgess [Vertical Technology Group]
| "Essential Technology Links via RSS"
| http://www.vtgroup.com/

> Subject: Re: Ignoring HTML
> From: "Sam D. Chuparkoff" <sdc@xxxxxxxxxx>
> Date: Fri, 17 Jun 2005 13:39:59 -0700
> 
> On the dangerous side, I'd try something like:
> 
> perl -ne '$c.=$_;eof&&($c=~s/&lt;(([^<>](?!&lt;))*?)&gt;//sg&print$c);' foo.xml
> 
> Because it will probably be fine. For extra danger points, you can put
> it in a Makefile with no comment.
> 
> You should be able to do something similar with xsl, but of course this
> isn't very safe, and I think it would be a lot more complicated.
> 
> s/&lt;(([^<>](?!&lt;))*?)&gt;//sg;
> 
> This is '&lt;' some text '&gt;' with no intervening '&lt;', '<', or '>'
> replaced with nothing. I thought about actually trying to turn this
> content into xml, but note there's no close quote on that style
> attribute! Watch out!
> 
> sdc
> 
> On Fri, 2005-06-17 at 15:13 -0500, Jon Gorman wrote:
> On 6/17/05, Jay Burgess <lists@xxxxxxxxxxx> wrote:
> > I apologize if this is in the FAQ, but I've searched and can't find it.  (I'm
> > kind of new to XSL, so I may just have not seen it.)
> 
> This is a FAQ of sorts, but I had a bit of a hard time finding an
> answer to it in Dave Pawson's FAQ as well.  Of course, I only took a
> quick glance.  I'd recommend skimming the CDATA section as well.
> 
> > 
> > I've got some XML that contains HTML-formatted text.  For example:
> > 
> > <title>&lt;SPAN style="font-size: 13pt; font-family: Verdana; &gt;The
> > &lt;b&gt;Text&lt;/b&gt; That I Want&lt;/SPAN&gt;</title>
> > 
> 
> "HTML-formatted text" is a little bit nonsensical.  HTML itself says
> that &lt; is meant as a stand-in for <, so when you have it, it's not a
> tag.  Since namespaces were rather slow to get off the ground, we ended
> up seeing people put so-called "HTML" in XML *cough* RSS *cough*.  But
> to any XML application, this is one big chunk of text.
> 
> So, some possible advice:
> 
> 1) if you can change the input format so that it uses namespaces and
> actually embeds real XHTML into the documents you're creating, do so. 
> Or at least have it be an option.
> 
> 2) If you can't do that, I'm sure you can find a more general solution
> if you hunt through the archives.  The essential solution will
> probably be along the lines of looking for &lt; and &gt;s and throwing
> out any text between them via some of the XPath/XSLT string functions.
> Might be much easier with XSLT 2.0.
> 
> 3) It may be possible, with a combination of d-o-e
> (disable-output-escaping) and multiple transformations, regex
> scripting or other techniques, to replace the various &lt; and &gt; in
> certain elements but not others, then reprocess that document through
> your final stylesheet.  Of course, this makes it slightly dangerous.
> 
> Dig through the archives: there might be a more general solution
> already done, or someone else will be able to give you one instead of
> just giving you some ranting.  (I blame Friday afternoon and a slow
> server for my current long-winded explanation of why this type of
> embedding is evil.)
> 
> Short answer, it's probably not difficult as long as it's relatively
> straightforward.  If the "html" inside the xml is complex at all or
> you are using &lt; in other places, you might have difficulty.
> 
> Extremely simple if you can just have the input source use namespaces
> and you're comfortable with how XSLT deals with namespaces.
> 
> Jon Gorman
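
Jon's point 2 above mentions that XSLT 2.0 might make this easier. As a rough sketch (nobody in the thread posted this), the recursion can be replaced by a single call to the XPath 2.0 replace() function. It assumes an XSLT 2.0 processor and, again, a title element holding the escaped markup.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:output method="text"/>

    <!-- Assumes an XSLT 2.0 processor; replace() is not available in XSLT 1.0. -->
    <xsl:template match="title">
        <!-- After XML parsing the pattern reads <[^<>]*> : a '<', any run of
             characters other than '<' or '>', then the closing '>'. -->
        <xsl:value-of select="replace(., '&lt;[^&lt;&gt;]*&gt;', '')"/>
    </xsl:template>

</xsl:stylesheet>
```

Like the recursive templates, this is plain pattern matching and not HTML parsing, so a literal "<" in the text that is later followed by a ">" will be removed along with everything between them.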
