
Re: Dealing mixed content with invalid node-like text

Subject: Re: Dealing mixed content with invalid node-like text
From: Brandon Ibach <brandon.ibach@xxxxxxxxxxxxxxxxxxx>
Date: Tue, 6 Dec 2011 19:22:08 -0500
If the text is "almost" XML, perhaps the easiest thing to do would be
to fix it so it really is XML, then use a character map to output it
as-is so your second pass can just parse it normally.  If all you need
to do is escape the angle-brackets in something like "<1a .>", your
"tag-text" template could be as simple as:

<xsl:value-of select="replace($unparsed, '&lt;(\S+\s+\.)&gt;',
'&amp;lt;$1&amp;gt;')"/>

And you would have declarations such as this at the top level:

<xsl:output method="xml" version="1.0" encoding="utf-8"
use-character-maps="xmlout"/>
<xsl:character-map name="xmlout">
  <xsl:output-character character="&lt;" string="&lt;"/>
  <xsl:output-character character="&gt;" string="&gt;"/>
  <xsl:output-character character="&amp;" string="&amp;"/>
</xsl:character-map>
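
To show how those two pieces fit together, here is a minimal self-contained sketch. The "main" entry template and the hard-coded sample line (taken from the example input quoted below) are assumptions for illustration only; in the real stylesheet the string would come from the line-processing templates.

```xslt
<?xml version="1.0" encoding="utf-8"?>
<!-- Sketch only: "main" and the hard-coded sample line are assumptions. -->
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" version="1.0" encoding="utf-8"
      use-character-maps="xmlout"/>

  <!-- Write markup characters found in text nodes without escaping them. -->
  <xsl:character-map name="xmlout">
    <xsl:output-character character="&lt;" string="&lt;"/>
    <xsl:output-character character="&gt;" string="&gt;"/>
    <xsl:output-character character="&amp;" string="&amp;"/>
  </xsl:character-map>

  <!-- Hypothetical entry point (for example, Saxon's -it:main option). -->
  <xsl:template name="main">
    <value>
      <xsl:call-template name="tag-text">
        <xsl:with-param name="unparsed"
            select="'Line two with &lt;1a .&gt; Title etc, &lt;i&gt;within&lt;/i&gt; &lt;b&gt;something&lt;/b&gt; etc'"/>
      </xsl:call-template>
    </value>
  </xsl:template>

  <!-- Escape only the pseudo-tags of the form "<token .>"; real tags pass through. -->
  <xsl:template name="tag-text">
    <xsl:param name="unparsed" required="yes"/>
    <xsl:value-of select="replace($unparsed, '&lt;(\S+\s+\.)&gt;',
                                  '&amp;lt;$1&amp;gt;')"/>
  </xsl:template>

</xsl:stylesheet>
```

With an XSLT 2.0 processor, the serialized value element should keep the i and b tags as real markup while "<1a .>" comes out with its angle brackets escaped, so the second pass can parse the result as ordinary XML.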

If you have other content being produced in the first pass, whose
correct output is threatened by this mapping, you may need to do some
additional replacements in your "tag-text" template, substituting
arbitrary characters (such as characters from the Unicode Private Use
area) for less-than, greater-than and ampersand, then adjusting the
character-map to map them back to their original forms.
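
A rough sketch of that variant follows, assuming U+E000, U+E001 and U+E002 never occur in the real data (the choice of code points is arbitrary) and again using a hypothetical "main" entry template with made-up sample content. Only the text that passes through "tag-text" gets the stand-in characters, so everything else is escaped as usual.

```xslt
<?xml version="1.0" encoding="utf-8"?>
<!-- Sketch only: the code points, "main" and the sample content are assumptions. -->
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" version="1.0" encoding="utf-8"
      use-character-maps="xmlout"/>

  <!-- Map the Private Use stand-ins back to raw markup characters on output. -->
  <xsl:character-map name="xmlout">
    <xsl:output-character character="&#xE000;" string="&lt;"/>
    <xsl:output-character character="&#xE001;" string="&gt;"/>
    <xsl:output-character character="&#xE002;" string="&amp;"/>
  </xsl:character-map>

  <!-- Hypothetical entry point (for example, Saxon's -it:main option). -->
  <xsl:template name="main">
    <doc>
      <!-- Ordinary text: serialized with the normal escaping. -->
      <note>a &lt; b &amp; c</note>
      <value>
        <xsl:call-template name="tag-text">
          <xsl:with-param name="unparsed"
              select="'Line one text &lt;b&gt;within valid node&lt;/b&gt; and like &lt;II .&gt; Title etc'"/>
        </xsl:call-template>
      </value>
    </doc>
  </xsl:template>

  <xsl:template name="tag-text">
    <xsl:param name="unparsed" required="yes"/>
    <!-- Step 1: escape the pseudo-tags, as before. -->
    <xsl:variable name="fixed"
        select="replace($unparsed, '&lt;(\S+\s+\.)&gt;', '&amp;lt;$1&amp;gt;')"/>
    <!-- Step 2: swap the markup characters for the Private Use stand-ins. -->
    <xsl:value-of
        select="translate($fixed, '&lt;&gt;&amp;', '&#xE000;&#xE001;&#xE002;')"/>
  </xsl:template>

</xsl:stylesheet>
```

The trade-off is that the input must be checked (or trusted) not to contain the chosen stand-in characters, since any that do occur would be turned into markup on output.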

This sort of markup hacking is not a road I'd recommend going down,
but if you have to do it, I can't really see a reason to do it in some
other language, if XSLT is what you're comfortable with.  Michael made
a good point about using a proper parser (which I wouldn't implement
in XSLT, as a first choice, even though it would be possible) if you
can put together a proper grammar for your input, but if a few regex
substitutions can get you safely to clean XML, the above approach may
suffice.

-Brandon :)


On Tue, Dec 6, 2011 at 5:42 PM, Karlmarx R <karlmarxr@xxxxxxxxx> wrote:
> Hello David,
>
> Yes, I do process the content in 2 stages: preprocess into one form of XML
> and then further process that into my final XML form. BUT BOTH are done in XSL
> within one single file, and the problem I reported is in the first-stage
> conversion itself. To make things clearer, here is a rough skeleton
> and explanation of my process. I get the entire content of the input into a
> variable $input-text, and then tokenize it to get each line of data into
> another variable, as below.
>
> <xsl:variable name="lines" select="tokenize($input-text, '\r?\n')"/>
>
> <!--then pass it to another template to process each line of data:-->
> <xsl:call-template name="process-lines">
>   <xsl:with-param name="lines" select="$lines"/>
> </xsl:call-template>
>
> <!-- And here, I further process it to select the REQUIRED value -->
> <xsl:template name="process-lines">
>   <xsl:param name="lines" as="xs:string*"/>
>
>   <xsl:for-each select="$lines">
>     <xsl:variable name="line-components" select="tokenize(., '\t')"/>
>
>     <xsl:for-each select="$line-components[position() = last()]">
>       <value>
>         <xsl:call-template name="tag-text">
>           <xsl:with-param name="unparsed" select="."/>
>         </xsl:call-template>
>       </value>
>     </xsl:for-each>
>   </xsl:for-each>
> </xsl:template>
>
>
> <!-- AND IT IS HERE, in this "tag-text" template, that I try to achieve what I
> explained in my original posting -->
> <xsl:template name="tag-text">
>   <xsl:param name="unparsed" required="yes"/>
>   <xsl:analyze-string select="$unparsed"
>       regex="^(.*?)&lt;(.+)&gt;(.*)&lt;/(.+)&gt;(.*?)$">
>
>   etc. as posted earlier.
>
> The skeleton input will be like (as I mentioned before):
>
> Line one text <b>within valid node</b> and like <II .> Title etc
> Line two with <1a .> Title etc, <i>within</i> <b>something</b> etc
> another line can be just normal text
> ....
>
> And it is vital here that I get the data in the way I wanted, so that our final
> output in stage two is correct. In view of this, I cannot use xsl:value-of
> with disable-output-escaping (d-o-e) here. As it seems this cannot be achieved
> in XSL (looks likely), I am trying to get my source corrected. But if there is a
> solution available, in XSL or with a better regex, I would be happy to use it.
> I hope the above clarifies your question.
>
> Thanks,
> Karl
