[XSL-LIST Mailing List Archive Home] [By Thread] [By Date] [Recent Entries] [Reply To This Message]

Finding "unknown" character references

Subject: Finding "unknown" character references
From: "Huditsch Roman" <Roman.Huditsch@xxxxxxxxxxxxx>
Date: Tue, 22 Nov 2005 13:53:21 +0100
unknown character
Hi,

I need to check for "unknown" character references within my XML files.
All valid character references are stored in another XML file
("invalid_chars.xml") :

<?xml version="1.0" encoding="ISO-8859-1"?>
<chars>
   <char>&#8222;</char>
   <char>&#8218;</char>
   <char>&#353;</char>
   ...
</chars>

Everytime a special character, which is unknown to this reference file,
is encountered,
it should be outputted.


What I've come up with is:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" encoding="ISO-8859-1" indent="yes"/>

    <xsl:template name="init">
        <xsl:variable name="source" select="'../dirlist.txt'"/>
        <xsl:variable name="encoding" select="'iso-8859-1'"/>
	<xsl:variable name="src">
	    <errors>
	       <xsl:for-each select="tokenize(unparsed-text($source,
$encoding), '\r?\n')">
		<file>
		    <xsl:value-of select="."/>
		</file>
	        </xsl:for-each>
	     </errors>
	</xsl:variable>
	<xsl:result-document href="invalid_chars.xml">
	    <errors>
	         <xsl:apply-templates select="$src//file[text()]"/>
	     </errors>
	</xsl:result-document>
      </xsl:template>

      <xsl:template match="file">
	<xsl:analyze-string select="."
regex="^([0-9]{{2}}.[0-9]{{2}}.[0-9]{{4}})\p{{Zs}}+([0-9]{{2}}:[0-9]{{2}
})\p{{Zs}}+([0-9.]+)(.*)$" flags="ix">
    	      <xsl:matching-substring>
	            <xsl:variable name="docpath"
select="normalize-space(regex-group(4))"/>
	             <xsl:variable name="errors" as="element()*">

			<!-- Here my problems begin -->

			<xsl:if
test="document($docpath)//text()[contains(., '&amp;')]">
		   	      <xsl:analyze-string select="."
regex="&amp;[#a-zA-Z0-9]+" flags="i">
				<xsl:matching-substring>
			  	     <xsl:if
test="document('valid_cahrs.xml')//char=concat(regex-group(0), ';')">
					<file>
					    <char>
					         <xsl:value-of
select="."/>
					         <xsl:text>;</xsl:text>
					    </char>
					</file>
				      </xsl:if>
				</xsl:matching-substring>
			        </xsl:analyze-string>
			  </xsl:if>
		      </xsl:variable>

		      <xsl:for-each-group select="$errors"
group-by="@name">
			<file name="{current-grouping-key()}">
		   	     <xsl:for-each select="current-group()">
				<xsl:copy-of select="*"/>
			      </xsl:for-each>
			</file>
		       </xsl:for-each-group>
		</xsl:matching-substring>
	        <xsl:non-matching-substring/>
	   </xsl:analyze-string>
       </xsl:template>
</xsl:stylesheet>


Example XML:

<?xml version="1.0" encoding="ISO-8859-1"?>
<root>
     <absatz> &#8222;dyplom licencjata piel&#281;gniarstwa&#8220;
</absatz>
</root>


How can I check, if a text node contains a character unknown to my
valid_chars.xml?
Do you have any ideas, how I can make my XSLT work?
Thank you very much.

wbr,
Roman

Current Thread

PURCHASE STYLUS STUDIO ONLINE TODAY!

Purchasing Stylus Studio from our online shop is Easy, Secure and Value Priced!

Buy Stylus Studio Now

Download The World's Best XML IDE!

Accelerate XML development with our award-winning XML IDE - Download a free trial today!

Don't miss another message! Subscribe to this list today.
Email
First Name
Last Name
Company
Subscribe in XML format
RSS 2.0
Atom 0.3
Site Map | Privacy Policy | Terms of Use | Trademarks
Free Stylus Studio XML Training:
W3C Member
Stylus Studio® and DataDirect XQuery ™are products from DataDirect Technologies, is a registered trademark of Progress Software Corporation, in the U.S. and other countries. © 2004-2013 All Rights Reserved.