On 6/13/2023 12:48 AM, Manuel Souto Pico terminolator@xxxxxxxxx wrote:
I'm trying to convert a collection of XLIFF files into TMX. The files
contain some HTML named entities, which makes my stylesheet choke:
My question is: Is there any way I can avoid or fix this problem from
the XSLT stylesheet without having to modify the input XLIFF files?
The example above is with ndash but I believe there must be many HTML
named entities in the files.
David Carlisle wrote an HTML tag soup parser in XSLT 2
(https://github.com/davidcarlisle/web-xslt/blob/main/htmlparse/htmlparse.xsl)
that knows all the HTML named entities. It can also be used as an XML
parser that recognizes those entities, so if you import his stylesheet
and call its d:htmlparse function instead of relying on normal XML
parsing, as in
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="3.0"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:d="data:,dpc"
  exclude-result-prefixes="#all"
  expand-text="yes">
  <xsl:import href="https://raw.githubusercontent.com/davidcarlisle/web-xslt/main/htmlparse/htmlparse.xsl"/>
  <xsl:param name="xml-uri" as="xs:string" select="'sample1.xml'"/>
  <xsl:mode on-no-match="shallow-copy"/>
  <xsl:template name="xsl:initial-template">
    <xsl:apply-templates select="unparsed-text($xml-uri) => d:htmlparse('', false())"/>
  </xsl:template>
</xsl:stylesheet>
the named entity references should be parsed into the corresponding
characters (and you can process all nodes by adding whatever templates
you need to transform the XML). The above assumes you run e.g.
Saxon 9.8 or later with `-it` so that processing starts at the initial
template.
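
To illustrate the "adding any templates" part, here is a minimal sketch of how the parsed XLIFF might be mapped to TMX. It is not a complete converter and rests on several assumptions: the input is XLIFF 1.2 with `trans-unit`, `source` and `target` elements and `source-language`/`target-language` attributes on `file`; elements are matched by local name (`*:`) because I have not checked which namespace the parsed elements end up in; and inline markup inside segments is flattened to plain text. A valid TMX file also needs a `header` element, which is omitted here.

```xslt
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="3.0"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:d="data:,dpc"
  exclude-result-prefixes="#all"
  expand-text="yes">

  <xsl:import href="https://raw.githubusercontent.com/davidcarlisle/web-xslt/main/htmlparse/htmlparse.xsl"/>

  <xsl:param name="xml-uri" as="xs:string" select="'sample1.xml'"/>

  <xsl:output indent="yes"/>

  <!-- Read the file as text and let d:htmlparse resolve the named entities,
       then process only the translation units. -->
  <xsl:template name="xsl:initial-template">
    <tmx version="1.4">
      <!-- Assumption: a real TMX file needs a <header> element here. -->
      <body>
        <xsl:apply-templates
          select="(unparsed-text($xml-uri) => d:htmlparse('', false()))//*:trans-unit"/>
      </body>
    </tmx>
  </xsl:template>

  <!-- Assumption: XLIFF 1.2 structure; matched by local name only. -->
  <xsl:template match="*:trans-unit">
    <tu>
      <tuv xml:lang="{ancestor::*:file/@source-language}">
        <seg>{*:source}</seg>
      </tuv>
      <tuv xml:lang="{ancestor::*:file/@target-language}">
        <seg>{*:target}</seg>
      </tuv>
    </tu>
  </xsl:template>

</xsl:stylesheet>
```

As before, this would be run with `-it`, passing the input file name through the `xml-uri` parameter.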