XML-DEV Mailing List Archive
Re: I used XSLT streaming to generate a training corpus for English-Korean language translation
Hi Roger,

interesting use case you worked on.

I had to modify your (XSLT 3.0) stylesheet again so that DataPower can process it in streaming mode:

$ curl --data-binary @south_korea.osm http://firestar:2111 > out.xml
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  455M    0 1985k  100  453M   180k  41.2M  0:00:10  0:00:10 --:--:-- 46.9M
$
$ xpath++ "count(/English-Korean/translation)" out.xml
15568
$

How long did the transformation take on your system?

> It generated over 30,000 English-Korean pairs.
>

As can be seen above, I got 15568 translation nodes. What could explain the difference?

These are the first and last entries from my run:

$ xpath++ "/English-Korean/translation[position() <= 3]" out.xml
-------------------------------------------------------------------------------
<translation><English>Gimpo International Airport</English><Korean>김포국제공항</Korean></translation>
-------------------------------------------------------------------------------
<translation><English>Incheon International Airport</English><Korean>인천국제공항</Korean></translation>
-------------------------------------------------------------------------------
<translation><English>South Korea</English><Korean>대한민국</Korean></translation>
$
$ xpath++ "/English-Korean/translation[position() >= last()-2]" out.xml
-------------------------------------------------------------------------------
<translation><English>Jung-ang-dong Rotary</English><Korean>중앙동로터리</Korean></translation>
-------------------------------------------------------------------------------
<translation><English>Odong Islet</English><Korean>오동도</Korean></translation>
-------------------------------------------------------------------------------
<translation><English>To Sinwon, Hapcheon, Chunjeon</English><Korean>신원, 합천, 춘전방면</Korean></translation>
$

Mit besten Gruessen / Best wishes,

Hermann Stamm-Wilbrandt
Level 3 support for XML Compiler team and Fixpack team lead WebSphere
DataPower SOA Appliances
https://www.ibm.com/developerworks/mydeveloperworks/blogs/HermannSW/
https://twitter.com/HermannSW/
http://www.stamm-wilbrandt.de/ce/
----------------------------------------------------------------------
IBM Deutschland Research & Development GmbH
Chair of the Supervisory Board: Martina Koederitz
Management: Dirk Wittkopp
Registered office: Boeblingen
Commercial register: Amtsgericht Stuttgart, HRB 243294

From:    "Costello, Roger L." <costello@mitre.org>
To:      "xml-dev@lists.xml.org" <xml-dev@lists.xml.org>
Date:    09/21/2013 05:36 PM
Subject: I used XSLT streaming to generate a training corpus for English-Korean language translation

Hi Folks,

The Open Street Map XML file for South Korea

    http://downloads.cloudmade.com/asia/eastern_asia/south_korea/south_korea.osm.bz2

is quite interesting. Each <node> element contains data about a thing (airport, university, office, bus stop, etc.) in South Korea. Within each <node> element is a <tag> element that shows the name of the thing in English and another <tag> element that shows its name in Korean. For example, this <node> element contains the name of an airport in English and Korean:

    <node lat="37.5582" lon="126.7906">
        <tag k="name:en" v="Gimpo International Airport"/>
        <tag k="name:ko" v="김포국제공항"/>
    </node>

The English name is identified by @k="name:en" and the Korean name is identified by @k="name:ko" (@k means "key" and @v means "value").

These pairs of values may be collected and then used to train an English-Korean language translator tool.

The Open Street Map XML file is quite large -- 464 MB -- so I elected to extract all the English-Korean pairs using XSLT streaming.

I wrote an XSLT streaming program (see below) and ran it. It generated over 30,000 English-Korean pairs.
Here is a sample of the output:

    <English-Korean>
        <translation>
            <English>Gimpo International Airport</English>
            <Korean>김포국제공항</Korean>
        </translation>
        <translation>
            <English>Incheon International Airport</English>
            <Korean>인천국제공항</Korean>
        </translation>
        <translation>
            <English>South Korea</English>
            <Korean>대한민국</Korean>
        </translation>
        <translation>
            <English>Jeju-si</English>
            <Korean>제주시</Korean>
        </translation>
        <translation>
            <English>Munui</English>
            <Korean>문의</Korean>
        </translation>
        <translation>
            <English>Bukcheon Junction</English>
            <Korean>북천교차로</Korean>
        </translation>
        …
        <translation>
            <English>Odong Islet</English>
            <Korean>오동도</Korean>
        </translation>
        <translation>
            <English>To Sinwon, Hapcheon, Chunjeon</English>
            <Korean>신원, 합천, 춘전방면</Korean>
        </translation>
    </English-Korean>

Here is the streaming XSLT program:

-------------------------------------------------------
    generate-training-corpus.xsl
-------------------------------------------------------
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                exclude-result-prefixes="#all"
                version="3.0">

    <xsl:output method="xml" />

    <xsl:template match="/">
        <xsl:stream href="../huge-file-Korea/south_korea.xml">
            <English-Korean>
                <xsl:for-each select="osm">
                    <xsl:iterate select="node">
                        <xsl:variable name="thisNode" select="copy-of(.)"/>
                        <xsl:if test="$thisNode[tag[@k eq 'name:en'] and tag[@k eq 'name:ko']]">
                            <translation>
                                <English><xsl:value-of select="$thisNode/tag[@k eq 'name:en']/@v" /></English>
                                <Korean><xsl:value-of select="$thisNode/tag[@k eq 'name:ko']/@v" /></Korean>
                            </translation>
                            <xsl:next-iteration />
                        </xsl:if>
                    </xsl:iterate>
                </xsl:for-each>
            </English-Korean>
        </xsl:stream>
    </xsl:template>

</xsl:stylesheet>

/Roger
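
For readers without an XSLT 3.0 streaming processor at hand, the same extraction can be approximated with Python's standard library. This is only a sketch under stated assumptions, not the stylesheet above and not the DataPower variant: the function name `iter_name_pairs` is hypothetical, and the second `<node>` in the sample is made-up data that shows a node without a Korean name being skipped. It reads the input incrementally with `xml.etree.ElementTree.iterparse`, so the whole file is never held in memory at once.

```python
import io
import xml.etree.ElementTree as ET


def iter_name_pairs(source):
    """Yield (English, Korean) name pairs from OSM XML (path or binary file object).

    Hypothetical helper for illustration only.
    """
    context = ET.iterparse(source, events=("start", "end"))
    # The first event is the start of the root element (<osm> in the OSM file).
    _event, root = next(context)
    for event, elem in context:
        if event != "end":
            continue
        if elem.tag == "node":
            # Collect the <tag k="..." v="..."/> children as a key/value map.
            tags = {t.get("k"): t.get("v") for t in elem.findall("tag")}
            if "name:en" in tags and "name:ko" in tags:
                yield tags["name:en"], tags["name:ko"]
        # Empty the root's child list so handled elements can be freed.
        # An element still being parsed keeps its own children.
        root.clear()


# Small in-memory sample; the second node is invented for illustration.
SAMPLE = """<osm>
  <node lat="37.5582" lon="126.7906">
    <tag k="name:en" v="Gimpo International Airport"/>
    <tag k="name:ko" v="김포국제공항"/>
  </node>
  <node lat="0" lon="0">
    <tag k="name:en" v="English only"/>
  </node>
</osm>"""

if __name__ == "__main__":
    count = 0
    for english, korean in iter_name_pairs(io.BytesIO(SAMPLE.encode("utf-8"))):
        print(english, korean)
        count += 1
    print("pairs:", count)
```

Running the same function over the full south_korea.osm file and counting the pairs would be one independent way to cross-check the two totals reported in this thread, assuming both runs used the same input file.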