Re: I used XSLT streaming to generate a training corpus forEng
Great example, Roger.

Maybe you could try processing bigger files and different kinds of processing. One example is periodically comparing the current file with the previous one, determining all the changes that occurred during that period, and then producing change documents by region.

This involves synchronized (double) streaming and would be both challenging and instructive.

Cheers,
Dimitre

On Sat, Sep 21, 2013 at 8:34 AM, Costello, Roger L. <costello@mitre.org> wrote:
> Hi Folks,
>
> The Open Street Map XML file for South Korea
>
> http://downloads.cloudmade.com/asia/eastern_asia/south_korea/south_korea.osm.bz2
>
> is quite interesting. Each <node> element contains data about a thing
> (airport, university, office, bus stop, etc.) in South Korea. Within each
> <node> element is a <tag> element that shows the name of the thing in
> English and another <tag> element that shows its name in Korean. For
> example, this <node> element contains the name of an airport in English and
> Korean:
>
> <node lat="37.5582" lon="126.7906">
>     <tag k="name:en" v="Gimpo International Airport"/>
>     <tag k="name:ko" v="김포국제공항"/>
> </node>
>
> The English name is identified by @k="name:en" and the Korean name is
> identified by @k="name:ko" (@k means "key" and @v means "value").
>
> These pairs of values may be collected and then used to train an
> English-Korean language translator tool.
>
> The Open Street Map XML file is quite large -- 464 MB -- so I elected to
> extract all the English-Korean pairs using XSLT streaming. I wrote an XSLT
> streaming program (see below) and ran it. It generated over 30,000
> English-Korean pairs.
>
> Here is a sample of the output:
>
> <English-Korean>
>     <translation>
>         <English>Gimpo International Airport</English>
>         <Korean>김포국제공항</Korean>
>     </translation>
>     <translation>
>         <English>Incheon International Airport</English>
>         <Korean>인천국제공항</Korean>
>     </translation>
>     <translation>
>         <English>South Korea</English>
>         <Korean>대한민국</Korean>
>     </translation>
>     <translation>
>         <English>Jeju-si</English>
>         <Korean>제주시</Korean>
>     </translation>
>     <translation>
>         <English>Munui</English>
>         <Korean>문의</Korean>
>     </translation>
>     <translation>
>         <English>Bukcheon Junction</English>
>         <Korean>북천교차로</Korean>
>     </translation>
>
>     ...
>
>     <translation>
>         <English>Odong Islet</English>
>         <Korean>오동도</Korean>
>     </translation>
>     <translation>
>         <English>To Sinwon, Hapcheon, Chunjeon</English>
>         <Korean>신원, 합천, 춘전 방면</Korean>
>     </translation>
> </English-Korean>
>
> Here is the streaming XSLT program:
>
> -------------------------------------------------------
> generate-training-corpus.xsl
> -------------------------------------------------------
> <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>                 xmlns:xs="http://www.w3.org/2001/XMLSchema"
>                 exclude-result-prefixes="#all"
>                 version="3.0">
>
>     <xsl:output method="xml" />
>
>     <xsl:template match="/">
>         <xsl:stream href="../huge-file-Korea/south_korea.xml">
>             <English-Korean>
>                 <xsl:for-each select="osm">
>                     <xsl:iterate select="node">
>                         <xsl:variable name="thisNode" select="copy-of(.)"/>
>                         <xsl:if test="$thisNode[tag[@k eq 'name:en'] and tag[@k eq 'name:ko']]">
>                             <translation>
>                                 <English><xsl:value-of select="$thisNode/tag[@k eq 'name:en']/@v" /></English>
>                                 <Korean><xsl:value-of select="$thisNode/tag[@k eq 'name:ko']/@v" /></Korean>
>                             </translation>
>                             <xsl:next-iteration />
>                         </xsl:if>
>                     </xsl:iterate>
>                 </xsl:for-each>
>             </English-Korean>
>         </xsl:stream>
>     </xsl:template>
>
> </xsl:stylesheet>
>
> /Roger

--
Cheers,
Dimitre Novatchev
---------------------------------------
Truly great madness cannot be achieved without significant intelligence.
---------------------------------------
To invent, you need a good imagination and a pile of junk
-------------------------------------
Never fight an inanimate object
-------------------------------------
To avoid situations in which you might make mistakes may be the biggest mistake of all
------------------------------------
Quality means doing it right when no one is looking.
-------------------------------------
You've achieved success in your field when you don't know whether what you're doing is work or play
-------------------------------------
Facts do not cease to exist because they are ignored.
-------------------------------------
Typing monkeys will write all Shakespeare's works in 200yrs. Will they write all patents, too? :)
-------------------------------------
I finally figured out the only reason to be alive is to enjoy it.
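
For comparison, the extraction described in the quoted message can also be sketched outside XSLT. What follows is a minimal Python sketch, not Roger's method: it swaps XSLT streaming for the standard library's `xml.etree.ElementTree.iterparse`, reads the document incrementally, and discards each top-level element once it has been inspected. The helper names `iter_name_pairs` and `write_corpus`, and the inline sample data other than the Gimpo node from the message, are assumptions made for illustration. As a side note, the quoted stylesheet uses `xsl:stream` from the 2013 working draft; the final XSLT 3.0 Recommendation renamed that instruction `xsl:source-document`.

```python
import io
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def iter_name_pairs(source):
    """Yield (english, korean) for each top-level <node> with both name tags.

    `source` is a file name or a binary file object holding OSM XML.
    (Hypothetical helper, named for this sketch only.)
    """
    depth = 0
    root = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if root is None:
                root = elem  # the document element, <osm> in the OSM file
            depth += 1
            continue
        depth -= 1
        if depth != 1:
            continue  # only act when a direct child of the root has ended
        if elem.tag == "node":
            # Map @k to @v for this node's <tag> children.
            tags = {t.get("k"): t.get("v") for t in elem.findall("tag")}
            if "name:en" in tags and "name:ko" in tags:
                yield tags["name:en"], tags["name:ko"]
        # Discard the finished child so memory use stays roughly constant.
        root.clear()


def write_corpus(pairs, out):
    """Write the pairs in the <English-Korean> layout shown in the message."""
    out.write("<English-Korean>\n")
    for english, korean in pairs:
        out.write("  <translation>\n")
        out.write(f"    <English>{escape(english)}</English>\n")
        out.write(f"    <Korean>{escape(korean)}</Korean>\n")
        out.write("  </translation>\n")
    out.write("</English-Korean>\n")


# Tiny in-memory stand-in for the 464 MB file. Only the first node comes
# from the message; the second node and the <way> are made-up test data.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<osm>
  <node lat="37.5582" lon="126.7906">
    <tag k="name:en" v="Gimpo International Airport"/>
    <tag k="name:ko" v="김포국제공항"/>
  </node>
  <node lat="0" lon="0">
    <tag k="name:en" v="English only"/>
  </node>
  <way>
    <tag k="name:en" v="A road"/>
    <tag k="name:ko" v="도로"/>
  </way>
</osm>
""".encode("utf-8")

pairs = list(iter_name_pairs(io.BytesIO(SAMPLE)))
buffer = io.StringIO()
write_corpus(pairs, buffer)
corpus_xml = buffer.getvalue()
print(f"{len(pairs)} pair(s) extracted")
```

For the real data, the path of the decompressed file would be passed to `iter_name_pairs` in place of the in-memory sample, and `write_corpus` would be given a file opened with UTF-8 encoding. Like the stylesheet, the sketch only considers `node` elements directly under the document element, so the names on the `way` element are ignored.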