Here's how to process XML documents written in German
Hi Folks,

Thanks to Wolfgang Laun for some German translations.

Scenario: Your application has just received this XML document containing a contract written in German:

    <?xml version="1.0" encoding="UTF-8"?>
    <Kontrakt>
        <Posten währung="EUR">23.45</Posten>
        <Posten währung="EUR">45.00</Posten>
        <Posten währung="USD">39.99</Posten>
        <Posten>99.00</Posten>
        <Posten monetär-allianz="EUR">66.66</Posten>
    </Kontrakt>

Your application wants to compute the sum of all the items (Posten) with currency (währung) in Euros. Clearly the result should be:

    23.45 + 45.00 = 68.45

The application applies this XPath expression to the XML document:

    sum(//Posten[@währung eq 'EUR'])

The output is this:

    23.45

Wrong result! What happened?

The XPath seems pretty straightforward: Give me the sum of all Posten elements that have an attribute währung equal to 'EUR'. We need to dig into this a bit to see exactly what is going on.

First some background information: According to Unicode the character ä can be represented in these equivalent ways:

    1. As just ä (this is called a precomposed character)
    2. As a combination of 'a' plus a "combining diaeresis" character

Most display tools render both forms identically. So even though these two attributes appear identical:

    währung="EUR"    (precomposed ä)
    währung="EUR"    ('a' + combining diaeresis)

inside the computer the bytes are very different. In UTF-8:

    ä (U+00E4) is represented as these bytes: C3 A4
    'a' + combining diaeresis (U+0061 U+0308) is represented as these bytes: 61 CC 88

"So what?" you ask. Well, the XPath engine compares names code point by code point and does not normalize them first. The XPath expression used the precomposed character, and only the first Posten in the document uses it; the second uses the combining form. So the XPath engine found one währung="EUR" but not the other.

How do we design the XPath to find all occurrences of währung="EUR" regardless of the Unicode form that is used? Our XPath expression needs to express this:

    Give me the sum of all Posten elements that have an attribute whose name after normalization is währung and has a value equal to 'EUR'.
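This XPath expression does the job (it needs XPath 2.0 or later, since it uses eq and normalize-unicode()):

    sum(//Posten[@*[normalize-unicode(name(.)) eq normalize-unicode('währung')][. eq 'EUR']])

The normalize-unicode() function converts a string, here the attribute name, into a standard, canonical form (NFC by default). Because both sides of the comparison are normalized, the match no longer depends on which form the document, or the expression itself, happens to use.

Lesson Learned: When processing markup with diacritical marks, beware that two characters may visually appear the same but inside the computer they are represented very differently. Design XPath expressions accordingly -- use normalize-unicode() to convert markup into canonical form.

/Roger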
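P.S. For anyone who wants to reproduce this without an XPath 2.0 engine, here is a minimal Python sketch of the same idea. Python, the standard-library xml.etree.ElementTree parser, and the helper names nfc() and sum_items() are assumptions made for illustration; they are not part of the scenario above. The sketch builds the contract with explicit escapes so the two forms of ä cannot be confused, then sums the EUR items with and without normalizing the attribute names.

```python
import unicodedata
import xml.etree.ElementTree as ET
from decimal import Decimal

# The two spellings of the attribute name, written with escapes so they
# stay distinct no matter how an editor displays them.
PRECOMPOSED = "w\u00e4hrung"   # ä as the single code point U+00E4
DECOMPOSED = "wa\u0308hrung"   # 'a' followed by U+0308 COMBINING DIAERESIS

# The contract from the scenario (XML declaration omitted for brevity).
DOC = (
    "<Kontrakt>"
    f'<Posten {PRECOMPOSED}="EUR">23.45</Posten>'
    f'<Posten {DECOMPOSED}="EUR">45.00</Posten>'
    f'<Posten {PRECOMPOSED}="USD">39.99</Posten>'
    "<Posten>99.00</Posten>"
    '<Posten monet\u00e4r-allianz="EUR">66.66</Posten>'
    "</Kontrakt>"
)


def nfc(text):
    """Return text in Unicode Normalization Form C (hypothetical helper)."""
    return unicodedata.normalize("NFC", text)


def sum_items(root, attr_name, value, normalize):
    """Sum the Posten elements carrying attr_name == value (hypothetical helper).

    With normalize=False the attribute names are compared as-is, which
    mirrors the failing XPath. With normalize=True both names are put
    into NFC first, which mirrors the normalize-unicode() version.
    """
    wanted = nfc(attr_name) if normalize else attr_name
    total = Decimal("0")
    for item in root.iter("Posten"):
        for name, attr_value in item.attrib.items():
            candidate = nfc(name) if normalize else name
            if candidate == wanted and attr_value == value:
                total += Decimal(item.text)
    return total


root = ET.fromstring(DOC.encode("utf-8"))
naive_total = sum_items(root, PRECOMPOSED, "EUR", normalize=False)
normalized_total = sum_items(root, PRECOMPOSED, "EUR", normalize=True)

print("\u00e4".encode("utf-8").hex())    # UTF-8 bytes of precomposed ä
print("a\u0308".encode("utf-8").hex())   # UTF-8 bytes of 'a' + combining diaeresis
print(naive_total)
print(normalized_total)
```

The loop over item.attrib plays the role of @* in the XPath: since the stored attribute name may be in either form, a direct lookup by name is not enough, and every attribute name has to be normalized before it is compared.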