Here's how to process XML documents written in German

From: "Costello, Roger L." <costello@mitre.org>
To: "xml-dev@lists.xml.org" <xml-dev@lists.xml.org>
Date: Wed, 30 Jan 2013 18:47:05 +0000

Hi Folks,

Thanks to Wolfgang Laun for some German translations. 

Scenario: Your application has just received this XML document containing a contract written in German:

	<?xml version="1.0" encoding="UTF-8"?>
	<Kontrakt>
	    <Posten währung="EUR">23.45</Posten>
	    <Posten währung="EUR">45.00</Posten>
	    <Posten währung="USD">39.99</Posten>
	    <Posten>99.00</Posten>
	    <Posten monetär-allianz="EUR">66.66</Posten>
	</Kontrakt>

Your application wants to compute the sum of all the items (Posten) with currency (währung) in Euros.

Clearly the result should be:

	23.45 + 45.00 = 68.45

The application applies this XPath expression to the XML document:

	sum(//Posten[@währung eq 'EUR'])

The output is this:

	23.45

Wrong result! 

What happened? 

The XPath seems pretty straightforward:

	Give me the sum of all Posten
	elements that have an attribute 
	währung equal to 'EUR'.

We need to dig into this a bit to see exactly what is going on.

First some background information:

According to Unicode, the character ä can be represented in these equivalent ways:

1. As just ä (this is called a precomposed character)

2. As a combination of 'a' plus a "combining diaeresis" character

Visualization tools display both ways identically. 
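The two representations can be inspected outside of XPath as well. The following is a minimal Python sketch, not part of the original scenario; the variable names and the utf8_hex helper are made up for illustration. It builds both forms from explicit code points, shows that they compare unequal as raw strings, and shows that Unicode normalization maps one onto the other.

```python
import unicodedata

precomposed = "\u00e4"   # ä as a single code point (U+00E4)
decomposed = "a\u0308"   # 'a' followed by COMBINING DIAERESIS (U+0308)

def utf8_hex(text):
    """Return the UTF-8 encoding of text as space-separated hex bytes."""
    return text.encode("utf-8").hex(" ").upper()

print(precomposed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
print(utf8_hex(precomposed))                                    # C3 A4
print(utf8_hex(decomposed))                                     # 61 CC 88
```

NFC composes 'a' + combining diaeresis into the single character; NFD does the reverse.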

So even though these two attributes appear identical:

	währung="EUR"
	währung="EUR"

inside the computer the bytes are very different:

	ä is represented in the computer as 
	these bytes: C3 A4

	'a' + combining diaeresis character is 
	represented in the computer as these 
	bytes: 61 CC 88

"So what?" you ask. Well, the XPath engine matches names exactly, code point by code point (in effect, byte by byte), without normalizing them first. The XPath expression used the precomposed character, so the engine found one währung="EUR" but not the other.

How do we design the XPath to find all occurrences of währung="EUR" regardless of the Unicode form that is used?

Our XPath expression needs to express this:

	Give me the sum of all Posten
	elements that have an attribute 
	whose name after normalization 
	is währung and has a value equal 
	to 'EUR'.

This XPath expression does the job:

	sum(//Posten[@*[normalize-unicode(name(.)) eq normalize-unicode('währung')][. eq 'EUR']])

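The normalize-unicode() function converts a string, here the attribute name, into a standard, canonical form. Called with a single argument it uses Normalization Form C (NFC), in which 'a' plus the combining diaeresis becomes the precomposed ä.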
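For readers who want to try the same idea outside an XPath 2.0 engine, here is a rough Python sketch using only the standard library. It is an illustration under assumptions, not the method used above: the helper names (nfc, sum_exact, sum_normalized) are hypothetical, the contract is rebuilt with \u escapes so the two spellings are explicit, and unicodedata.normalize("NFC", ...) stands in for normalize-unicode().

```python
import unicodedata
import xml.etree.ElementTree as ET
from decimal import Decimal

# The contract from the scenario; \u escapes make the two spellings explicit.
DOC = (
    "<Kontrakt>"
    '<Posten w\u00e4hrung="EUR">23.45</Posten>'    # precomposed ä
    '<Posten wa\u0308hrung="EUR">45.00</Posten>'   # 'a' + combining diaeresis
    '<Posten w\u00e4hrung="USD">39.99</Posten>'
    "<Posten>99.00</Posten>"
    '<Posten monet\u00e4r-allianz="EUR">66.66</Posten>'
    "</Kontrakt>"
)

def nfc(text):
    """Normalize text to Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)

def sum_exact(root, attr, value):
    """Naive lookup: the attribute name must match code point for code point."""
    return sum(
        (Decimal(p.text) for p in root.iter("Posten") if p.get(attr) == value),
        Decimal("0"),
    )

def sum_normalized(root, attr, value):
    """Compare attribute names only after normalizing both sides to NFC."""
    wanted = nfc(attr)
    total = Decimal("0")
    for p in root.iter("Posten"):
        if any(nfc(name) == wanted and val == value
               for name, val in p.attrib.items()):
            total += Decimal(p.text)
    return total

root = ET.fromstring(DOC.encode("utf-8"))
print(sum_exact(root, "w\u00e4hrung", "EUR"))       # 23.45
print(sum_normalized(root, "w\u00e4hrung", "EUR"))  # 68.45
```

The exact lookup reproduces the wrong result, while the normalized lookup finds both Euro items. Decimal is used instead of float so the printed sums are exact.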

Lesson Learned:

When processing markup with diacritical marks, beware that two characters may visually appear the same but inside the computer they are represented very differently. Design XPath expressions accordingly -- use normalize-unicode() to convert markup into canonical form.

/Roger

Follow-Ups:
- Re: Here's how to process XML documents written in German
  - From: "Tony Graham" <tgraham@mentea.net>
- Re: Here's how to process XML documents written in German
  - From: Michael Kay <mike@saxonica.com>
- Re: Here's how to process XML documents written in German
  - From: David Lee <dlee@calldei.com>
