Subject: Cleaning up Word 2007 xml
Author: Paul Rayner
Date: 26 Jan 2009 10:20 AM

I'm currently evaluating Stylus Studio and marklogic.com's XML server with a view to using one or both within our company.

We need to clean up Word XML in several thousand documents, and I noticed a script at:


which makes a start towards this by combining 'runs'.

I am trying to make this script work within Stylus Studio, and have some problems I hope someone can help me with.

At the bottom of this post I have pasted my modified version of the script. It currently produces the following error: XPTY0004, at line 103:24, which is the space after '$this' on the return statement of the local:map function.

I'm really just getting started with this, and hope to end up generating Java code that can be run over a set of documents whenever needed. I'd appreciate any help anyone can give me.




declare namespace w="http://schemas.openxmlformats.org/wordprocessingml/2006/main";

declare function local:ml-update-document-xml($doc as element(w:document)) as element(w:document)
{
  local:dispatch($doc)
};

declare function local:passthru($x as node()) as node()
{
  for $i in $x/node() return local:dispatch($i)
};

declare function local:dispatch ($x as node()) as node()
{
  typeswitch ($x)
    case element(w:p) return local:mergeruns($x)
    default return (
      element{fn:name($x)} {$x/@*,local:passthru($x)}
    )
};

declare function local:mergeruns($p as element(w:p)) as element(w:p)
{
  let $pPrvals := if(fn:exists($p/w:pPr)) then $p/w:pPr else ()
  return element w:p{ $pPrvals, local:map($p/w:r[1]) }
};

declare function local:descend($r as element(w:r)?, $rToCheck as element(w:rPr)?) as element(w:r)*
{
  if(fn:empty($r)) then ()
  else if(fn:deep-equal($r/w:rPr,$rToCheck)) then
    ($r, local:descend($r/following-sibling::w:r[1], $rToCheck))
  else ()
};

declare function local:map($r as element(w:r)?) as element(w:r)
{
  if (fn:empty ($r)) then ()
  else
    let $rToCheck := $r/w:rPr
    let $matches := local:descend($r/following-sibling::w:r[1], $rToCheck)
    let $count := fn:count($matches)
    let $this := if ($count) then
      (element w:r{ $rToCheck,
        element w:t { fn:string-join(($r/w:t, $matches/w:t),"") } })
      else $r
    return ($this, local:map( if($count) then ($r/following-sibling::w:r[1 + $count]) else $r/following-sibling::w:r[1]))
};

let $document :=
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p>
      <w:r>
        <w:rPr>
          <w:i />
        </w:rPr>
        <w:t>Doctor Paul Pr</w:t>
      </w:r>
      <w:r>
        <w:rPr>
          <w:i />
        </w:rPr>
        <w:t>oteus, the man with the highe</w:t>
      </w:r>
      <w:r>
        <w:rPr>
          <w:i />
        </w:rPr>
        <w:t>st income in Ilium, drove his cheap and old Plymouth across the bridge to Homestead. </w:t>
      </w:r>
    </w:p>
  </w:body>
</w:document>
return local:ml-update-document-xml($document)

Subject: Cleaning up Word 2007 xml
Author: Minollo I.
Date: 26 Jan 2009 10:55 AM
I didn't go through the whole logic of the XQuery, but to make it pass the static type checks that DataDirect XQuery performs, you need two changes:

declare function local:ml-update-document-xml($doc as element(w:document)) as element(w:document)
{
  local:dispatch($doc) treat as element(w:document)
};

[Note the "treat as", which asserts that the node() returned by local:dispatch is an element(w:document); alternatively, you can change the return type to just node()]
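
To show the "treat as" point in isolation, here is a minimal sketch. The function names local:any and local:wrap and the doc element are made up for illustration and are not part of the script. It assumes a processor that applies static typing; a processor without static typing would likely accept the body without "treat as" as well.

```xquery
xquery version "1.0";

(: Hypothetical helper: declared to return the general type node(). :)
declare function local:any($n as node()) as node()
{
  $n
};

(: Hypothetical caller: promises the narrower type element(doc). :)
declare function local:wrap($n as element(doc)) as element(doc)
{
  (: Without "treat as", a statically typing processor only knows the
     call yields node() and may reject the body with XPTY0004.
     "treat as" does not convert anything; it asserts the type and
     raises a dynamic error if the value turns out not to match. :)
  local:any($n) treat as element(doc)
};

local:wrap(<doc/>)
```

The alternative mentioned in the note, declaring local:wrap as returning node(), avoids the assertion but gives callers less type information.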


declare function local:map($r as element(w:r)?) as element(w:r)?
{
  if (fn:empty ($r)) then ()
  else
    let $rToCheck := $r/w:rPr
    let $matches := local:descend($r/following-sibling::w:r[1], $rToCheck)
    let $count := fn:count($matches)
    let $this := if ($count) then
      (element w:r{ $rToCheck,
        element w:t { fn:string-join(($r/w:t/string(), $matches/w:t/string()),"") } })
      else $r
    return ($this, local:map( if($count) then ($r/following-sibling::w:r[1 + $count]) else $r/following-sibling::w:r[1]))
};

[Note that the function can return an empty sequence, so you need to add a "?" to the return type; note also the explicit use of string() on the arguments passed to string-join()]
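
For reference, the two changes can be put together into one complete query along the following lines. This is a sketch, not a tested drop-in. The sample document's w:body, w:p, w:r and w:rPr wrappers are inferred from how the functions navigate it. Two things are my own assumptions rather than part of the advice above: the return types of local:passthru and local:map are widened to "*", because a paragraph whose runs differ in formatting would yield more than one item, and "declare boundary-space strip" is added so the indentation in the inline sample does not produce whitespace text nodes.

```xquery
xquery version "1.0";

declare namespace w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

(: Assumption: ignore whitespace-only text between tags of the inline sample. :)
declare boundary-space strip;

declare function local:ml-update-document-xml($doc as element(w:document)) as element(w:document)
{
  (: Change 1: assert the narrower type of the dispatched result. :)
  local:dispatch($doc) treat as element(w:document)
};

(: Assumption: "*" because an element can have any number of children. :)
declare function local:passthru($x as node()) as node()*
{
  for $i in $x/node() return local:dispatch($i)
};

declare function local:dispatch($x as node()) as node()
{
  typeswitch ($x)
    case element(w:p) return local:mergeruns($x)
    (: As in the thread, the default branch only copes with element nodes. :)
    default return element { fn:name($x) } { $x/@*, local:passthru($x) }
};

declare function local:mergeruns($p as element(w:p)) as element(w:p)
{
  let $pPrvals := if (fn:exists($p/w:pPr)) then $p/w:pPr else ()
  return element w:p { $pPrvals, local:map($p/w:r[1]) }
};

(: Collect the consecutive runs whose w:rPr is deep-equal to $rToCheck. :)
declare function local:descend($r as element(w:r)?, $rToCheck as element(w:rPr)?) as element(w:r)*
{
  if (fn:empty($r)) then ()
  else if (fn:deep-equal($r/w:rPr, $rToCheck)) then
    ($r, local:descend($r/following-sibling::w:r[1], $rToCheck))
  else ()
};

(: Change 2: the result may be empty; "*" (assumption) also allows several runs. :)
declare function local:map($r as element(w:r)?) as element(w:r)*
{
  if (fn:empty($r)) then ()
  else
    let $rToCheck := $r/w:rPr
    let $matches := local:descend($r/following-sibling::w:r[1], $rToCheck)
    let $count := fn:count($matches)
    let $this :=
      if ($count) then
        element w:r {
          $rToCheck,
          (: string() gives string-join() the xs:string items it expects. :)
          element w:t { fn:string-join(($r/w:t/string(), $matches/w:t/string()), "") }
        }
      else $r
    return ($this,
            local:map(if ($count) then $r/following-sibling::w:r[1 + $count]
                      else $r/following-sibling::w:r[1]))
};

let $document :=
  <w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
    <w:body>
      <w:p>
        <w:r><w:rPr><w:i/></w:rPr><w:t>Doctor Paul Pr</w:t></w:r>
        <w:r><w:rPr><w:i/></w:rPr><w:t>oteus, the man with the highe</w:t></w:r>
        <w:r><w:rPr><w:i/></w:rPr><w:t>st income in Ilium, drove his cheap and old Plymouth across the bridge to Homestead. </w:t></w:r>
      </w:p>
    </w:body>
  </w:document>
return local:ml-update-document-xml($document)
```

On this sample, the three italic runs share the same w:rPr, so they should collapse into a single w:r containing one w:t with the full sentence.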

Subject: Cleaning up Word 2007 xml
Author: Paul Rayner
Date: 27 Jan 2009 03:35 AM
Thank you - that works on a simple document embedded in the query. I'm now going through the code to make it work on a full Word XML document.

Just a thought: has anyone done this before? Are there any existing scripts for cleaning up the rubbish in Word XML files using Stylus Studio?

