
Re: creating a collection from an archive

Subject: Re: creating a collection from an archive
From: "Michael Kay mike@xxxxxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Thu, 19 Apr 2018 21:09:23 -0000
Try renaming the .docx file with a .jar or .zip file extension and then using
it directly as the collection URI - Saxon should recognize it and give you
access to the contained files as a collection.
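As a rough sketch of that experiment (not tested against your document): the stylesheet below assumes the renamed copy sits at the hypothetical location file:///tmp/localtest.zip and is run with a named initial template (-it on the Saxon command line). The parameter name and the <parts>/<part> output elements are made up for illustration. The instance-of filter is there because an archive may hold entries that do not come back as XML documents; whether every entry loads cleanly will depend on what is in the archive.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <!-- Hypothetical location of the renamed .docx; adjust to suit. -->
  <xsl:param name="archiveURI" as="xs:string"
             select="'file:///tmp/localtest.zip'"/>

  <xsl:template name="xsl:initial-template">
    <parts>
      <!-- Keep only the collection items that are document nodes. -->
      <xsl:for-each
          select="collection($archiveURI)[. instance of document-node()]">
        <!-- Report each document's URI (empty if it has none)
             and the name of its outermost element. -->
        <part uri="{document-uri(.)}" root="{name(*)}"/>
      </xsl:for-each>
    </parts>
  </xsl:template>

</xsl:stylesheet>
```

If the uri attributes come back populated, that also answers the document-uri() part of the question.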

If that works, you could register your own CollectionFinder that subclasses
the StandardCollectionFinder and overrides the method isJarFileURI() to
recognize the file extension ".docx".

You can then either use collection() function to get the set of documents in
the ZIP file, or you can use uri-collection() to get their URIs, in a form
that you can supply as arguments to the doc() function.
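A minimal sketch of the uri-collection() route, under the same assumptions as before: a hypothetical archive location of file:///tmp/localtest.zip, and an initial template invoked by name. It also assumes the main body of the Word file is the entry whose URI ends in word/document.xml, which is the usual layout but is worth checking against the list of URIs you actually get back.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <!-- Hypothetical location of the archive; adjust to suit. -->
  <xsl:param name="archiveURI" as="xs:string"
             select="'file:///tmp/localtest.zip'"/>

  <xsl:template name="xsl:initial-template">
    <!-- URIs of the entries in the archive, without loading them. -->
    <xsl:variable name="uris" as="xs:anyURI*"
                  select="uri-collection($archiveURI)"/>

    <!-- Assumption: the main part is the entry ending in
         word/document.xml. -->
    <xsl:variable name="main" as="xs:anyURI*"
                  select="$uris[ends-with(string(.), 'word/document.xml')]"/>

    <!-- Load only the selected entry via doc() and copy it out. -->
    <xsl:for-each select="$main">
      <xsl:copy-of select="doc(.)"/>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

The point of going through uri-collection() rather than collection() is that nothing is parsed until you ask for it with doc(), so entries that are not XML can simply be skipped.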

You may also need to do something like
Configuration.registerFileExtension("doc", "application/xml") so that .doc
files are recognized as containing XML. Generally there's a lot of powerful
machinery in Saxon for customizing the way collections are handled.

Michael Kay
Saxonica

> On 19 Apr 2018, at 20:07, Graydon graydon@xxxxxxxxx <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:
>
> So I have a Word document, localtest.docx, which is in the 2016 strict
> version of the OOXML standard.  As such, it's a zip archive of a bunch
> of XML files.  I want to apply XSLT to the XML files.
>
> I could use the arch module and the collection function to write the whole
> thing to disk and then load it from disk as a collection before doing
> whatever to it and writing it to disk as an archive again, but this seems
> inefficient.  It would be better to read the archive into an in-memory
> collection, manipulate it, and then write that back out as an archive.
>
> I'm using XSLT 3.0 via Saxon 9.8.0.8 in oXygen.
>
> <xsl:variable name="wordArchive" as="document-node()+">
>   <xsl:variable name="arch" select="file:read-binary($wordArchiveURI)"/>
>   <xsl:variable name="entries" select="arch:entries($arch)"/>
>   <xsl:variable name="dirs" select="$entries[ends-with(.,'/')]"/>
>   <xsl:sequence select="for $x in ($entries except $dirs)
>                      return arch:extract-text($arch,$x) => parse-xml()" />
> </xsl:variable>
>
> works, in that I get a sequence of document nodes and those documents have
> the expected XML content.
>
> I don't get document nodes with associated document-uri() values or any of
> the rest of the archive structure.  Those URIs are in the values returned by
> arch:entries but I'm not seeing how I assign a document-uri value to a
> document node.  xsl:document doesn't seem to have a facility for assigning a
> document-uri value and of course you can't create an attribute whose parent
> is a document node even if document-uri was an attribute in the first place.
>
> What I want is a collection where the structure matches the Word archive,
> various subdirectories and all, and I can use the doc() function to access
> various component documents.  I can't shake the feeling that I'm missing
> something obvious, but this feeling is no help in discerning what the
> obvious thing is!
>
> Thanks!
> Graydon
