Re: deduplicating information in XML files

Subject: Re: deduplicating information in XML files
From: Wendell Piez <wapiez@xxxxxxxxxxxxxxx>
Date: Fri, 12 Oct 2012 09:11:25 -0400

Hi Robby,

You describe this as a de-duplication problem but as you present it,
it appears to be more of a merge.

And since you only need to do this once (correct?), it makes sense to
do it in stages, so that it can be checked along the way.

So I'd start by collecting all your FandB.xml files (use collection())
and create merged versions of them, something like:

<content>
  <was>
    <file>Product1_FandB.xml</file>
    <file>Product2_FandB.xml</file>
  </was>
  <meta>
    <id>product1</id>
    <id>product2</id>
  </meta>
  <body>
    <p>Suitable for high frequency applications due to fast switching characteristics</p>
    <p>Suitable for logic level gate drive sources</p>
  </body>
</content>

Assuming you have no issues with whitespace etc., this is a
straightforward grouping operation. But notice I've included some
extra information in the 'was' element -- the names of the old files.
(Use document-uri() to get these.)
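
A minimal sketch of that grouping step in XSLT 2.0 might look like the
following. It is untested and makes several assumptions: the source-dir
parameter and the merged/ output folder are hypothetical names, the
?select= collection syntax is specific to Saxon, and two bodies count as
"the same" when the whitespace-normalized text of their p elements matches.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Assumption: hypothetical folder holding the old *_FandB.xml files -->
  <xsl:param name="source-dir" select="'file:///path/to/features-benefits'"/>

  <xsl:template name="main">
    <!-- One group per distinct body; the key is the normalized text
         of the p elements, joined with newlines -->
    <xsl:for-each-group
        select="collection(concat($source-dir, '?select=*_FandB.xml'))/content"
        group-by="string-join(body/p/normalize-space(.), '&#10;')">
      <!-- One merged file per group: FandB_1.xml, FandB_2.xml, ... -->
      <xsl:result-document href="merged/FandB_{position()}.xml" indent="yes">
        <content>
          <was>
            <!-- Record the file name of every source document in the group -->
            <xsl:for-each select="current-group()">
              <file>
                <xsl:value-of
                    select="tokenize(document-uri(root(.)), '/')[last()]"/>
              </file>
            </xsl:for-each>
          </was>
          <meta>
            <xsl:copy-of select="current-group()/meta/id"/>
          </meta>
          <!-- The context item is the first member of the group -->
          <xsl:copy-of select="body"/>
        </content>
      </xsl:result-document>
    </xsl:for-each-group>
  </xsl:template>

</xsl:stylesheet>
```

With Saxon this would be started from the named template (for example
with -it:main); the relative href in xsl:result-document resolves against
the base output URI, so an output location has to be supplied.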

Now you have a static set of new FandB files, you can check them to
make sure they're good. (I bet you will discover there are things to
fix and possibly a need to go back, fix the sources, and run the merge
again.)

Once these are good, you need to reconstruct the links in your map
files. You can do this by indexing into a collection of the content
elements you just created, something like this:

<xsl:variable name="merged-contents"
select="collection('/path/pointing/to/newFandBcontents')"/>

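<xsl:template match="features-benefits-ref">
  <xsl:copy>
    <!-- look up the merged file whose 'was' list names the old file
         referenced by @href (compared by file name only) -->
    <xsl:attribute name="href"
select="document-uri($merged-contents[//was/file =
  tokenize(normalize-space(current()/@href), '/')[last()]])"/>
  </xsl:copy>
</xsl:template>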

Once you're done with this (and have checked the output to see that
it's correct) you can scrub the 'was' element from your merged files,
and you're done.
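
The scrubbing step could be sketched as an identity transform with one
empty template, run over each merged file. This is an illustration only;
it assumes the 'was' element is always a direct child of 'content', as in
the sample above.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copy every node and attribute unchanged -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Empty template: drop the temporary bookkeeping element -->
  <xsl:template match="content/was"/>

</xsl:stylesheet>
```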

I'd leave the meta/id elements in place. It's true that they are
redundant, but redundancy can be useful for QA. (Indeed, you could
write code to check them and confirm they are correct.)
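
One possible check, sketched here under assumptions, is that no product id
ends up in more than one merged file. The merged-dir parameter is a
hypothetical name, the ?select= collection syntax is specific to Saxon, and
the report format is invented for illustration.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Assumption: hypothetical folder holding the merged files -->
  <xsl:param name="merged-dir" select="'file:///path/to/merged'"/>

  <xsl:template name="main">
    <!-- Group all id elements across the merged files by their value -->
    <xsl:for-each-group
        select="collection(concat($merged-dir, '?select=*.xml'))/content/meta/id"
        group-by="normalize-space(.)">
      <!-- An id that occurs more than once was merged into several files -->
      <xsl:if test="count(current-group()) gt 1">
        <xsl:value-of select="concat('Duplicate id ', current-grouping-key(),
            ' in: ',
            string-join(current-group()/string(document-uri(root(.))), ', '),
            '&#10;')"/>
      </xsl:if>
    </xsl:for-each-group>
  </xsl:template>

</xsl:stylesheet>
```

An empty report means every id occurs only once; it does not show that
each id sits in the right file, which would need a comparison against the
original sources.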

Note: untested!

Cheers,
Wendell

On Fri, Oct 12, 2012 at 8:02 AM, Robby Pelssers <Robby.Pelssers@xxxxxxx>
wrote:
> Hi all,
>
> This time I have a rather challenging task at hand.  Let me first describe
> the use case.  We have lots of product information stored in XML.  Some of
> that information describes
> - Technical applications
> - Features and benefits
> - Technical summary
>
> One of the problems is a lot of products had e.g. the same features and
> benefits as they are of the same product family or group.  But as we stored
> that info per product it got duplicated.  Now we want to deduplicate that info
> by generating DITA maps and topics (both are just XML).  Now for simplicity
> let's assume we generate the following content for product1 and product2.  The
> goal is to get from INPUT to OUTPUT by checking if the body of the linked
> topics are duplicates, next create 1 generic topic and rewrite the links in
> the map to point to that single topic.  I have XSLT / XQuery (XMLDB) and Java
> at my disposal to get the job done.  I'm not sure what will be the easiest way
> to get the job done.  Keep also in mind that my INPUT will contain a few 1000
> files (maps and linked topics) and I will need to deduplicate the whole set
> ;-)
>
> Thx upfront for any input,
> Robby
>
> INPUT
>
> Product1_map.xml
> <map>
>   <features-benefits-ref href="features-benefits/Product1_FandB.xml "/>
> </map>
>
> Product1_FandB.xml:
> <content>
>   <meta>
>     <id>product1</id>
>   </meta>
>   <body>
>     <p>Suitable for high frequency applications due to fast switching characteristics</p>
>     <p>Suitable for logic level gate drive sources</p>
>   </body>
> </content>
>
> Product2_map.xml
> <map>
>   <features-benefits-ref href="features-benefits/Product2_FandB.xml "/>
> </map>
>
> Product2_FandB.xml:
> <content>
>   <meta>
>     <id>product2</id>
>   </meta>
>   <body>
>     <p>Suitable for high frequency applications due to fast switching characteristics</p>
>     <p>Suitable for logic level gate drive sources</p>
>   </body>
> </content>
>
> Expected output:
>
> Product1_map.xml
> <map>
>   <features-benefits-ref href="features-benefits/FandB_1.xml "/>
> </map>
>
> Product2_map.xml
> <map>
>   <features-benefits-ref href="features-benefits/FandB_1.xml "/>
> </map>
>
> FandB_1.xml:
> <content>
>   <meta>
>     <id><!-- can become empty --></id>
>   </meta>
>   <body>
>     <p>Suitable for high frequency applications due to fast switching characteristics</p>
>     <p>Suitable for logic level gate drive sources</p>
>   </body>
> </content>

--
Wendell Piez | http://www.wendellpiez.com
XML | XSLT | electronic publishing
Eat Your Vegetables
_____oo_________o_o___ooooo____ooooooo_^
