
deduplicating information in XML files

Subject: deduplicating information in XML files
From: Robby Pelssers <Robby.Pelssers@xxxxxxx>
Date: Fri, 12 Oct 2012 14:02:40 +0200
Hi all,

This time I have a rather challenging task at hand.  Let me first describe the
use case.  We have lots of product information stored in XML.  Some of that
information describes:
- Technical applications
- Features and benefits
- Technical summary

One of the problems is that a lot of products have, e.g., the same features
and benefits because they belong to the same product family or group.  But
because we stored that info per product, it got duplicated.  Now we want to
deduplicate that info by generating DITA maps and topics (both are just XML).
For simplicity, let's assume we generate the following content for product1
and product2.  The goal is to get from INPUT to OUTPUT by checking whether the
bodies of the linked topics are duplicates, then creating one generic topic
and rewriting the links in the maps to point to that single topic.  I have
XSLT / XQuery (XMLDB) and Java at my disposal, but I'm not sure which will be
the easiest way to get the job done.  Also keep in mind that my INPUT will
contain a few thousand files (maps and linked topics) and I will need to
deduplicate the whole set ;-)

Thx upfront for any input,
Robby  

INPUT

Product1_map.xml
<map>
  <features-benefits-ref href="features-benefits/Product1_FandB.xml"/>
</map>

Product1_FandB.xml:
<content>
  <meta>
    <id>product1</id>
  </meta>
  <body>
    <p>Suitable for high frequency applications due to fast switching
characteristics</p>
    <p>Suitable for logic level gate drive sources</p>
  </body>
</content>

Product2_map.xml
<map>
  <features-benefits-ref href="features-benefits/Product2_FandB.xml"/>
</map>

Product2_FandB.xml:
<content>
  <meta>
    <id>product2</id>
  </meta>
  <body>
    <p>Suitable for high frequency applications due to fast switching
characteristics</p>
    <p>Suitable for logic level gate drive sources</p>
  </body>
</content>

Expected output:

Product1_map.xml
<map>
  <features-benefits-ref href="features-benefits/FandB_1.xml"/>
</map>

Product2_map.xml
<map>
  <features-benefits-ref href="features-benefits/FandB_1.xml"/>
</map>

FandB_1.xml:
<content>
  <meta>
    <id><!-- can become empty --></id>
  </meta>
  <body>
    <p>Suitable for high frequency applications due to fast switching
characteristics</p>
    <p>Suitable for logic level gate drive sources</p>
  </body>
</content>
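
One possible way to get from INPUT to the expected output with plain Java is
sketched below.  It is only a sketch under several assumptions: the topic
files are well-formed XML, two bodies count as duplicates when they match
after whitespace is collapsed and comments are ignored, and generic topics
are numbered in the order their first member is seen.  The class name
TopicDedup and the methods canonicalBody, planRenames and rewriteMap are
hypothetical names made up for this illustration.  Topics and maps are passed
in as strings so the example stays self-contained; reading and writing the
actual files is left out.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; none of these names come from the original post.
class TopicDedup {

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    // Appends a whitespace-insensitive form of the node to out.
    // Assumption: comments and processing instructions do not matter.
    static void canonicalize(Node node, StringBuilder out) {
        short type = node.getNodeType();
        if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            out.append(node.getNodeValue().trim().replaceAll("\\s+", " "));
        } else if (type == Node.ELEMENT_NODE) {
            // Sort attributes by name so their order cannot affect the result.
            Map<String, String> attrs = new TreeMap<>();
            NamedNodeMap raw = node.getAttributes();
            for (int i = 0; i < raw.getLength(); i++) {
                attrs.put(raw.item(i).getNodeName(), raw.item(i).getNodeValue());
            }
            out.append('<').append(node.getNodeName()).append(attrs).append('>');
            for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
                canonicalize(c, out);
            }
            out.append("</").append(node.getNodeName()).append('>');
        }
    }

    // Comparison key for a topic: the canonical form of its first <body>.
    static String canonicalBody(String topicXml) {
        Node body = parse(topicXml).getElementsByTagName("body").item(0);
        if (body == null) {
            throw new IllegalArgumentException("topic has no <body> element");
        }
        StringBuilder out = new StringBuilder();
        canonicalize(body, out);
        return out.toString();
    }

    // Maps every original topic href to the href of its generic topic.
    // Topics with equal canonical bodies share one generic href.
    static Map<String, String> planRenames(Map<String, String> topicXmlByHref,
                                           String genericPrefix) {
        Map<String, String> genericByBody = new LinkedHashMap<>();
        Map<String, String> renames = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : topicXmlByHref.entrySet()) {
            String key = canonicalBody(e.getValue());
            String generic = genericByBody.get(key);
            if (generic == null) {
                generic = genericPrefix + (genericByBody.size() + 1) + ".xml";
                genericByBody.put(key, generic);
            }
            renames.put(e.getKey(), generic);
        }
        return renames;
    }

    // Returns the parsed map with every known href replaced by its generic
    // href. Whitespace around href values is trimmed before the lookup.
    static Document rewriteMap(String mapXml, Map<String, String> renames) {
        Document doc = parse(mapXml);
        NodeList all = doc.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            Element el = (Element) all.item(i);
            String target = renames.get(el.getAttribute("href").trim());
            if (target != null) {
                el.setAttribute("href", target);
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, String> topics = new LinkedHashMap<>();
        topics.put("features-benefits/Product1_FandB.xml",
                "<content><body><p>Same text</p></body></content>");
        topics.put("features-benefits/Product2_FandB.xml",
                "<content><body> <p>Same   text</p> </body></content>");
        System.out.println(planRenames(topics, "features-benefits/FandB_"));
    }
}
```

With a few thousand files this is a single pass: each topic is parsed once and
looked up in a hash map, so no pairwise comparison of topics is needed.  If
the canonical strings take too much memory, a digest of each string could
serve as the key instead.  Writing out each generic topic (with the emptied
id) and serializing the rewritten maps are not shown.  The same grouping idea
should also be expressible in XSLT 2.0 or XQuery, for example by comparing
bodies with deep-equal().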
