RE: storing XML files

From: J C Theriot <theriot@p...>
To: Dave Simmonds <daves@t...>, xml-dev <xml-dev@l...>
Date: Wed, 10 Oct 2001 12:27:31 -0500

Play the video

Dave,

Since elements may be recursive (element Foo may be a valid descendent of
element Foo) and can have elements with unbounded occurrences, in the
general case it is not possible to physically represent the document type as
a finite number of rows in an XPath table (or even by an infinite number of
rows of finite size). This might be the case with the 'previously unknown'
types, even if the 'known' document types are more simple.

On the other hand, since a given document always has a finite number of data
values, it can always be represented by a finite number of xpath/value
pairs;  this requires no knowledge of the schema or dtd (other than what is
needed to parse the document). You can still share the valid xpaths of a
document type across valid documents of the document type, but each time you
read a new document there is the possibility that you will need to create
more xpath records for its document type (for example, if previous documents
had 'Foo' elements nested up to 5 deep, and the current document has them
nested 6 deep).

Jim
---------------------------
Jim Theriot
mailto:Jim.Theriot@p...
POSC -- Energy eStandards
9801 Westheimer, Suite 450
Houston TX USA 77042
+1 713 267 5109 : phone
+1 713 784 9219 : fax
---------------------------

From: Dave Simmonds [mailto:daves@t...]
Sent: Wednesday, October 10, 2001 10:57 AM
To: xml-dev
Subject: RE:  storing XML files

I honestly don't know if someone has come up with this before, so please
enlighten me as appropriate.  I've been given the task of mapping XML into
an Access database format.  The main requirement is that the database
structure must not change significantly as new previously unknown types of
XML documents are added.  This requirement in place so that we can create a
relatively simple client application that can count on a certain db
structure.

In solving this problem I've come up with a base of approximantly 5 tables
that are needed.  In order to accomplish this task - each XML document must
be able to validate against a Schema (or DTD) and the db must support
crosstab/pivot type queries.

Here are the five tables.

The basic concept is the create a unique id for each possible piece of data
that can be stored relative to any given Schema.  Since an XPath can provide
this mapping we simply parse the Schema to get our XPath's and map it to our
id.  Each XPath can have any set of constraints specified and a set of
possible values associated (enumeration) with it.  With this setup, a
document simply becomes a simple 1 to many relationship - 1 document has
many XPath's - each XPath with a correspnding value.
With this structure, when a new type of document comes along all we need to
do is parse it's Schema and then add the document.  The underlying db
structure won't need to change.
The example above doesn't show all the possible constraints that can be put
on any given XPath, but could easily be extended to do so.
As far as the SQL is concerned, if we want to get a recordset of all objects
of 1 particular type (Schema) we perform a crosstab/pivot query that will
give us columns of XPaths filled will the corresponding values.  In other
words, 1 row in the recordset becomes 1 document.
With this method, I believe it's possible to map any well formed xml with a
DTD/Schema to an RDBMS.
Dave

-----Original Message-----
From: Ronald Bourret [mailto:rpbourret@r...]
Sent: Wednesday, October 10, 2001 1:13 AM
To: Albena Georgieva; xml-dev
Subject: Re:  storing XML files

Albena Georgieva wrote:
> Ronald: What do you mean by "semi-structured"_ness of the data?

Structured data is very rigid. That is, all records have the same
fields. An example of structured data is a telephone book -- every entry
has a name, an address, and a telephone number.

Semi-structured data is data that has some structure, but is not rigidly
structured. An example of semi-structured data is a health record. For
example, one patient might have a list of vaccinations, another might
have height and weight, another might have a list of operations they
have had. Other examples of semi-structured data are legal documents and
geneological records.

A common example in industry is data that has been retrieved from many
different sources. For example, if you ask for all the data about
customer X, you might get a sales history, emails, stock profile, and so
on. However, the set of data will differ greatly from customer to
customer.

Semi-structured data is difficult to store in a relational database
because it means you either have many different tables (which means many
joins and slow retrieval time) or a single table with many null columns.
Semi-structured data is very easy to store as XML and is a good fit for
a native XML database.

> <!--     EXAMPLE -->
>
> <?xml version="1.0" encoding="ISO-8859-1" ?>
> <!DOCTYPE nitf SYSTEM "nitf-adjusted-13c.dtd">
> <nitf>
> <head>
> <meta name="onlinefolder" content="Entertainment"></meta>
> <docdata>
> <date.issue norm="20010302 150629+0100"></date.issue>
> <key-list>
>                 <keyword key="Maxima"></keyword>
>                 <keyword key="Princess"></keyword>
>         </key-list>
> </docdata>
> </head>
> <body>
> <body.head>
> <hedline><hl1>Maxima speaks perfect dutch</hl1></hedline>
> </body.head>
> <body.content>
> <p>Princess Maxima presented the Princess Collection 2000 in an estate
with
> a very royal ambiance. The dazzling collection is also presented in the
new
> catalogue full of show, glitter and glamour.</p>
> </body.content>
> </body>
> </nitf>
>
> <!--     /EXAMPLE -->
>
> All I want to do with them for now, is to save them in a RDBMS. The rest
of
> the applications (ASPs or servlets) will access that data through ODBC or
> JDBC.
...
> If the applications access the datafeeds through ODBC or JDBC and they
don't
> ask for XML format, ( no need for XML retrieval ) I see no reason for
> introducing a XML native database at this point, but please correct me I
am
> wrong ...

These two requirements -- storing NITF in a relational database and not
returning any XML -- conflict with each other. If you don't want to
return XML, that means no XML can be stored in the database. But if you
map NITF to a relational database using the most widely accepted mapping
that doesn't store XML in the database (an object relational mapping),
you will end up with an almost useless set of tables.

The problem is that NITF uses mixed content and mixed content doesn't
map well with an object-relational mapping. (I won't go into the details
here. If you want to read more about this, see sections 3.3 and 3.4 of
http://www.xml.com/pub/a/2001/05/09/dtdtodbs.html.)

Furthermore, as far as I know, the only product that supports mixed
content in an object-relational mapping is mine (XML-DBMS) and I
recommend that people don't use XML-DBMS with mixed content because it
is so inefficient. (XML-DBMS started out as a research project and I
supported mixed content for completeness. Were I to do it over again, I
probably wouldn't support it.)

Your choices therefore are:

1) Store your documents in a native XML database.

2) Store your documents as BLOBs in a relational database and index them
as Soumitra suggests. (For an example of this technique, see the
discussion of side tables in section 6.2 of
http://www.rpbourret.com/xml/XMLAndDatabases.htm.)

3) Store the documents in a relational database with object-relational
middleware. However, instead of storing mixed content in multiple
tables, store it as XML in a single column. For example, you would store
the content of the <body.content> element in a single column as:

   <p>Princess Maxima presented the Princess Collection 2000
   in an estate with a very royal ambiance. The dazzling
   collection is also presented in the new catalogue full
   of show, glitter and glamour.</p>

Notice the <p> elements. Not all middleware products support this
technique, but some do. Another problem with this choice is that it is
not clear if non-mixed-content parts of the NITF DTD can even be mapped
to a useful set of tables. With some transformations, it might be
possible, but I'm not sure.

In any case, your applications are almost certainly going to have to
deal with some XML. Otherwise, there doesn't seem to be any way to deal
with the mixed content.

> I just need a good way to transform a XML data feeds into relational
> database model. That is why, I asked for something like DTD2SQL or XML2SQL
> tool. So, I imagined applying XSLT transformations to every different
> datafeed to transform it into some DTD for SQL and then just run that SQL.

This is possible. That is, you could write an XSLT transformation that
transformed your documents into a series of tables. For example,
transform a document with the form:

   <table1.row>
      <table1.column1>...</table1.column1>
      <table1.column2>...</table1.column2>
      <table2.row>
         <table2.column1>...</table2.column1>
         <table2.column2>...</table2.column2>
         <table2.column3>...</table2.column3>
      </table2.row>
   </table1.row>

to the form:

   <table1.row>
      <table1.column1>...</table1.column1>
      <table1.column2>...</table1.column2>
   </table1.row>
   <table2.row table1.key="...">
      <table2.column1>...</table2.column1>
      <table2.column2>...</table2.column2>
      <table2.column3>...</table2.column3>
   </table2.row>

It would then be easy to insert these rows into tables. (Note that there
is no need for a standard DTD here. The correspondence of elements to
tables and columns is inherent in the structure. To be truly generic,
all that would be needed is a simple mapping file that mapped element
and attribute names to table and column names.)

The data transfer code is easy to write, although the stylesheets
probably aren't. (I'm actually surprised that nobody has written a
product that takes an XML-to-DBMS map and generates the above
stylesheets. It wouldn't be the most efficient way to transfer data due
to the XSLT transform, but it has the advantage of simplicity.)

In any case, the code and the stylesheets aren't the problem. The
problem is the one mentioned above -- that the NITF DTD simply doesn't
map well to relational schema due to its mixed content.

-- Ron

-----------------------------------------------------------------
The xml-dev list is sponsored by XML.org <http://www.xml.org>, an
initiative of OASIS <http://www.oasis-open.org>

The list archives are at http://lists.xml.org/archives/xml-dev/

To subscribe or unsubscribe from this elist use the subscription
manager: <http://lists.xml.org/ob/adm.pl>

References:
- RE: storing XML files
  - From: Dave Simmonds <daves@t...>

Prev by Date: RE: storing XML files
Next by Date: RE: Re: W3C ridiculous new policy on patents
Previous by thread: RE: storing XML files
Next by thread: Re: storing XML files
Index(es):
- Date
- Thread

PURCHASE STYLUS STUDIO ONLINE TODAY!

Purchasing Stylus Studio from our online shop is Easy, Secure and Value Priced!

Download The World's Best XML IDE!

Accelerate XML development with our award-winning XML IDE - Download a free trial today!

Subscribe in XML format

RSS 2.0
Atom 0.3

Stylus Studio has published XML-DEV in RSS and ATOM formats, enabling users to easily subcribe to the list from their preferred news reader application.

Stylus Studio Sponsored Links are added links designed to provide related and additional information to the visitors of this website. they were not included by the author in the initial post. To view the content without the Sponsor Links please click here.

XML Editor - Download a 15 Day Free Trial Now >

See What's New in Stylus Studio >

Buy Stylus Studio - XML Editor - Now >