On Fri, Mar 05, 1999 at 09:37:29AM -0800, Jerome McDonough wrote:
> At 02:17 PM 3/5/1999 +1100, Marcelo Cantos wrote:
> >>"Jeffrey E. Sussna" wrote:
> >>
> >> There is not (AFAIK) yet any such thing as an XDBMS (though you
> >> could consider a file system of XML documents plus a web server
> >> to resolve URL's to those documents as such a thing).
> >
> >I am continually surprised to hear remarks such as this. SIM _is_
> >an XDBMS (it is also an SGML, MARC, RTF, etc. database with
> >structure and full content query capabilities).
>
> I think one of the reasons you hear these kinds of remarks is that
> the terminology surrounding these systems is used differently by
> different folks. For instance, from what I know of SIM, I wouldn't
> call it a DBMS system of any kind, as I don't believe (I could be
> wrong) it supports referential integrity constraints, concurrency
> control, recoverable transactions, and other features I would expect
> out of a reasonable DBMS. Granted it has hooks that allow you to
> get it to work with a DBMS that can provide all that, but that
> doesn't make SIM itself a DBMS. I would instead class SIM as an
> information retrieval system, and a pretty damned good one at that.
> However, SIM performs as well as it does in great part because it's
> not doing the extra work that a DBMS should do, and which add
> greatly to retrieval time from database systems (as well as limiting
> their ability to handle complex data formats gracefully).

Thank you, Jerome, for the candid and quite fair assessment of SIM.

On the point of referential integrity, you are quite right: there is no built-in support. With our new event hook mechanism (similar to the triggers found in most relational systems), however, one will be able to attach event handlers to various update operations and prevent them from completing in the event of a referential integrity violation.
This probably wouldn't work together with concurrency controls (though this will be moot when transaction support comes in). However, in one particular project, we have put in referential integrity control using a single query per reference as part of the check-in mechanism. Another project only generates references dynamically at query time, effectively with a single reverse-reference index lookup.

The problem with referential integrity checking is that sometimes you need to be able to manage broken data, and this is more often the case with documents than with the more typical applications of RDBMS technology (financial transactions, etc.). Of course, when you store whole documents instead of unnaturally breaking them up into millions of tiny pieces, you don't have nearly the same referential integrity problems in the first place.

With respect to concurrency control you are mistaken. We support short term locks, which prevent individual records, at least, from ever entering an undefined state under concurrent loads. These locks can be held as long as desired, but cannot persist beyond the lifetime of a session. Long term locks (which outlive the session) are in the offing, and stand a good chance of getting into release 3.0 (scheduled for mid-year, I think -- it could be earlier).

Transactions we most definitely do not support. We do, however, provide recovery through log files, which record server activity and can be played back in a batch load operation. It's a little crude (you make the server read-only, back it up, and start a new log file; when you crash, you restore the last backup and replay the log), but it is safe and effective.

More important than any specifics, however, is the issue of what you call a DBMS. To me, a DBMS is a database management system (seems painfully obvious, but I think it bears repeating). You may argue that a product is not a DBMS if it does not support feature X, and I don't entirely disagree.
When one talks of a DBMS one is conjuring up a certain image in the mind of the listener, and that image may well include feature X. To be fair to SIM, however, the essence of a DBMS is that it manages a collection of data. If it doesn't support transactions, this does not entail that it does not manage data. Rather, it simply has limits on the way the data is managed (i.e. it doesn't manage data as well as one would like).

You clearly believe that transaction support is part of the essence of what makes a DBMS. I disagree; indeed, I profoundly disagree. There is nothing in the concept of a database that mandates any such requirement. Rather, I would say that transaction support is an important issue for any _good_ DBMS. Likewise for referential integrity and concurrency (and, for that matter, support for declarative queries, use of indexes, a rich set of fundamental data types, etc.). If I recall correctly, dBase III was generally acknowledged to be a DBMS though it lacked most of these requirements, and could barely even call itself relational!

Now, don't get me wrong here. I am not trying to defend SIM by deprecating the features you demand. They are very important and highly desirable features in a DBMS (the fact that they are amazingly difficult to do well is of no concern to the user). Their absence in SIM is of ongoing concern to us. Furthermore, it is far from satisfying to be able to insist that SIM fits into a strict, minimalist definition of a DBMS if it lacks features that are typically associated with DBMSs.

One of the primary reasons they are not in at this stage is that, as you pointed out so well, the primary focus of SIM has always been performance and scalability, and all of the aforementioned features can have a significant impact on performance if implemented naively (transaction support, in particular, is an onerous requirement, though by no means untenable).

SIM is not a full featured DBMS.
But it is not a mere information retrieval system either. It does support recovery (though not full transaction support), it does support concurrency, and it can be coerced to support referential integrity. It also bears mentioning that you don't have to talk out to an RDBMS to do any of these things. In fact, the only use I have heard of for our ODBC capability is one client who wanted to access a personnel database for authentication purposes (it had nothing to do with the database server per se).

I guess this all boils down to what's in a name. At the end of the day, it is far more important to know what a product does and does not do than what you call it.

> This isn't to knock SIM; anyone who needs a flexible information
> retrieval system should be taking a very serious look at it. The
> Z39.50 support alone puts it way ahead of the market as far as I'm
> concerned. But I don't think SIM is evidence that there are DBMS
> systems that handle SGML/XML well; I don't think they do. Oracle
> may very well be getting there with its latest release, but I
> suspect there's still a lot of work to be done there.

I am sceptical that any RDBMS vendor can come to the party in terms of performance. Past attempts to force text into a relational, table or object based paradigm have not reaped great success (Oracle's ConText comes to mind as an example of how forcing a square peg into a round hole requires sacrificing the edges of performance). I would be surprised if any of the major database vendors were prepared to venture away from their core competency (the relational model) to address the performance issues.

But why parse XML to split it up into tables when you can store the XML directly? Why build thousands of index entries to system-generated element IDs so that you can do joins to build up an XML fragment, when you can build a single index and pull the fragment in its entirety out of the document from which it comes?
Why use inferior content indexing technology taking up to 10 to 20 times the size of the data being indexed when you can use compressed inverted files which take between 15% (document level index) and 50% (multi-level word position index) the size of the data? And all this with faster update speed than many standard text retrieval systems.

There is an additional overhead in the relational paradigm which has nothing to do with transactions, concurrency control, or referential integrity checking. That cost is that relational tables do not map cleanly onto hierarchical documents (or data collections, to pick up on another thread). Every fragment you insert, update, or remove has to be taken apart to map it onto some underlying representation, modified piece by piece, and then reassembled to be delivered.

I strongly disagree that SIM doesn't handle SGML/XML well. In the five years of successfully selling SIM, no customer has ever replaced SIM with another product. In fact none of them have even mentioned to us that they ever considered replacing SIM. This in itself is remarkable given that, because our customers use SIM to store their SGML/XML natively, they can get the data out of SIM much more easily than if it were mapped onto some proprietary internal database format.

People buy SIM because it is flexible enough to do whatever they need to do with their XML/SGML. It doesn't force them to adopt a non-XML/SGML approach. It doesn't force them to translate their data into some proprietary format in order to interact with the data. It deals directly with the XML. Precisely what the original post was asking for, in fact.

Cheers,
Marcelo

P.S.: Some thanks go to my colleague, Tim Arnold-Moore, for providing some of the content (including the closing) for this article.

--
http://www.simdb.com/~marcelo/

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@i...
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To (un)subscribe, mailto:majordomo@i... the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo@i... the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa@i...)