
Re: VTD-XML an open-source, high-performance and non-extractive XML processing API

  • To: XML Developers List <xml-dev@l...>
  • Subject: Re: VTD-XML an open-source, high-performance and non-extractive XML processing API
  • From: Michael Champion <michaelc.champion@g...>
  • Date: Tue, 18 Oct 2005 10:01:19 -0700
  • In-reply-to: <4354C0AA.2020502@m...>
  • References: <bbe40a480510171950m6fb4db02u7d913de27e936d14@m...> <4354C0AA.2020502@m...>

On 10/18/05, Elliotte Harold <elharo@m...> wrote:

>
> On VTD-XML itself, I read on the web site that "Currently it only
> supports built-in entity references (&quot; &amp; &apos; &gt; &lt;)."
> That means it's not an XML parser. Given this, the comparisons you make
> to other parsers are unfair and misleading. I've seen many products that
> outperform real XML parsers by subsetting XML and cutting out the hard
> parts. It's often the last 10% that kills the performance. :-(

Well, they do say right up front: "VTD-XML is a non-validating,
'non-extractive' XML processing software API implementing Virtual
Token Descriptor. Currently it only supports built-in entity
references (&quot; &amp; &apos; &gt; &lt;)."  Arguably an XML
processing API doesn't have to be a real XML parser *if* the subset it
supports is clearly stated.  I would have to agree that in principle
"XML" should be used to refer only to the full spec, but that battle
was lost years ago -- SOAP implicitly subsets XML, RSS is often not
well-formed (and thus not "XML"), but this distinction is lost on the
vast majority of XML technology users who do not subscribe to xml-dev.
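
A concrete illustration of the "hard part" Elliotte mentions: the
sketch below (a hypothetical test document, parsed with the standard
JAXP/SAX API) declares an entity in the internal DTD subset.  A
conforming XML parser must expand &co; when it reports the attribute;
a processor that only knows the five built-in entities cannot, so it
either rejects the document or hands back the wrong text.

    // Hedged sketch: standard JAXP/SAX only; the test document is invented.
    import java.io.StringReader;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.InputSource;
    import org.xml.sax.helpers.DefaultHandler;

    public class InternalSubsetCheck {
        public static void main(String[] args) throws Exception {
            String doc =
                "<!DOCTYPE order [ <!ENTITY co \"Example Corp\"> ]>" +
                "<order customer=\"&co;\"/>";
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.newSAXParser().parse(
                new InputSource(new StringReader(doc)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String local,
                                             String qName, Attributes atts) {
                        // A full parser prints "Example Corp" here; an
                        // entity-subset processor has no way to produce it.
                        System.out.println(qName + " customer=" +
                                           atts.getValue("customer"));
                    }
                });
        }
    }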

As with most things in life, people need to pick their poison.  Given
the efficiency issues, is it better to subset XML and efficiently
process something that looks a lot like real XML with tools such as
VTD-XML, to build a more fully conformant Efficient XML Interchange
(the sanitized term for what we used to call "binary XML"), or to
lower customer expectations about performance and bandwidth
consumption?  None of these options is palatable, but people have to
choose which is least toxic to their own scenario.

>
> The other question I have for anything claiming these speed gains is
> whether it correctly implements well-formedness testing, including the
> internal DTD subset. Will VTD-XML correctly report all malformed
> documents as malformed?

>
> Finally, even if everything works out once the holes are plugged,  this
> seems like it would be slower than SAX/StAX for streaming use cases.
> VTD, like DOM, needs to read the entire document before it can work on
> any of it.

I think the point is that the process that creates the XML can confirm
that it is well-formed / valid and produce a VTD associated with the
document/message; downstream processes that understand VTD can then
exploit it.  Those that do not understand VTD can simply use the XML
text.  Yes, this requires a level of trust in the producer that pure
XML text processing does not require.  I've always seen this as
hitting a sweet spot (for *some* use cases!) between text XML and
binary XML, where the designers of an application decide that the cost
of verifying that the producer got the XML right outweighs the
benefits of catching the errors.  We can argue about how common those
scenarios are, of course, but at any point in the processing chain a
specific component can ignore the VTD and parse the XML to verify
whatever needs to be verified.
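
To make that trade-off concrete, here is a minimal sketch of the
general idea.  It is not the actual VTD-XML API (whose class names I
haven't checked), just an illustration of a non-extractive token
index; TokenIndex and Token are made-up names.  The producer records
each token as an offset/length pair into the original bytes, and a
consumer that trusts the producer slices the buffer instead of
reparsing.

    // Hypothetical non-extractive token index; names are invented,
    // not VTD-XML's own classes.
    import java.nio.charset.StandardCharsets;
    import java.util.List;

    public class TokenIndex {
        public enum Kind { ELEMENT_NAME, ATTR_NAME, ATTR_VALUE, TEXT }

        /** One token: a type plus an offset/length into the untouched XML bytes. */
        public record Token(Kind kind, int offset, int length) {}

        private final byte[] xml;          // the XML text, forwarded unchanged
        private final List<Token> tokens;  // built (and checked) by the producer

        public TokenIndex(byte[] xml, List<Token> tokens) {
            this.xml = xml;
            this.tokens = tokens;
        }

        /** Return the raw text of token i without re-scanning the document. */
        public String text(int i) {
            Token t = tokens.get(i);
            return new String(xml, t.offset(), t.length(), StandardCharsets.UTF_8);
        }
    }

A consumer that does not trust the producer simply ignores the index
and parses the XML text itself, which is exactly the fallback
described above.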

Obviously VTD doesn't reduce the size of the XML transmitted, so it
doesn't meet the use cases that the W3C XBC / EXI folks are focused
on.  On the other hand, it sounds promising for messaging scenarios
with multiple intermediaries that do routing, filtering, DSig
verification, and perhaps encryption: raw XML parsing is quite
expensive, but each intermediary could use the VTD to quickly find the
offsets in the message that it knows/cares about.  Obviously that
doesn't work at all for infinite streams of XML.
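
Building on the hypothetical TokenIndex sketch above (same invented
names, same caveats), an intermediary that only cares about a routing
field might do something like this:

    // Illustrative only: the message, the offsets, and the idea that
    // token 0 is the routing field are all assumptions for the demo.
    import java.nio.charset.StandardCharsets;
    import java.util.List;

    public class RoutingDemo {
        public static void main(String[] args) {
            byte[] msg = "<env><to>inventory</to><body>...</body></env>"
                    .getBytes(StandardCharsets.UTF_8);
            // Index produced (and well-formedness-checked) upstream:
            // the text of <to> starts at byte 9 and runs for 9 bytes.
            TokenIndex idx = new TokenIndex(msg,
                    List.of(new TokenIndex.Token(TokenIndex.Kind.TEXT, 9, 9)));
            String route = idx.text(0);  // "inventory", no reparse of msg
            System.out.println("route to: " + route);
        }
    }

The point is only that the intermediary touches a couple of byte
ranges rather than re-tokenizing the whole message; the real VTD
layout and navigation calls are not shown here.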

Overall, my concern is that we as an industry should neither look for
magic fixes that solve all known efficiency problems (which arguably
the W3C is about to futilely attempt) nor reject approaches, e.g.
VTD, that pluck some low-hanging fruit but don't handle all use cases.
