Re: String interning (WAS: SAX2/Java: Towards a final form)

  • From: David Megginson <david@m...>
  • To: Miles Sabin <msabin@c...>
  • Date: 14 Jan 2000 13:32:50 -0500

Miles Sabin <msabin@c...> writes:

> One is where SAX isn't sitting on top of a parser (this is Arkin's
> worry). Instead it's generating SAX events from a DOM tree, java
> reflection, or some other data structure, a JDBC query perhaps.

I was very concerned about this use case at first, but my concerns
lessened a bit once I started to consider implementation details.

If I'm writing a filter, where do the strings for the names I'm
passing on come from?

Well, most of the time, they'll come from upstream in the filter
chain, so they're already interned and I don't have to worry about
it.

Now, let's say that instead I want to introduce my own names, so that
(for example) I rename every "foo" element to "bar".  Good news!  The
string literal that I use in my code for "bar" is already interned
automatically, so there's nothing to worry about.
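
A minimal sketch of that point, not code from any real filter: the
class name and the rename() helper are hypothetical, and the only
thing it relies on is the Java language rule that string literals are
interned.

```java
public class LiteralInterning {
    // Hypothetical helper: rename "foo" to "bar", pass other names through.
    static String rename(String name) {
        // Identity comparison is deliberate; it assumes the incoming
        // name was interned upstream.
        return (name == "foo") ? "bar" : name;
    }

    public static void main(String[] args) {
        String renamed = rename("foo");
        // The literal "bar" is the same object as its interned form.
        System.out.println(renamed == "bar");            // true
        System.out.println(renamed == renamed.intern()); // true
    }
}
```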

The only problem comes if my filter is reading names dynamically from
an external source, like a database or a non-XML text file, and
introducing them into the filter stream: in that case, the filter
would be required to invoke some kind of interning function for all of
the names.  
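
As a rough illustration of that case, assuming the interning function
is simply java.lang.String.intern(): the class name and the
internAll() helper below are made up for the example, and the
"external source" is simulated by building a string at run time.

```java
import java.util.ArrayList;
import java.util.List;

public class DynamicNames {
    // Hypothetical helper: intern every name read from an external
    // source before it is passed down the filter chain.
    static List<String> internAll(List<String> rawNames) {
        List<String> interned = new ArrayList<>(rawNames.size());
        for (String raw : rawNames) {
            interned.add(raw.intern());
        }
        return interned;
    }

    public static void main(String[] args) {
        // Simulates a name built at run time (e.g. read from a database):
        // equal in content to the literal, but a different object.
        String fromDb = new String("salary".toCharArray());
        System.out.println(fromDb == "salary");          // false
        System.out.println(fromDb.intern() == "salary"); // true
    }
}
```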

Note that this applies only when element or attribute *names* are
being read from the external source, not when attribute values or
character data content is.  For example, imagine that I have some
database tables that I'm always going to dump into the same XML
structure:

  <employee id="E12345">
   <name>David Megginson</name>
   <position>Grand Poohbah</position>
   <salary>Underpaid</salary>
  </employee>

There's no problem with interning here, because the string literals
that my filter uses for "employee", "name", "position", and "salary"
are already interned by the Java VM.

Iterating over a DOM, on the other hand, is a legitimate problem.
Every DOM implementation worth its salt will have interned all element
and attribute names (a DOM tree is big enough already), but there's no
way to be sure of that in the general case, or to be sure that the
names are == the results of java.lang.String.intern().  Too bad the
DOM level one Java binding didn't require that.
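
One defensive approach is for the code that walks the DOM to intern
each name itself before emitting the event.  The sketch below is only
an illustration under several assumptions: it uses the ContentHandler
signatures from the final SAX2 release rather than the draft discussed
in this thread, it ignores attributes, text, and namespaces, and the
DomToSax class and its methods are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

public class DomToSax {
    // Walk the tree depth-first, interning each element name before
    // handing it to the handler (attributes and text are skipped).
    static void walk(Node node, ContentHandler handler) throws SAXException {
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return;
        }
        String name = node.getNodeName().intern();
        handler.startElement("", name, name, new AttributesImpl());
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, handler);
        }
        handler.endElement("", name, name);
    }

    // Parse a document into a DOM, walk it, and collect the element
    // names exactly as a downstream handler would receive them.
    static List<String> elementNames(String xml) {
        final List<String> seen = new ArrayList<>();
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            walk(doc.getDocumentElement(), new DefaultHandler() {
                @Override
                public void startElement(String uri, String local,
                                         String qName, Attributes atts) {
                    seen.add(qName);
                }
            });
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return seen;
    }
}
```

With the names interned at this point, a downstream handler can
compare them against string literals with ==, whatever the DOM
implementation did internally.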

> The other scenario is mine (multiple parsers running over
> arbitrary documents in multiple threads) where the global
> String.intern() map is a point of contention. I won't bore
> everyone with the details again.

I'm much more skeptical about this one, because there are so many
preconditions:

1. you have to have many SAX parsers running in many threads on the
   same system;

2. the SAX parsers have to be reused over and over in a
   time-critical environment;

3. the XML documents being processed have to be extremely
   heterogeneous, or else each parser will have seen most of the
   available names after the first five or ten documents; AND

4. the rest of the parsing process has to be fast and interning has
   to be slow enough that there's serious contention for the interning
   Hashtable even when each parser is looking up only 20-30 names
   (perhaps fewer) for each parse.

If all of these conditions arise at the same time (and I question #3
and #4), then perhaps over-all XML parsing might slow down by 1-2%; if
the actual XML parsing represents even as much as 30% of the
processing time (the rest is taken by whatever the ContentHandler
callbacks do with the information), that's at most a 0.6% slowdown
over all (2% of 30%) under these circumstances.

Granted, the potential speedup for other apps probably isn't much
greater, but since the vast majority of SAX apps will not meet the
above criteria, and since the penalty when one does meet these
criteria is so small, it makes sense not to penalize everyone else.
If there's any real concern, I think, it's the DOM scenario.

[snip big case statement example]

> To be honest, tho', I don't see any particular reason why the
> SAX API should be expected to support this sort of code.

How about running in a tight loop?

  int len = atts.getLength();
  for (int i = 0; i < len; i++) {
    String uri = atts.getURI(i);
    String name = atts.getName(i);
    // == is deliberate: it relies on names and URIs being interned.
    if (uri == "http://www.w3.org/1999/02/22-rdf-syntax-ns#") {
      if (name == "about") {
        // do something
      } else if (name == "ID") {
        // do something
      } else if (name == "aboutEach") {
        // do something
      }
    } else if (uri == "http://www.w3.org/1999/xhtml") {
      if (name == "href") {
        // do something
      } else if (name == "class") {
        // do something
      } else if (name == "name") {
        // do something
      }
    }
  }


All the best,


David

-- 
David Megginson                 david@m...
           http://www.megginson.com/

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@i...
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ or CD-ROM/ISBN 981-02-3594-1
Please note: New list subscriptions now closed in preparation for transfer to OASIS.


