Re: "Heap" of trouble handling input file of 500 MByte

Subject: Re: "Heap" of trouble handling input file of 500 MByte
From: Michael Kay <mike@xxxxxxxxxxxx>
Date: Tue, 22 Feb 2011 09:02:30 +0000
> IIRC, some time back the recommendation used to be 10x, Mike?
> If that's correct, what's changed, please? Just Saxon getting smarter?

I think I used to say 10x before the TinyTree came along, but that's a very long time ago. Since the introduction of the TinyTree, any improvements have been relatively minor (e.g. whitespace compression). 4x is probably the best you'll achieve, but I've seen a number of people report that figure. A more detailed sizing (assuming no attribute nodes, no type information, no backwards navigation, and no keys) is:

19 bytes per element node
19 bytes for a whitespace text node
19 + 2x bytes for a non-whitespace text node, where x is the number of characters

It's not unusual to see documents where most of the lines are, say, 40 characters long, and account for one element, one whitespace text node, and one 20-character text node. That means 40 bytes of source translate to 19 + 19 + (19 + 2 x 20) = 97 bytes of TinyTree space, giving an expansion factor of about 2.4.
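The arithmetic for that typical line can be checked with a small sketch. This is only an illustration of the per-node figures quoted above, under the same assumptions (no attribute nodes, no type information, no backwards navigation, no keys); the function name `tinytree_bytes` and its parameters are hypothetical and are not part of any Saxon API.

```python
# Per-node costs as quoted in the message. They hold only under its stated
# assumptions: no attributes, no type info, no backwards navigation, no keys.
NODE_OVERHEAD_BYTES = 19   # every element or text node
BYTES_PER_CHARACTER = 2    # character data of a non-whitespace text node


def tinytree_bytes(elements, whitespace_text_nodes, text_node_lengths):
    """Estimate TinyTree space in bytes (hypothetical helper, not a Saxon API).

    text_node_lengths holds the character count of each non-whitespace
    text node.
    """
    element_cost = NODE_OVERHEAD_BYTES * elements
    whitespace_cost = NODE_OVERHEAD_BYTES * whitespace_text_nodes
    text_cost = sum(
        NODE_OVERHEAD_BYTES + BYTES_PER_CHARACTER * length
        for length in text_node_lengths
    )
    return element_cost + whitespace_cost + text_cost


# The typical 40-byte source line: one element, one whitespace text node,
# and one text node of 20 characters.
source_bytes = 40
estimate = tinytree_bytes(1, 1, [20])
expansion = estimate / source_bytes

print(estimate)             # 97
print(f"{expansion:.1f}")   # 2.4
```

Feeding in real node counts for a document (for example, gathered by a streaming SAX pass) would give a rough lower bound for the tree alone, not for the total heap a transformation needs.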

In my IEEE Data Engineering paper a couple of years ago, at http://sites.computer.org/debull/A08dec/saxonica.pdf, I measured the memory occupied by the 100 MByte XMark test document at 327 MByte (an expansion factor of about 3.3), and this agreed well with the theoretical sizing.
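As a rough illustration, the expansion factors mentioned in this message can be scaled to the 500 MByte input from the subject line. This is a sketch under a loud assumption: that the tree size grows linearly with source size, which holds only if the larger document has a similar mix of nodes. It estimates the tree alone, not the full JVM heap. The names `FACTORS` and `projected_tree_mb` are made up for this example.

```python
# Expansion factors taken from the message; the labels are illustrative only.
FACTORS = {
    "typical 40-character line": 97 / 40,   # theoretical sizing example
    "XMark measurement": 327 / 100,         # 327 MByte tree for 100 MByte source
    "4x rule of thumb": 4.0,
}


def projected_tree_mb(source_mb, factor):
    """Scale a source size by an expansion factor (assumes linear growth)."""
    return source_mb * factor


source_mb = 500  # input size from the subject line
projections = {
    label: projected_tree_mb(source_mb, factor)
    for label, factor in FACTORS.items()
}

for label, tree_mb in projections.items():
    print(f"{label}: about {tree_mb:.1f} MByte")
```

On these figures the tree for a 500 MByte input would land somewhere between roughly 1.2 and 2 GByte, depending on how text-heavy the document is.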

Michael Kay
