
Re: Processing huge XML files


Hi all,

Many thanks for your valuable advice. Let me give you more information
about my case. We need high-performance access to the individual parsed
data values in the file. The access is not totally random: we know the
access patterns of our specific application. So it would be good to have
an efficient persistent data structure for the parsed XML data. Ideally,
the data structure would be generic enough (with respect to both the XML
schema and the access patterns) to support fast data access. At the very
least, we are looking for a way to implement a data structure customized
for a specific XML schema and a defined access pattern. I am looking at
the different technologies that some of you have suggested. Other
suggestions are most welcome.

Thanks again,
Thomas
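
To make the request above concrete, here is a minimal sketch of a persistent structure customized for one schema and one access pattern (lookup of a record by key). It is only an illustration under stated assumptions: the element name `item`, the key attribute `id`, and the helper names `build_index` and `lookup` are hypothetical, and SQLite is simply one convenient on-disk store, not something proposed in the thread.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET

def build_index(source, db_path, record_tag, key_attr):
    """Stream the XML once and store each record on disk, keyed by an attribute.

    record_tag and key_attr are assumptions about the schema.
    """
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS records (key TEXT PRIMARY KEY, xml TEXT)"
    )
    with con:  # commit once at the end of the pass
        for _, elem in ET.iterparse(source, events=("end",)):
            if elem.tag == record_tag:
                elem.tail = None  # do not serialize text that follows the record
                con.execute(
                    "INSERT OR REPLACE INTO records VALUES (?, ?)",
                    (elem.get(key_attr), ET.tostring(elem, encoding="unicode")),
                )
                # Free the record's content; an empty shell stays attached
                # to its parent until parsing ends.
                elem.clear()
    return con

def lookup(con, key):
    """Fetch one record by key without re-parsing the whole file."""
    row = con.execute(
        "SELECT xml FROM records WHERE key = ?", (key,)
    ).fetchone()
    return ET.fromstring(row[0]) if row else None

# Hypothetical sample data standing in for the huge file.
SAMPLE_ITEMS = b"""<catalog>
  <item id="a"><price>3</price></item>
  <item id="b"><price>7</price></item>
</catalog>"""

con = build_index(io.BytesIO(SAMPLE_ITEMS), ":memory:", "item", "id")
price_b = lookup(con, "b").findtext("price")
```

With a real file, `db_path` would be a path on disk so that the index survives between runs. A different access pattern (range scans, lookups by child value) would call for a different table layout.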

Rick Jelliffe wrote:
> From: "Michael Kay" <michael.h.kay@n...>
> 
>>But really, when you get above 50Mb or so, you need to start looking at
>>XML databases. 
> 
> 
> Another approach is to use streaming languages such as Perl and OmniMark
> (and, I guess, Python?), especially if you are not updating the data, just
> extracting information.
> 
> Of course, you may need to take several passes.  And you may need to
> have one pass of the data generate a program to be used for the next
> pass, a venerable technique that is often overlooked.  But multiple
> passes with streaming languages is the way that many large scale
> publishing systems work.  A lot can depend on whether your document
> has an order that is amenable to your application: storing metadata
> and keys before the data in particular. 
> 
> A very typical way of constructing streaming programs on large 
> data sets is to do two passes:
>   1) Run over the data and extract all information that will be needed for 
>     decisions that otherwise require random access or lookahead.
>   2) Run over the data and perform the extractions/analysis, using the
>     decision points. 
> 
> Cheers
> Rick Jelliffe
> 
> 
> -----------------------------------------------------------------
> The xml-dev list is sponsored by XML.org <http://www.xml.org>, an
> initiative of OASIS <http://www.oasis-open.org>
> 
> The list archives are at http://lists.xml.org/archives/xml-dev/
> 
> To subscribe or unsubscribe from this list use the subscription
> manager: <http://lists.xml.org/ob/adm.pl>
> 
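
The two-pass technique Rick describes in the quoted message could be sketched roughly as follows. This is an illustration, not anyone's actual implementation: the `order` and `customer` elements, their attributes, and the helper names are hypothetical, and it assumes the records are direct children of the root element. The sample deliberately puts the customer metadata after the orders, which is the case where a single streaming pass would need lookahead.

```python
import io
import xml.etree.ElementTree as ET

def stream_elements(source, tag):
    """Yield each completed <tag> element without keeping the whole tree."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield elem
            root.clear()  # drop children that have already been processed

def two_pass(open_source):
    """open_source is a callable returning a fresh readable stream each time."""
    # Pass 1: extract the information needed for later decisions.
    with open_source() as src:
        wanted = {
            c.get("id")
            for c in stream_elements(src, "customer")
            if c.get("region") == "HK"
        }
    # Pass 2: perform the extraction, using the decision data from pass 1.
    totals = {}
    with open_source() as src:
        for order in stream_elements(src, "order"):
            cid = order.get("customer")
            if cid in wanted:
                totals[cid] = totals.get(cid, 0.0) + float(order.get("amount"))
    return totals

# Hypothetical sample data; the customer records follow the orders.
SAMPLE = b"""<data>
  <order customer="c1" amount="10.5"/>
  <order customer="c2" amount="4.0"/>
  <order customer="c1" amount="2.0"/>
  <customer id="c1" region="HK"/>
  <customer id="c2" region="UK"/>
</data>"""

totals = two_pass(lambda: io.BytesIO(SAMPLE))
print(totals)
```

For a real file, `open_source` would be something like `lambda: open(path, "rb")`. If the document stored the customer metadata before the orders, one pass would be enough, which is the point made above about document order.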


-- 
   Thomas Y.T. LEE
   Chief Technology Officer
   Center for E-Commerce Infrastructure Development (CECID)
   Department of Computer Science and Information Systems
   The University of Hong Kong
   E-mail: ytlee@c...  URL: http://www.cecid.hku.hk
   Tel: +852 22415388  Fax: +852 25474611
   Room 301, Chow Yei Ching Building
   Pokfulam Road, Hong Kong SAR, China

