
Re: Paper with an order of magnitude speed increase for parsing

  • From: "Liam R. E. Quin" <liam@fromoldbooks.org>
  • To: Rick Jelliffe <rjelliffe@allette.com.au>, xml-dev <xml-dev@l...>
  • Date: Wed, 02 Jan 2019 12:29:33 -0500

On Wed, 2019-01-02 at 22:00 +1100, Rick Jelliffe wrote:
> [...]
>  has the old answer of preprocessing files through grep (etc) to find
> candidates now respectable again?

The tradeoff for XML is generally that reading the file twice (once to
work out whether you want to parse it, and once to parse it) is likely
to be slower than parsing in many cases, depending on the data
structures you build. But there are things that can change this:

* a persistent index, e.g. a full-text database, can sometimes answer
the question of which XML file(s) to load without having to load them:
the index can be much smaller than the files, and/or can exploit Zipf's
Law to look at only a fraction of the index. But if you're going to do
this, why not use what i call a fast-forest store, perhaps with an
XQuery interface?

* On a multi-CPU system, if you have millions of tiny XML files, a
thread that pre-reads the files will make parsing go much faster, as
the next file will usually be in the disk cache. (At one time my text
retrieval system took advantage of this, but i had to remove it to
support Microsoft Windows years ago - it's very platform-specific.)

* Modern server storage in some cases is faster than main memory, or as
fast as the bus speed. So a disk cache is pointless, and scanning files
might be cheap. But this storage is expensive, so using it for a
database index may make more sense.
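
As a rough illustration of the pre-reading thread in the second point
above, here is a minimal Python sketch. It is not the code from the text
retrieval system mentioned there; the function names warm_cache and
parse_with_prefetch are made up for this example, and it assumes the
operating system keeps recently read files in its cache, which is
exactly the platform-specific part.

```python
import threading
import xml.etree.ElementTree as ET


def warm_cache(paths, chunk_size=1 << 20):
    """Read each file and discard the bytes, so the OS may keep them cached."""
    for path in paths:
        with open(path, "rb") as f:
            while f.read(chunk_size):
                pass


def parse_with_prefetch(paths):
    """Parse files in order while a helper thread reads the same files.

    Hypothetical helper: the reader thread only touches the files; whether
    that helps depends on the OS disk cache and on the storage hardware.
    """
    paths = list(paths)
    reader = threading.Thread(target=warm_cache, args=(paths,), daemon=True)
    reader.start()
    roots = [ET.parse(path).getroot() for path in paths]
    reader.join()
    return roots
```

Nothing here guarantees the reader stays ahead of the parser; with many
tiny files it usually will, because reading is much cheaper than
parsing, but that is an assumption to check on the target system.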

So in the end you have to measure.
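
One way to do that measuring, as a hedged sketch in Python: time
"parse everything" against "scan the raw bytes first, then parse only
the candidates". All names here (has_element, parse_all,
scan_then_parse, timed) are invented for the example. The byte scan
assumes an ASCII-compatible encoding and unprefixed element names; it
can give false positives (the tag text inside a comment, say), which
the parse step then filters out.

```python
import time
import xml.etree.ElementTree as ET


def has_element(path, tag):
    """Parse the file fully and report whether any <tag> element exists."""
    root = ET.parse(path).getroot()
    return next(root.iter(tag), None) is not None


def parse_all(paths, tag):
    """Baseline: parse every file."""
    return [p for p in paths if has_element(p, tag)]


def scan_then_parse(paths, tag):
    """grep-style prefilter: read each file once as bytes, parse only hits.

    Assumption: files are UTF-8 (or another ASCII-compatible encoding)
    and the element name is written without a namespace prefix.
    """
    needle = b"<" + tag.encode("utf-8")
    candidates = []
    for p in paths:
        with open(p, "rb") as f:
            if needle in f.read():
                candidates.append(p)
    return [p for p in candidates if has_element(p, tag)]


def timed(fn, *args):
    """Return (result, elapsed seconds) for one call of fn."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start
```

Calling timed(parse_all, paths, "title") and
timed(scan_then_parse, paths, "title") on a representative set of files
gives two numbers to compare. Run each more than once: the first pass
also pays for getting the files into the disk cache, so a single run
mostly measures which function went first.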

Reading the actual paper, 4,000 lines of C to parse JSON more quickly
seems a lot, especially when part of the motivation is that loading
into Hadoop is slow. But it's a research paper. See also
https://blog.cloudera.com/blog/2017/02/performance-comparing-of-different-file-formats-and-storage-engines-in-hadoop-file-system/

Big data analytics, in which large amounts of data might never be
parsed, is likely a very different beast from technical documentation
(say), where every file is probably read (from an XML db or from disk)
many times more often than it's written.

Best,

Liam



-- 
Liam Quin, https://www.holoweb.net/liam/cv/
Web slave for vintage clipart http://www.fromoldbooks.org/
Available for XML/Document/Information Architecture/
XSL/XQuery/Web/Text Processing/A11Y work & consulting.


