Re: Paper with an order of magnitude speed increase for parsing
On Wed, 2019-01-02 at 22:00 +1100, Rick Jelliffe wrote:
> [...]
> has the old answer of preprocessing files through grep (etc) to find
> candidates now respectable again?

The tradeoff for XML is generally that reading the file twice (once to work out whether you want to parse it, and once to parse it) is likely to be slower than simply parsing it in many cases, depending on the data structures you build.

But there are things that can change this:

* A persistent index, e.g. a full-text database, can sometimes answer the question of which XML file(s) to load without having to load them: the index can be much smaller than the files, and/or can exploit Zipf's Law to look at only a fraction of the index. But if you're going to do this, why not use what i call a fast-forest store, perhaps with an XQuery interface?

* On a multi-CPU system, if you have millions of tiny XML files, a thread that pre-reads the files will make parsing go much faster, as the next file will usually be in the disk cache. (At one time my text retrieval system took advantage of this, but i had to remove it to support Microsoft Windows years ago - it's very platform-specific.)

* Modern server storage in some cases is faster than main memory, or as fast as the bus speed. So a disk cache is pointless, and scanning files might be cheap. But this storage is expensive, so using it for a database index may make more sense.

So in the end you have to measure.

Reading the actual paper, 4,000 lines of C to parse JSON more quickly seems a lot, especially when part of the motivation is that loading into Hadoop is slow. But it's a research paper.
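As a rough illustration of what "measure" could look like for the grep-first question, here is a minimal Python sketch that compares parsing every file against a cheap byte scan followed by parsing only the candidates. The function names (`parse_all`, `scan_then_parse`, `timed`) are made up for this example, and it assumes small UTF-8 files, element names without namespace prefixes, and Python's standard `xml.etree.ElementTree` parser. It reads each file once and parses from memory rather than reading twice, so it shows the tradeoff only for files small enough to hold in memory.

```python
import re
import time
import xml.etree.ElementTree as ET
from pathlib import Path


def parse_all(paths, tag):
    """Parse every file; keep those containing at least one <tag> element."""
    hits = []
    for p in paths:
        root = ET.parse(p).getroot()
        if root.tag == tag or root.find(f".//{tag}") is not None:
            hits.append(p)
    return hits


def scan_then_parse(paths, tag):
    """Cheap byte scan first (the 'grep' pass); parse only the candidates."""
    # Assumption: ASCII-compatible encoding and no namespace prefix on the tag.
    needle = re.compile(rb"<" + re.escape(tag.encode()) + rb"[\s/>]")
    hits = []
    for p in paths:
        data = Path(p).read_bytes()
        if not needle.search(data):
            continue  # rejected without building a tree
        # The scan can give false positives (comments, CDATA), so confirm.
        root = ET.fromstring(data)
        if root.tag == tag or root.find(f".//{tag}") is not None:
            hits.append(p)
    return hits


def timed(fn, *args):
    """Return (result, elapsed seconds) for one call of fn."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start
```

Calling `timed(parse_all, paths, "title")` and `timed(scan_then_parse, paths, "title")` on the same list of paths gives two timings to compare. Which one wins will depend on how many files match, how large they are, and whether they are already in the disk cache, so run it on a representative sample and more than once.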
See also, on file formats and storage engines in Hadoop:
https://blog.cloudera.com/blog/2017/02/performance-comparing-of-different-file-formats-and-storage-engines-in-hadoop-file-system/

Big data analytics, in which large amounts of data might never be parsed, is likely a very different beast from technical documentation (say), where every file is probably read (from an XML db or from disk) many times more often than it's written.

Best,

Liam

-- 
Liam Quin, https://www.holoweb.net/liam/cv/
Web slave for vintage clipart http://www.fromoldbooks.org/
Available for XML/Document/Information Architecture/
XSL/XQuery/Web/Text Processing/A11Y work & consulting.